Open tomasonjo opened 3 years ago
I have started working on this. Right now I am playing with the NLP (using spaCy for now, in the hope of getting some GPU acceleration out of it) and will chime in when I have something to show.
Default NLP models kind of suck at extracting scientific concepts... I'll try to get the fine-tuned https://github.com/allenai/scibert working again on the https://github.com/allenai/scibert/tree/master/data/ner/sciie dataset.
If you are experienced with training NLP models in spaCy, that would also be cool!!!
Agreed re: NLP + technical terms. Maybe if we are going to do some sort of NLP as an input to the graph, we might want to consider a different graph.
We could do a generic one with a news dataset... For arXiv, we could take a look at the biomedical world; there are like a trillion pre-trained models for that... probably because of covid:
https://github.com/dmis-lab/biobert https://www.johnsnowlabs.com/spark-nlp-health/
Also, I think that spaCy v3 will support transformers, so we could use any of those models as well: https://huggingface.co/models?filter=token-classification
I'll work on this next week; hopefully something cool will come out. We'll probably have to fine-tune a model using the SciIE or SciERC dataset... perhaps -> https://github.com/markus-eberts/spert
An idea for a blog post series would be the following:
1. Use the Kaggle arXiv Dataset: https://www.kaggle.com/Cornell-University/arxiv
2. Import only one category of articles.
3. Extract scientific concepts with NLP.
4. Create a timeline of how the field progressed over time, looking at scientific concepts.
5. Find clusters of related concepts (similarity + community detection algos).
6. Visualize the results in Bloom (1.5 has a new feature for coloring communities).
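
To make the plan a bit more concrete, here is a rough Python sketch of the import, timeline and similarity parts. It is only an illustration under a few assumptions: the Kaggle dump is read as JSON lines with `categories` (space-separated), `update_date` (`YYYY-MM-DD`) and `abstract` fields; `extract_concepts` is a hypothetical placeholder that matches a hard-coded vocabulary and stands in for whatever NLP model we end up with; and the Jaccard score is just one possible similarity measure, not necessarily the one we would use in the graph.

```python
import json
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical vocabulary; the real pipeline would use a trained NER model instead.
CONCEPTS = ["graph neural network", "attention", "knowledge graph"]


def extract_concepts(text):
    """Placeholder for the NLP step: return the known concepts found in the text."""
    lowered = text.lower()
    return {concept for concept in CONCEPTS if concept in lowered}


def load_category(lines, category):
    """Yield (year, abstract) for records tagged with the given arXiv category."""
    for line in lines:
        record = json.loads(line)
        # Assumption: 'categories' is a space-separated string such as "cs.CL cs.LG".
        if category in record["categories"].split():
            # Assumption: 'update_date' looks like "YYYY-MM-DD".
            yield int(record["update_date"][:4]), record["abstract"]


def concept_timeline(papers):
    """Count, per year, how many papers mention each concept."""
    timeline = defaultdict(Counter)
    for year, abstract in papers:
        timeline[year].update(extract_concepts(abstract))
    return timeline


def jaccard_similarity(papers):
    """Jaccard similarity of concept pairs, based on the papers that mention them."""
    mentions = defaultdict(set)
    for paper_id, (_, abstract) in enumerate(papers):
        for concept in extract_concepts(abstract):
            mentions[concept].add(paper_id)
    return {
        (a, b): len(mentions[a] & mentions[b]) / len(mentions[a] | mentions[b])
        for a, b in combinations(sorted(mentions), 2)
    }


# Tiny made-up sample in place of the real metadata file.
SAMPLE = [
    '{"categories": "cs.CL cs.LG", "update_date": "2018-05-01", '
    '"abstract": "Attention for knowledge graph completion."}',
    '{"categories": "cs.CL", "update_date": "2019-03-10", '
    '"abstract": "We study attention in transformers."}',
    '{"categories": "math.CO", "update_date": "2019-07-02", '
    '"abstract": "A knowledge graph of theorems."}',
]

papers = list(load_category(SAMPLE, "cs.CL"))
timeline = concept_timeline(papers)
similarity = jaccard_similarity(papers)
```

With the real dump, `SAMPLE` would be replaced by an open file handle, so the file is streamed line by line rather than loaded into memory. The similarity pairs could then be written to the graph as weighted relationships, which is what a community detection algorithm would run on.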