cj2001 / neo4j-gds-book


ArXiv blog post #1

Open tomasonjo opened 3 years ago

tomasonjo commented 3 years ago

An idea for a blog post series would be the following:

1. Use the Kaggle arXiv dataset: https://www.kaggle.com/Cornell-University/arxiv
2. Import only one category of articles.
3. Extract scientific concepts with NLP.
4. Create a timeline of how the field progressed over time, looking at scientific concepts.
5. Find clusters of related concepts (similarity + community detection algos).
6. Visualize results in Bloom (1.5 has a new feature for coloring of communities).
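To make the timeline and clustering steps a bit more concrete, here is a rough pure-Python sketch on toy data. Everything in it is an assumption for illustration: the `papers` list is made up, the function names are hypothetical, and connected components over thresholded Jaccard scores is only a simple stand-in for the similarity and community detection algorithms we would actually run in GDS.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: (publication year, concepts extracted from one paper).
papers = [
    (2017, {"attention", "transformer"}),
    (2018, {"attention", "transformer", "language model"}),
    (2018, {"graph embedding", "random walk"}),
    (2019, {"transformer", "language model"}),
    (2019, {"graph embedding", "random walk", "node classification"}),
]


def concept_timeline(papers):
    """Count, per concept, how many papers mention it in each year."""
    counts = defaultdict(lambda: defaultdict(int))
    for year, concepts in papers:
        for concept in concepts:
            counts[concept][year] += 1
    return {c: dict(sorted(by_year.items())) for c, by_year in counts.items()}


def jaccard_pairs(papers, threshold=0.5):
    """Jaccard similarity of concepts, based on the sets of papers mentioning them."""
    mentions = defaultdict(set)
    for idx, (_, concepts) in enumerate(papers):
        for concept in concepts:
            mentions[concept].add(idx)
    pairs = {}
    for a, b in combinations(sorted(mentions), 2):
        score = len(mentions[a] & mentions[b]) / len(mentions[a] | mentions[b])
        if score >= threshold:  # keep only sufficiently similar pairs
            pairs[(a, b)] = score
    return pairs


def communities(pairs):
    """Group concepts into connected components of the similarity graph (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    # Concepts with no similar partner never enter `parent`, so they are omitted.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


if __name__ == "__main__":
    print(concept_timeline(papers)["transformer"])
    print(communities(jaccard_pairs(papers)))
```

In the real pipeline the concept sets would come from the NLP step, and the similarity and community steps would run inside the graph rather than in Python; the sketch is only meant to show the shape of the data flowing between steps.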

cj2001 commented 3 years ago

I have started working on this. Right now I am experimenting with the NLP (using spaCy for now, since I hope to get GPU acceleration out of it) and will chime in when I have something to show.

tomasonjo commented 3 years ago

Default NLP models kind of suck at extracting scientific concepts... I'll try to get a fine-tuned SciBERT (https://github.com/allenai/scibert) working again on the SciIE NER dataset (https://github.com/allenai/scibert/tree/master/data/ner/sciie).

If you are experienced with training NLP models in spaCy, that would also be cool!

cj2001 commented 3 years ago

Agreed re: NLP + technical terms. Maybe if we are going to do some sort of NLP as an input to the graph we might want to consider a different graph.

tomasonjo commented 3 years ago

We could do a generic one with a news dataset... For arXiv we could take a look at the biomedical world; there are like a trillion pre-trained models for that, probably because of COVID:

https://github.com/dmis-lab/biobert https://www.johnsnowlabs.com/spark-nlp-health/

Also, I think that spaCy v3 will support transformers, so we could use any of those models as well: https://huggingface.co/models?filter=token-classification

I'll work on this next week; hopefully something cool will come out. I'll probably have to fine-tune a model using the SciIE or SciERC dataset... perhaps -> https://github.com/markus-eberts/spert