ericpan64 closed this issue 3 years ago
Workflow, in order:
[x] Download the metadata.csv file from the Kaggle competition website. The 9th column of this file contains the abstracts, and the file has ~341k entries, so that many COVID-19 abstracts. For simplicity we can just use the abstracts for now; full text can be an extension/future item mentioned in the conclusion
[x] Parse out the abstracts
```
pip install csvtool
csvtool -c 9 metadata.csv > abstracts.txt
```
(see the csvtool docs)
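If csvtool is finicky, pandas can do the same extraction. A minimal sketch, assuming the abstract column is named `abstract` in metadata.csv's header (it's the 9th column):

```python
import pandas as pd

# metadata.csv is large (~341k rows), so load only the one column we need.
df = pd.read_csv("metadata.csv", usecols=["abstract"])

# Drop rows with no abstract and write one abstract per line.
abstracts = df["abstract"].dropna().tolist()
with open("abstracts.txt", "w") as f:
    for abstract in abstracts:
        f.write(abstract.replace("\n", " ") + "\n")
```

The `replace` flattens multi-line abstracts so the output stays one abstract per line.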
[x] Python script using spaCy and scispaCy models to extract biomedical entities. Pick a pretrained model to use.
Models to consider here.
Now the hard part (a minimal extraction sketch is below):
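A minimal sketch of the extraction step, assuming the `en_core_sci_sm` model has been installed per the scispaCy README:

```python
import spacy

# Pretrained scispaCy model; swap in en_core_sci_md/lg or an NER-specific
# model like en_ner_bc5cdr_md depending on which one we settle on.
nlp = spacy.load("en_core_sci_sm")

with open("abstracts.txt") as f:
    abstracts = [line.strip() for line in f if line.strip()]

# nlp.pipe streams texts through the model in batches, which is much
# faster than calling nlp() on ~341k abstracts one at a time.
entities_per_abstract = []
for doc in nlp.pipe(abstracts, batch_size=100):
    entities_per_abstract.append([ent.text.lower() for ent in doc.ents])
```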
[x] Aggregate the entities together across all the abstracts
[x] Match those up with concept IDs
This is the part I thought about for the longest time... I think the easiest way to do this is to also run the concept-name column of the dictionary file through the scispaCy model(s), so that each concept ID can be linked to one or more entities.
Similar to the challenge of aggregating entities across abstracts, we may need to take a shortcut like keeping only one-word entities, or do some smart fuzzy string matching to map the entities from the abstracts to concept IDs (rough sketch below).
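A rough sketch of that matching idea, using the stdlib for the fuzzy fallback. The dictionary file name and column names (`concepts.csv`, `concept_id`, `concept_name`) are placeholders for whatever the real file uses:

```python
import csv
from collections import Counter
from difflib import get_close_matches
from functools import lru_cache

# Build a lowercase concept-name -> concept-id lookup table.
concept_by_name = {}
with open("concepts.csv") as f:
    for row in csv.DictReader(f):
        concept_by_name[row["concept_name"].lower()] = row["concept_id"]
names = list(concept_by_name)

@lru_cache(maxsize=None)
def entity_to_concept(entity):
    """Exact match first, then fall back to fuzzy string matching."""
    if entity in concept_by_name:
        return concept_by_name[entity]
    close = get_close_matches(entity, names, n=1, cutoff=0.9)
    return concept_by_name[close[0]] if close else None

# Aggregate concept frequencies across all abstracts
# (entities_per_abstract comes from the extraction sketch above).
concept_counts = Counter()
for ents in entities_per_abstract:
    for ent in ents:
        cid = entity_to_concept(ent)
        if cid is not None:
            concept_counts[cid] += 1
```

`get_close_matches` scans every concept name on each cache miss, so the `lru_cache` matters; if it's still too slow, scispaCy's `EntityLinker` component links entities straight to UMLS concept IDs and could replace this step entirely.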
[x] Identify concept IDs that are important based on frequency
[ ] And then finally use those features in our machine learning model (feature sketch below)
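A sketch of those last two steps, reusing `concept_counts` and `entity_to_concept` from the sketch above (the top-500 cutoff is arbitrary):

```python
# Keep the K most frequent concept IDs as the feature vocabulary.
top_concepts = [cid for cid, _ in concept_counts.most_common(500)]
index = {cid: i for i, cid in enumerate(top_concepts)}

# One bag-of-concepts count vector per abstract.
features = []
for ents in entities_per_abstract:
    row = [0] * len(top_concepts)
    for ent in ents:
        cid = entity_to_concept(ent)
        if cid in index:
            row[index[cid]] += 1
    features.append(row)

# `features` can now feed any scikit-learn-style model.
```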
Updated the checklist; AFAIK from your update we can close this issue (given the framework and initial results are set up)!
Let's open a separate, more specific issue for the next steps during the meeting tomorrow.
Goal: using the CORD dataset, write a script that aggregates word frequencies across the different texts (feel free to add/adjust the analysis as you see fit). Incorporate a Python NLP library of your choice (e.g. spaCy, CoreNLP).
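For the baseline word-frequency version of this goal, a few lines of plain spaCy suffice (a scispaCy model can be dropped in for the biomedical pipeline above):

```python
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

# Aggregate lemma frequencies across all abstracts, skipping
# stop words and non-alphabetic tokens.
freq = Counter()
with open("abstracts.txt") as f:
    for doc in nlp.pipe(f, batch_size=100):
        freq.update(
            tok.lemma_.lower()
            for tok in doc
            if tok.is_alpha and not tok.is_stop
        )

print(freq.most_common(25))
```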