Set up a sentence classifier to identify positive sentences by datatype

WormBase / ACKnowledge

Author Curation to Knowledgebases

MIT License

Set up a sentence classifier to identify positive sentences by datatype #241

Closed: valearna closed this issue 12 months ago

valearna commented 2 years ago

Some ideas on how to design the classifier:

These solutions need a training set of positive and negative sentences. We could build the negative set by taking every sentence in the results sections of all curated papers that is not identified as positive in WB. However, some positive sentences that are not curatable could remain in this set.
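A minimal sketch of that negative-set idea, assuming sentences are plain strings and that matching is done on lowercased, whitespace-normalized text. The function name `build_negative_set` and the example sentences are hypothetical, not taken from the WormBase code.

```python
def normalize(sentence):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(sentence.lower().split())


def build_negative_set(results_sentences, positive_sentences):
    """Return results-section sentences that are not marked positive.

    Hypothetical helper: duplicates are dropped and the original order is kept.
    """
    positives = {normalize(s) for s in positive_sentences}
    seen = set()
    negatives = []
    for sentence in results_sentences:
        key = normalize(sentence)
        if key in positives or key in seen:
            continue
        seen.add(key)
        negatives.append(sentence)
    return negatives


# Made-up example data.
results = [
    "GFP expression was observed in the pharynx.",
    "Worms were grown at 20 degrees.",
    "The protein localizes to the nucleus.",
    "Worms were grown at 20 degrees.",
]
positives = ["gfp expression was observed in the  pharynx."]

candidates = build_negative_set(results, positives)
```

Here "The protein localizes to the nucleus." lands in `candidates` even though it reads like a positive sentence, which is the contamination risk mentioned above. A manual review pass over the candidates would be one way to limit it.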

https://towardsdatascience.com/text-classification-with-no-model-training-935fe0e42180

This solution is similar to what we have already done with text similarity, but it uses BERT (we could use BioBERT) to embed the sentences.
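A rough sketch of classification by embedding similarity, with no model training. To stay self-contained it uses a toy bag-of-words embedding as a stand-in; in practice `embed` would be replaced by a call to a BERT or BioBERT sentence encoder. All function names and example sentences here are illustrative assumptions.

```python
import math


def build_vocab(sentences):
    """Map each token seen in the reference sentences to a vector index."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def embed(sentence, vocab):
    """Toy stand-in for a BERT/BioBERT encoder: token counts over a fixed vocab."""
    vector = [0.0] * len(vocab)
    for token in sentence.lower().split():
        if token in vocab:
            vector[vocab[token]] += 1.0
    return vector


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up positive examples for one datatype.
positive_examples = ["gfp expressed in pharynx", "gfp expressed in neurons"]
vocab = build_vocab(positive_examples)
datatype_centroid = centroid([embed(s, vocab) for s in positive_examples])

similar = cosine(embed("gfp expressed in muscle", vocab), datatype_centroid)
unrelated = cosine(embed("worms grown on plates", vocab), datatype_centroid)
```

With a real encoder the vectors would be dense and out-of-vocabulary words would not simply vanish, so the scores would differ from this toy case. The comparison step stays the same.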

valearna commented 2 years ago

More details here: https://docs.google.com/document/d/18eNO_D3Hj4dsoY9_Q0AsBBN_i3h0QFQXTCGQfgc65KA/edit

valearna commented 2 years ago

Created a new repository for sentence analysis and classification: https://github.com/WormBase/curation-sentence-classification

draciti commented 2 years ago

Negative training set here: https://docs.google.com/spreadsheets/d/1V-4Q_XizwYBMfl01Zj81wdFDBlQCmyElSRZ5GssKDNA/edit#gid=0

valearna commented 2 years ago

Validation set ready here: https://docs.google.com/spreadsheets/d/1ylyw7Qx3KNHK9-3vcappujgSdOjFLyfpNlJKNu4l9V4/edit#gid=1669548336

I extracted the 30 papers defined in the gdoc (5 NN high not validated, 5 NN med, 5 NN low and 15 NN neg), preprocessed and cleaned their sentences, and then took a random subset of 1000 sentences. For each sentence, I calculated the cosine similarity to the exp_pattern and subcellloc centroids and marked it as EXP_PATTERN_POSITIVE and/or SUBCELLLOC_POSITIVE if the similarity to the respective centroid was > 0.45 (the threshold yielding the best F1 score in the analysis). @draciti, please add an additional column with your evaluation of the EXP_PATTERN and SUBCELLLOC classification (True if positive, False if negative).
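For reference, a small sketch of the flagging rule and of how a threshold can be picked by F1 once manual True/False labels are available. It assumes the cosine similarities are already computed. The helper names, the candidate thresholds, and the scores are made up for illustration and are not the values from the actual analysis.

```python
def flag_sentence(similarities, threshold=0.45):
    """Return the *_POSITIVE flags for one sentence.

    `similarities` maps a datatype name (e.g. "exp_pattern") to the cosine
    similarity between the sentence and that datatype's centroid.
    """
    return [
        name.upper() + "_POSITIVE"
        for name, score in similarities.items()
        if score > threshold
    ]


def f1_at_threshold(scores, labels, threshold):
    """F1 of the rule `score > threshold` against manual True/False labels."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(scores, labels, candidates):
    """Candidate threshold with the highest F1 (the first one wins on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


# Made-up similarities for one sentence.
flags = flag_sentence({"exp_pattern": 0.52, "subcellloc": 0.31})

# Made-up scores and manual labels for one datatype.
scores = [0.9, 0.6, 0.5, 0.3, 0.2]
labels = [True, True, True, False, False]
chosen = best_threshold(scores, labels, [0.25, 0.45, 0.55])
```

The same sweep could be rerun per datatype once the manual evaluation column is filled in, since the best threshold for EXP_PATTERN need not match the one for SUBCELLLOC.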

valearna commented 2 years ago

Added paper IDs to the validation set.

draciti commented 12 months ago

This is done for gene expression and catalytic activity.