Closed by valearna 12 months ago
Created a new repository for sentence analysis and classification: https://github.com/WormBase/curation-sentence-classification
Negative training set here: https://docs.google.com/spreadsheets/d/1V-4Q_XizwYBMfl01Zj81wdFDBlQCmyElSRZ5GssKDNA/edit#gid=0
Validation set ready here: https://docs.google.com/spreadsheets/d/1ylyw7Qx3KNHK9-3vcappujgSdOjFLyfpNlJKNu4l9V4/edit#gid=1669548336
I extracted the 30 papers defined in the gdoc (5 NN high not validated, 5 NN med, 5 NN low and 15 NN neg), preprocessed and cleaned their sentences, and then took a random subset of 1000 sentences. For each sentence, I calculated the cosine similarity between the sentence and each of the exp_pattern and subcellloc centroids, and marked the sentence as EXP_PATTERN_POSITIVE and/or SUBCELLLOC_POSITIVE if its similarity to the respective centroid was > 0.45 (the threshold yielding the best F1 score in the analysis). @draciti please add an additional column with your evaluation of the EXP_PATTERN and SUBCELLLOC classification (True if positive, False if negative).
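For reference, the labeling step described above can be sketched roughly as follows. This is a minimal illustration, not the code in the repository: the function names (`cosine_similarity`, `centroid`, `label_sentence`) and the 3-dimensional toy vectors are assumptions made for the example, and real sentence embeddings would have many more dimensions. Only the 0.45 threshold and the label names come from the comment.

```python
import math

THRESHOLD = 0.45  # threshold quoted in the comment above


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def label_sentence(sentence_vec, centroids, threshold=THRESHOLD):
    """Return one boolean label per data type, e.g. EXP_PATTERN_POSITIVE."""
    return {
        f"{name.upper()}_POSITIVE": cosine_similarity(sentence_vec, c) > threshold
        for name, c in centroids.items()
    }


# Toy embeddings (hypothetical), standing in for real sentence vectors.
centroids = {
    "exp_pattern": centroid([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]),
    "subcellloc": centroid([[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]]),
}
labels = label_sentence([0.9, 0.1, 0.0], centroids)
print(labels)
```

A sentence can receive both labels, since each centroid is compared independently.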
Added paper IDs to the validation set.
This is done for gene expression and catalytic activity.
Some ideas on how to design the classifier:
These solutions need a training set of positive and negative sentences. We could identify negative sentences by taking all sentences in the results sections of all curated papers that are not identified as positive in WB, but some positive sentences that are not curatable could remain in this set.
This solution is similar to what we have already done with text similarity, but uses BERT (we could use BioBERT) to embed the sentences.
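A rough sketch of how the negative candidates could be collected, assuming the results-section sentences and the sentences marked positive in WB are already available as plain lists of strings. The function name `negative_candidates` and the whitespace/case normalization are assumptions for illustration only; the example sentences are made up.

```python
def negative_candidates(results_sentences, positive_sentences):
    """Keep results-section sentences that do not match any known positive sentence."""

    def normalize(sentence):
        # Lowercase and collapse whitespace so trivial differences do not block a match.
        return " ".join(sentence.lower().split())

    positives = {normalize(s) for s in positive_sentences}
    return [s for s in results_sentences if normalize(s) not in positives]


# Made-up example data.
results = [
    "GFP was expressed in the pharynx.",
    "Worms were grown at 20 C.",
    "Mutants  were scored blind.",
]
positives = ["gfp was expressed in the  pharynx."]
negatives = negative_candidates(results, positives)
print(negatives)
```

This only removes sentences already known to be positive, so the caveat above still applies: positive but non-curatable sentences would stay in the negative set unless they are filtered some other way.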