PathwayCommons / factoid

A project to capture biological pathway data from academic papers
https://biofactoid.org
MIT License

Novel interaction notification #858

Closed maxkfranz closed 3 years ago

maxkfranz commented 4 years ago

Description

Q: What is the name of the feature?

A: Novel interaction notification

Q: What does this feature enable the user to do?

A: User X gets a notification when a novel interaction has been listed in Biofactoid and that interaction is relevant to User X's research.

Benefits:

Q: What information must the user provide to use the feature?

A: A different author, User A, creates a factoid with a novel interaction. User X shouldn't have to provide any information.

Q: What are the applicable constraints, e.g. compatibility or performance?

A:

Q: How does this feature affect each class of user (persona)?

A:

Specification

Mockup

(Email to User X)

Biofactoid has found a novel interaction that we have determined to be relevant to your research using Biofactoid's advanced AI. Biofactoid is an app that allows authors to create a digital profile of scientific discoveries in an article and connects it to related research.

You can see the new interaction here: https://biofactoid.org/document/{DOCUMENT_ID}

You can connect your own findings with other researchers by adding your articles to Biofactoid. Get started at biofactoid.org!
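A minimal sketch of how the {DOCUMENT_ID} placeholder in the mockup might be filled in. The helper names `documentUrl` and `renderNotification` are hypothetical and the shortened body text is only illustrative; only the URL pattern comes from the mockup.

```javascript
// Hypothetical helpers: only the URL pattern is taken from the mockup above.
const BASE_URL = 'https://biofactoid.org';

function documentUrl(documentId) {
  if (typeof documentId !== 'string' || documentId.length === 0) {
    throw new TypeError('documentId must be a non-empty string');
  }
  // Encode the id so that unexpected characters cannot break the link.
  return `${BASE_URL}/document/${encodeURIComponent(documentId)}`;
}

function renderNotification(documentId) {
  // Plain-text body; paragraphs are separated by a blank line.
  return [
    'Biofactoid has found a novel interaction that may be relevant to your research.',
    `You can see the new interaction here: ${documentUrl(documentId)}`,
    'Get started at biofactoid.org!'
  ].join('\n\n');
}
```

Keeping the URL construction in one function means the template and any later one-click links would share the same encoding rules.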

Details

maxkfranz commented 4 years ago

@jvwong, we should iterate on the email template mockup a bit.

maxkfranz commented 4 years ago

@cannin, this would be a great motivation to work on creating a more accurate script that can flag articles that are 'factoidable'. See the 'bonus' item under 'Details'.

jvwong commented 4 years ago

My to-do list includes creating a mini test set of PMIDs, a subset of which I have flagged as hits, mainly from Mol Cell and Cell Reports (~200 hits for ? total).

maxkfranz commented 3 years ago

Adding @JohnGiorgi into the loop re. the 'factoidable' article detection

JohnGiorgi commented 3 years ago

@maxkfranz Cool!

I started prototyping something here. It uses AutoML to build/train a classifier for predicting "factoidable"/"not factoidable" based on @jvwong's labelled data.

Accuracy is 100% on the train set and 84% on the held-out test set, but @jvwong found the precision is actually much lower. The test set is not balanced, so accuracy is not an appropriate metric.
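To illustrate why accuracy can look fine while precision is poor on an imbalanced set, here is a small sketch. The labels and predictions are made up for illustration and are not the real test data; `classificationMetrics` is a hypothetical helper.

```javascript
// Computes accuracy, precision and recall from boolean labels and predictions.
function classificationMetrics(labels, predictions) {
  if (labels.length !== predictions.length) {
    throw new RangeError('labels and predictions must have the same length');
  }
  let tp = 0, fp = 0, tn = 0, fn = 0;
  labels.forEach((label, i) => {
    const predicted = predictions[i];
    if (predicted && label) tp++;
    else if (predicted && !label) fp++;
    else if (!predicted && label) fn++;
    else tn++;
  });
  // Return NaN rather than dividing by zero when a metric is undefined.
  const ratio = (num, den) => (den === 0 ? NaN : num / den);
  return {
    accuracy: ratio(tp + tn, labels.length),
    precision: ratio(tp, tp + fp),
    recall: ratio(tp, tp + fn)
  };
}

// Made-up, imbalanced example: only 2 of 10 articles are "factoidable".
const labels      = [true, true,  false, false, false, false, false, false, false, false];
const predictions = [true, false, true,  true,  false, false, false, false, false, false];
const metrics = classificationMetrics(labels, predictions);
// accuracy is 7/10 while precision is only 1/3 and recall is 1/2
```

Because most articles are negatives, the true negatives dominate the accuracy figure even though two of the three positive predictions are wrong.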

A very informative experiment would be to train the model on fractions of our available training data (e.g. 25%, 50%, 75%, 100%) and plot the performance on the held-out test set. I suspect we have far too little labelled data to take advantage of AutoML right now. I can try to get to this experiment this week.
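The experiment could be sketched roughly as follows. The `train` and `evaluate` callbacks are hypothetical stand-ins for whatever the real classifier exposes, and the majority-label "model" exists only so the sketch runs; it is not the AutoML prototype.

```javascript
// Trains on growing prefixes of the training data and scores each model on
// the same held-out set. Shuffle trainSet beforehand so prefixes are random.
function learningCurve(trainSet, testSet, fractions, train, evaluate) {
  return fractions.map(fraction => {
    const n = Math.max(1, Math.round(fraction * trainSet.length));
    const model = train(trainSet.slice(0, n));
    return { fraction, trainSize: n, score: evaluate(model, testSet) };
  });
}

// Toy stand-ins: a "model" that always predicts the majority training label.
const trainMajority = examples => {
  const positives = examples.filter(e => e.label).length;
  return { predict: () => positives * 2 >= examples.length };
};
// Toy score; for imbalanced data, precision or recall would be a better choice.
const accuracy = (model, examples) =>
  examples.filter(e => model.predict(e) === e.label).length / examples.length;

const trainSet = [true, false, false, false].map(label => ({ label }));
const testSet = [false, false, true, false].map(label => ({ label }));
const curve = learningCurve(trainSet, testSet, [0.25, 0.5, 0.75, 1], trainMajority, accuracy);
// curve holds one { fraction, trainSize, score } point per fraction, ready to plot
```

If the score is still climbing at 100% of the data, that would support the suspicion that more labelled examples are needed.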

JohnGiorgi commented 3 years ago

Okay here is that plot:

(Image: plot of performance against the amount of training data used)

So performance on the train and validation sets improves as more data is used to train the system. Weird that performance hits 100% on the validation set. I'll chalk this up to the fact that it is tiny (~10 examples).

Either way, this motivates collecting more labelled data if you want to go the AutoML route.

gbader commented 3 years ago

Hi - there are many existing training data sets that we could use, which I'm sure would be roughly equivalent to the types of papers classified by Jeff, so we shouldn't need to create our own. FYI, we made something like this for a similar purpose, for protein interaction papers, in 2003: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-4-11

JohnGiorgi commented 3 years ago

@gbader I didn't think about using PPI IE data. Does that study have abstracts with binary labels (contains a PPI or doesn't)? Is there somewhere I can access it?

maxkfranz commented 3 years ago

@jvwong, would you update the unstable instance's environment variables so that it sends out the emails to the support address?

jvwong commented 3 years ago

done.

maxkfranz commented 3 years ago

Added date filtering and rate-limiting to the todos.

maxkfranz commented 3 years ago

Re. rate limiting: https://github.com/sindresorhus/p-throttle

jvwong commented 3 years ago

Re. rate limiting, I think the ranking and filtering do a decent job of cutting down the number of emails, from what I've seen. Also, the limits for MailJet are per hour. So maybe punt until we see it being a problem.
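If an hourly cap does become necessary, the idea could look roughly like this. It is a hand-rolled sliding-window limiter for illustration, not the p-throttle API; `createRateLimiter` is a hypothetical name, and the quota of 2 per hour is a made-up number, not MailJet's actual limit.

```javascript
// Allows at most `limit` sends within any window of `intervalMs` milliseconds.
// `now` is injectable so the behaviour can be checked without waiting.
function createRateLimiter({ limit, intervalMs, now = Date.now }) {
  const sentAt = []; // timestamps of sends still inside the window
  return function tryAcquire() {
    const t = now();
    // Drop timestamps that have fallen out of the window.
    while (sentAt.length > 0 && t - sentAt[0] >= intervalMs) sentAt.shift();
    if (sentAt.length >= limit) return false; // over quota: caller retries later
    sentAt.push(t);
    return true;
  };
}

// Usage with a fake clock and a made-up quota of 2 sends per hour.
const HOUR_MS = 60 * 60 * 1000;
let fakeTime = 0;
const tryAcquire = createRateLimiter({ limit: 2, intervalMs: HOUR_MS, now: () => fakeTime });
const results = [tryAcquire(), tryAcquire(), tryAcquire()]; // third call is refused
fakeTime += HOUR_MS;
const afterAnHour = tryAcquire(); // window has rolled over, so this succeeds
```

Emails that are refused would need to be queued and retried, which is the part a library such as p-throttle handles by delaying calls instead of rejecting them.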

maxkfranz commented 3 years ago

Another filter would be (if we’re not already):

Intersect the list of papers (from refs) with the related papers (30) shortlist. That’s a simple way to enforce that the email papers have a high score.
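A minimal sketch of that intersection, assuming each paper is an object with a `pmid` string. The field name and `intersectByPmid` are hypothetical; the real data model may differ.

```javascript
// Keeps only the referenced papers that also appear in the related-papers shortlist.
function intersectByPmid(referencedPapers, relatedShortlist) {
  const shortlisted = new Set(relatedShortlist.map(p => p.pmid));
  // The order of the reference list is preserved; only membership matters.
  return referencedPapers.filter(p => shortlisted.has(p.pmid));
}

// Made-up identifiers for illustration.
const referenced = [{ pmid: '111' }, { pmid: '222' }, { pmid: '333' }];
const shortlist = [{ pmid: '333' }, { pmid: '111' }, { pmid: '444' }];
const emailCandidates = intersectByPmid(referenced, shortlist);
// emailCandidates contains the papers with pmid '111' and '333'
```

Using a Set keeps the lookup cheap even if the reference list is long, although with a shortlist of about 30 papers either approach would be fast.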

On Nov 12, 2020, at 11:40, Jeffrey notifications@github.com wrote:

 Re Rate limiting, I think that the ranking and filtering does a decent job of cutting down from what I've seen. Also the limits for MailJet are per hour. So maybe punt until we see it being a problem.


maxkfranz commented 3 years ago

Re. rate limiting, I just want to have a concrete plan in case things go sideways. Let’s see how the numbers work in practice before we take action.


maxkfranz commented 3 years ago

Closing. Enhancements such as one-click editor links can be assigned to new issues.