PathwayCommons / factoid

A project to capture biological pathway data from academic papers
https://biofactoid.org
MIT License

No related papers results #742

Closed maxkfranz closed 4 years ago

maxkfranz commented 4 years ago

I tried using the related papers feature by submitting a simple paper in a local instance. The instance points to the public instance of the semantic-search service.

Here is the paper: MeCP2 Facilitates Breast Cancer Growth via Promoting Ubiquitination-Mediated P53 Degradation by Inhibiting RPL5/RPL11 Transcription. The Pubmed metadata is populated, so this issue seems to be distinct from #738.

The document is just 'MDM2 inhibits P53 via binding'. I know that the paper would have more interactions, but this single interaction should be sufficient for testing.

[Screenshot: Screen Shot 2020-06-03 at 4 14 48 PM]

There are errors in the console:

```
info:    Searching the related papers table for document  c95d7143-f950-4662-a4ad-d05677734783
info:    PATCH /api/document/status/c95d7143-f950-4662-a4ad-d05677734783/743fdc49-a8c6-4f81-a90d-6ae44bb7df49 500 251.431 ms - 142
[object Object]
```

And doc.relatedPapers() is undefined.
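The `[object Object]` in the log usually means an Error (or a plain object) was concatenated into a string. A minimal sketch of a safer formatter, assuming a generic logger (names hypothetical):

```javascript
// Hypothetical helper: serialize unknown error values before logging,
// so an Error's stack (or an object's JSON) survives string conversion.
function formatError(err) {
  if (err instanceof Error) {
    return err.stack || err.message; // keep the stack trace when available
  }
  try {
    return JSON.stringify(err); // plain objects become readable JSON
  } catch {
    return String(err); // last resort for circular or exotic values
  }
}
```

Something like `logger.error(formatError(err))` would then avoid the opaque `[object Object]` output.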

jvwong commented 4 years ago

I could reproduce the missing relatedPapers last night but this morning it works fine? Can you try again?

There is some somewhat related refactoring that needs to be done (in other PRs):

jvwong commented 4 years ago

Just a sec, the error is specific to your example with INDRA: human mdm2 and p53...

jvwong commented 4 years ago

My best guess is that the semantic-search instance isn't scaling well with the number of papers (aka request body documents) it is asked to process.

I was trying three cases:

jvwong commented 4 years ago

Tried to run semantic-search locally, and Activity Monitor showed it eating up nearly all of the CPU.

maxkfranz commented 4 years ago

How about we use a simple filter after the indra stage, e.g. N most recent articles?

If the related papers feature is for discovery, you want papers that you haven't seen yet. Unless you're already spending a lot of time and effort keeping on top of the latest research, there are probably several new and interesting papers you haven't come across.

So a chronological filter is probably fine. We don’t need the feature to show every paper that’s interesting: It just needs to show some papers that are interesting.

If we want to optimise for the large-N edge cases to allow for older papers too, we could address that later.

I’m OK with using a chronological filter, unless someone has in mind an alternate simple metric that would work better. Any ideas?
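A chronological pre-filter like the one proposed could be sketched as follows (a sketch only; the `pubDate` field and function name are hypothetical):

```javascript
// Keep only the N most recently published candidate papers before
// handing them off to semantic search for ranking.
function mostRecentN(papers, n) {
  return papers
    .slice() // copy so the caller's array is not mutated
    .sort((a, b) => new Date(b.pubDate) - new Date(a.pubDate)) // newest first
    .slice(0, n);
}
```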


metincansiper commented 4 years ago

How about we use a simple filter after the indra stage, e.g. N most recent articles?

Sounds good. We had a flag for sorting by date. I think we would not need to do any filtering when that flag is set. However, when it is unset (by default) we can filter the N most recent articles first then can run the semantic search on these N articles as you mentioned.

BTW we also had a score threshold and we were not including the papers below the threshold. I think we can keep doing it.
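Putting the flag, the recency pre-filter, and the score threshold together might look like this (all names and defaults are hypothetical; `search` stands in for the semantic-search call):

```javascript
// When sortByDate is set, return candidates newest-first with no filtering.
// Otherwise: take the N most recent, rank them with semantic search, and
// drop results scoring below the threshold.
function rankCandidates(doc, papers, opts, search) {
  const { sortByDate = false, n = 100, minScore = 0.5 } = opts || {};
  const byDate = [...papers].sort(
    (a, b) => new Date(b.pubDate) - new Date(a.pubDate)
  );
  if (sortByDate) return byDate;
  return search(doc, byDate.slice(0, n)).filter(r => r.score >= minScore);
}
```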

jvwong commented 4 years ago

Some more direct testing of the semantic-search instances, using curl and Insomnia (libcurl/7.69.1).

```
> POST / HTTP/2
> Host: semanticsearch.baderlab.org
> user-agent: insomnia/2020.2.1
> content-type: application/json
> accept: application/json
> content-length: 142247

* HTTP/2 stream 0 was not closed cleanly: Unknown error code (err 1)
* stopped the pause stream!
```

Similarly with curl: `curl: (92) HTTP/2 stream 0 was not closed cleanly: PROTOCOL_ERROR (err 1)`

jvwong commented 4 years ago

Full docker setup running @ https://master.factoid.baderlab.org

Submit: OK for > 1000 documents (~1.6s/doc)
/api/document/related-papers/:id: Gateway Timeout
maxkfranz commented 4 years ago

Is the VM overloaded? It could be that more CPU and RAM should be allocated.

On Jun 5, 2020, at 1:44 PM, Jeffrey wrote:

Full docker setup running @ https://master.factoid.baderlab.org
Submit: OK for > 1000 documents (~1.6s/doc)
/api/document/related-papers/:id: Gateway Timeout

jvwong commented 4 years ago

Yeah, this thing is CPU-bound:

| IP | Processor model name | # cores | RAM (MB) |
| --- | --- | --- | --- |
| X.X.X.174 (dev) | Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz | 4 | 32,175 |
| X.X.X.195 (prod) | Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz | 2 | 128,942 |
maxkfranz commented 4 years ago

It's worth a shot to try increasing the number of cores per VM.

jvwong commented 4 years ago

OK I can ask JL.

maxkfranz commented 4 years ago

I tried some other queries, and I just get empty results (i.e. [ ]). There may be issues other than server stress

maxkfranz commented 4 years ago

I think we should put together a small set of test cases to make sure we're all on the same page with our testing. A test case should always use the same paper and the same network topology. Two cases to start is probably enough to catch most issues:

(1) Minimum: The document has two entities and one interaction. The list of papers from Indra is small.

(2) Large: The document has a larger number of interactions, perhaps three. The list of papers from Indra is large. Maybe this is a p53 paper.

jvwong commented 4 years ago

This is the test suite I've been using. I can email you the whole google sheet.

Table 1. Molecular Cell case studies. Participants were entered as free text; grounded identifiers are shown in parentheses (NCBI Gene unless noted).

| Species | Pathway | Participant (Source) | Participant (Target) | Type | PubMed | INDRA statements | Total evidence |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Homo sapiens (9606) | Antiviral stimulator of interferon genes (STING) | SLC19A1 (6573) | 2'3'-cGAMP (ChEBI 75947) | Binding | https://www.ncbi.nlm.nih.gov/pubmed/31126740 | 0 | 0 |
| Homo sapiens | Mitochondrial metabolic adaption | SENP1 (29843) | Sirt3 (23410) | De-SUMOylation (Activation) | https://www.ncbi.nlm.nih.gov/pubmed/31302001 | 1 | 1 |
| Homo sapiens | Histone gene regulation | HAT1 (8520) | HIST1H4E (8367) | Transcription; Binding; Modification | https://www.ncbi.nlm.nih.gov/pubmed/31278053 | 0 | 0 |
| Mus musculus (10090) | MYC-E2F cell cycle | MYC (17869) | E2F1 (13555) | Transcription | https://www.ncbi.nlm.nih.gov/pubmed/21292160 | 18 | 184 |
| Homo sapiens | Interferon signaling | DAPK1 (1612) | RIG-I (23586) | Phosphorylation/inhibition | https://www.ncbi.nlm.nih.gov/pubmed/28132841 | 5 | 11 |
| Homo sapiens | DNA Repair | PRMT5 (10419) | RUVBL1 (8607) | Methylation | https://www.ncbi.nlm.nih.gov/pubmed/28238654 | 4 | 21 |
| Homo sapiens | Autophagy | PGK1 (5230) | Beclin1 (8678) | Phosphorylation/inhibition | https://www.ncbi.nlm.nih.gov/pubmed/28238651 | 10 | 65 |
| Arabidopsis | Cold response | CRPK1 (830909) | 14-3-3 lambda (838236) | Phosphorylation/activation | https://www.ncbi.nlm.nih.gov/pubmed/28344081 | 0 | 0 |
| Homo sapiens | Glycolysis | mTORC1 (2475, MTOR) | GSK3B (2932) | Phosphorylation/inhibition | https://www.ncbi.nlm.nih.gov/pubmed/29861159 | 0 | 0 |
| Homo sapiens | Anoikis and metastases | PLAG1 (5324) | GDH1 (854557) | Transcription | https://www.ncbi.nlm.nih.gov/pubmed/29249655 | 2 | 2 |
| Homo sapiens | Warburg effect | lincRNA-p21 (102800311) | HIF-1α (3091) | Stabilization | https://www.ncbi.nlm.nih.gov/pubmed/24316222 | 3 | 7 |
| Homo sapiens | Microenvironment Remodeling | CamK-A (85451) | PNCK (8536) | Binding/Activation | https://www.ncbi.nlm.nih.gov/pubmed/30220561 | 1 | 1 |
| Homo sapiens | DNA recombination | REC114 (283677) | ANKRD31 (256006) | Binding | https://www.ncbi.nlm.nih.gov/pubmed/31003867 | 0 | 0 |
| Homo sapiens | Nutrition starvation | PHF5A (84844) | KDM3A (55818) | Expression/Activation | https://www.ncbi.nlm.nih.gov/pubmed/31054974 | 1 | 1 |

Table 2. Classic pathways across model organisms. Participants were entered as free text; NCBI Gene identifiers are shown in parentheses.

| Species | Pathway | Participant (Source) | Participant (Target) | Type | PubMed | INDRA statements | Total evidence |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Homo sapiens | Transforming Growth Factor beta signalling | TGF-beta 1 (7046) | Smad2 (4087) | Phosphorylation | https://www.ncbi.nlm.nih.gov/pubmed/9311995 | 5 | 842 |
| Drosophila melanogaster | Transforming Growth Factor beta signalling | Babo (35900) | dSmad2 (31738) | Phosphorylation | https://www.ncbi.nlm.nih.gov/pubmed/9887103 | 2 | 2 |
| Saccharomyces cerevisiae | Mating-pheromone response | Ste5 (851680) | Ste11 (851076) | Complex | https://www.ncbi.nlm.nih.gov/pubmed/8062390 | 19 | 70 |
| Escherichia coli | SOS Response | lexA (948544) | recA (947170) | Repression | https://www.ncbi.nlm.nih.gov/pubmed/7027256 | 9 | 201 |
| Caenorhabditis elegans | Insulin-like signalling | SGK-1 (181697) | DAF-16 (172981) | Phosphorylation | https://www.ncbi.nlm.nih.gov/pubmed/15068796 | 5 | 31 |
| Arabidopsis thaliana | Ethylene Signaling | CTR1 (831748) | EIN2 (831889) | Phosphorylation | https://www.ncbi.nlm.nih.gov/pubmed/23132950 | 10 | 96 |
| Danio rerio | p53 apoptosis | p53 (30590) | bcl2L (114401) | Transcription | https://www.ncbi.nlm.nih.gov/pubmed/19204115 | 0 | 0 |
maxkfranz commented 4 years ago

Looks good. Do all of these cases correspond to documents with a single interaction?

It looks like Indra doesn't give good results in several cases. For example, I get a better result for the first case by searching the Pubmed API with SLC19A1 and cGAMP.

There are several of your cases for which Indra returns no results or very few results. The related papers feature wouldn't be very compelling if we show zero or one papers a lot of the time. Some of those failures may be on Indra, but I suspect that several are due to how we're using it.

Our approach is this:

  1. Get a list of papers by querying Indra: Query for papers that have at least one of the interactions within the factoid.
  2. Sort the list of papers using Semantic Search.

Is it truly reasonable to do a network-based query to get the initial set of papers? It seems as though using network overlap may be too strict. If my paper has a novel interaction, for example, I'm always going to get no related papers for my factoid.
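The current two-step flow could be sketched like this (a sketch under stated assumptions: `queryIndra` and `semanticSort` are placeholders for the real service calls, and papers are keyed by `pmid`):

```javascript
// Step 1: one Indra query per interaction; union the hits, de-duplicated.
// Step 2: semantic search ranks the combined candidate set against the doc.
function relatedPapersForDocument(doc, queryIndra, semanticSort) {
  const seen = new Map();
  for (const intn of doc.interactions) {
    for (const paper of queryIndra(intn)) {
      seen.set(paper.pmid, paper); // de-duplicate across interactions
    }
  }
  return semanticSort(doc, [...seen.values()]);
}
```

Framed this way, the critique is about step 1: if an interaction is novel, `queryIndra` returns nothing and there is nothing for step 2 to rank.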

metincansiper commented 4 years ago

Shall I go ahead and implement filtering to the N most recent papers before running the semantic search, as @maxkfranz suggested?

jvwong commented 4 years ago

It's worth a shot to try increasing the number of cores per VM.

The short story is, semanticsearch.baderlab.org works fine now with large numbers of docs (at least 1009 for p53-mdm2)...to a limit.

Long story, problems running things through the URL:

I also got JL to increase the number of cores to 4 now. This wasn't actually the problem, as you can see.

I guess you can see this poses a problem if this service gets called frequently or with large numbers of papers. The p53-mdm2 example takes around 16 minutes, or ~0.9 s per doc. Any cron job hitting this endpoint would also have to be considered.

maxkfranz commented 4 years ago

Shall I go ahead and implement filtering to the N most recent papers before running the semantic search, as @maxkfranz suggested?

That sounds good. That feature will be useful to have now, and it will continue to be useful even if we make other improvements to the system in the future.

maxkfranz commented 4 years ago

All right. Now the sorting should be more reliable. The next problem to solve is the lack of initial results from Indra: Much of the time, we get one result or no results. In order to have an effective recommendation system, there should be several (e.g. six or more) results.

While the current results from Indra have high signal, they are insufficient alone. We need to cast a wider net to get a bigger list of initial papers. Some potential approaches for this may be:

Another consideration is that we should support relatedPapers data for each entity in addition to each interaction. Otherwise, we don't have much of value to show when a user clicks an entity in the explore view. This reinforces the need to query for related papers by entity.

So the new process would be something like this, with new steps in bold:

  1. New: Query for a list of initial papers by the list of entities in the factoid -- one query per entity.
  2. Query for a list of initial papers by the list of interactions in the factoid -- one query per interaction.
  3. New: The list of relatedPapers for an entity is the set for that entity from (1) sorted by semantic search.
  4. The list of relatedPapers for an interaction is the set for that interaction from (2) sorted by semantic search.
  5. The list of relatedPapers for a document is:
    1. The set of all initial papers for all interactions, sorted by semantic search.
    2. New: The set of all initial papers for all entities, sorted by semantic search.
    3. New: (5i) and (5ii) may be done separately, with the interaction results listed before the entity results, or together, if we allow semantic search to rank all of the results rather than prioritising interaction results.

@metincansiper, do you think that this is a reasonable next implementation step? @metincansiper & @jvwong, do you have suggestions for how this process could be improved? The steps above would put more load on semantic search, so we may need to give its stability more consideration.
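A minimal sketch of the proposed steps (1) through (5), with all names hypothetical and the entity/interaction query functions injected as placeholders:

```javascript
// Steps 1-2: query initial papers per entity and per interaction.
// Steps 3-4: per-entity and per-interaction lists, each ranked.
// Step 5 (the "all together" 5iii option): one combined, ranked list.
function relatedPapersV2(doc, queryByEntity, queryByInteraction, semanticSort) {
  const dedupe = papers =>
    [...new Map(papers.map(p => [p.pmid, p])).values()];
  const entityHits = doc.entities.flatMap(queryByEntity);
  const intnHits = doc.interactions.flatMap(queryByInteraction);
  return {
    byEntity: new Map(
      doc.entities.map(e => [e, semanticSort(doc, dedupe(queryByEntity(e)))])
    ),
    byInteraction: new Map(
      doc.interactions.map(i => [i, semanticSort(doc, dedupe(queryByInteraction(i)))])
    ),
    forDocument: semanticSort(doc, dedupe([...intnHits, ...entityHits])),
  };
}
```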

jvwong commented 4 years ago

Another sort

I want to bring up an additional sort module that would help triage (possibly many more) articles.

Related paper workflows

What a 'named entity' sort entails

  1. Article representation

    • These would be a bag of normalized named entities. We could use John's suggestion, NCBI PubTator, which provides a list of entities for one or more PubMed IDs or PubMed Central IDs (full text):
    • gene (NCBI Gene ID)
    • chemical (MeSH)
    • disease (MeSH)
    • mutation (MeSH)
    • cell line (Cellosaurus)
  2. Similarity metric. The idea is to calculate the overlap in named entities.

    • Split the entities into two groups:
    • genes and chemicals, compared with the Jaccard Coefficient (JC)
    • other (disease, mutation, cell line), compared with the Overlap Coefficient (OC)

Notes: PubTator also has counts of individual mentions, which could inform weighting (e.g. an article that mentions SLC19A1 50 times).

  3. Sort / Filter. As simple as ordering or filtering based on the similarity score when a query article is compared to each candidate article. Comparisons could also be made pair-wise, effectively generating an 'enrichment-map'-like network of papers that could be traversed, etc.
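The two metrics are standard set-similarity measures and are cheap to compute over sets of entity IDs; a sketch:

```javascript
// Jaccard Coefficient: |A ∩ B| / |A ∪ B|
const jaccard = (a, b) => {
  const inter = [...a].filter(x => b.has(x)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : inter / union;
};

// Overlap Coefficient: |A ∩ B| / min(|A|, |B|)
const overlap = (a, b) => {
  const inter = [...a].filter(x => b.has(x)).length;
  const smaller = Math.min(a.size, b.size);
  return smaller === 0 ? 0 : inter / smaller;
};
```

Note that OC is more forgiving when one article has far fewer annotated entities than the other, which is why it suits the sparser groups (disease, mutation, cell line).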

Additional sources of articles

John also mentioned Semantic Scholar, which provides, for free, two other lists of articles for any given article:

maxkfranz commented 4 years ago

Those look like interesting approaches for the initial N-articles filter -- potentially replacing the chronological approach. We could experiment with them once we have the fundamental steps, (1) through (5), in place.

The additional sources may be useful for the idea of having multiple sections (or selections) of related papers, e.g.:

metincansiper commented 4 years ago

@metincansiper, do you think that this is a reasonable next implementation step?

@maxkfranz I tried running the query (https://db.indra.bio/statements/from_agents?offset=0&agent0=MDM2) and it returns some results. This query (https://db.indra.bio/statements/from_agents?offset=0&agent1=MDM2), where the agent1 parameter is set instead of agent0, also returns results.

I can ask the Indra team whether querying like that is okay.

metincansiper commented 4 years ago

I got a response from the Indra team: I can query the statements for a single entity as https://db.indra.bio/statements/from_agents?offset=0&agent=MDM2. Therefore, it looks doable. However, I guess we must decide which way to choose in 5(iii)?
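For reference, the single-agent query URL quoted above can be built like this (a trivial sketch; the helper name is hypothetical, the endpoint is the one from the thread):

```javascript
// Build an INDRA from_agents query URL for a single entity name.
const indraAgentUrl = (agent, offset = 0) =>
  `https://db.indra.bio/statements/from_agents?offset=${offset}&agent=${encodeURIComponent(agent)}`;
```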

maxkfranz commented 4 years ago

Therefore, it looks doable. However, I guess we must decide which way to choose in 5(iii)?

I think it's fine to just use the all-entities-together approach for (5iii), since it's closest to the existing approach and it's simpler.

Let's go with the simplest version of the steps, (1) through (5), and we can consider adding more sophisticated steps in a future version.

maxkfranz commented 4 years ago

Edited typo above: For (5iii), it's fine to just put all the papers from nodes and edges together in the same set.

jvwong commented 4 years ago

I feel like this one is done, and more specific issues that extend it now exist.