PathwayCommons / factoid

A project to capture biological pathway data from academic papers
https://biofactoid.org
MIT License

No related papers results #742

Closed maxkfranz closed 4 years ago

maxkfranz commented 4 years ago

I tried using the related papers feature by submitting a simple paper in a local instance. The instance points to the public instance of the semantic-search service.

Here is the paper: MeCP2 Facilitates Breast Cancer Growth via Promoting Ubiquitination-Mediated P53 Degradation by Inhibiting RPL5/RPL11 Transcription. The Pubmed metadata is populated, so this issue seems to be distinct from #738.

The document is just 'MDM2 inhibits P53 via binding'. I know that the paper would have more interactions, but this single interaction should be sufficient for testing.

[Screenshot: Screen Shot 2020-06-03 at 4 14 48 PM]

There are errors in the console:

```
info:    Searching the related papers table for document  c95d7143-f950-4662-a4ad-d05677734783
info:    PATCH /api/document/status/c95d7143-f950-4662-a4ad-d05677734783/743fdc49-a8c6-4f81-a90d-6ae44bb7df49 500 251.431 ms - 142
[object Object]
```

And doc.relatedPapers() is undefined.
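The `[object Object]` in the log usually means an Error (or a plain object) was concatenated into a string. A minimal sketch of a safer formatter, assuming a generic logger (names hypothetical):

```javascript
// Hypothetical helper: serialize unknown error values before logging,
// so an Error's stack (or an object's JSON) survives string conversion.
function formatError(err) {
  if (err instanceof Error) {
    return err.stack || err.message; // keep the stack trace when available
  }
  try {
    return JSON.stringify(err); // plain objects become readable JSON
  } catch {
    return String(err); // last resort for circular or exotic values
  }
}
```

Something like `logger.error(formatError(err))` would then avoid the opaque `[object Object]` output.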

jvwong commented 4 years ago

I could reproduce the missing relatedPapers last night but this morning it works fine? Can you try again?

There is some somewhat related refactoring that needs to be done (in other PRs):

jvwong commented 4 years ago

Just a sec, the error is specific to your example with INDRA: human mdm2 and p53...

jvwong commented 4 years ago

My best guess is that the semantic-search instance isn't scaling well with the number of papers (aka request body documents) it is asked to process.

I was trying three cases:

jvwong commented 4 years ago

Tried to run semantic-search locally, and Activity Monitor showed it eating up nearly all of the CPU.

maxkfranz commented 4 years ago

How about we use a simple filter after the indra stage, e.g. N most recent articles?

If the related papers feature is for discovery, you want papers that you haven't seen yet. Unless you're already spending a lot of time and effort keeping on top of the latest research, there are probably several new and interesting papers you haven't come across.

So a chronological filter is probably fine. We don’t need the feature to show every paper that’s interesting: It just needs to show some papers that are interesting.

If we want to optimise for the large-N edge cases to allow for older papers too, we could address that later.

I’m OK with using a chronological filter, unless someone has in mind an alternate simple metric that would work better. Any ideas?
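A chronological pre-filter like the one proposed could be sketched as follows (a sketch only; the `pubDate` field and function name are hypothetical):

```javascript
// Keep only the N most recently published candidate papers before
// handing them off to semantic search for ranking.
function mostRecentN(papers, n) {
  return papers
    .slice() // copy so the caller's array is not mutated
    .sort((a, b) => new Date(b.pubDate) - new Date(a.pubDate)) // newest first
    .slice(0, n);
}
```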


metincansiper commented 4 years ago

How about we use a simple filter after the indra stage, e.g. N most recent articles?

Sounds good. We had a flag for sorting by date. I think we would not need to do any filtering when that flag is set. However, when it is unset (by default) we can filter the N most recent articles first then can run the semantic search on these N articles as you mentioned.

BTW we also had a score threshold and we were not including the papers below the threshold. I think we can keep doing it.
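Putting the flag, the recency pre-filter, and the score threshold together might look like this (all names and defaults are hypothetical; `search` stands in for the semantic-search call):

```javascript
// When sortByDate is set, return candidates newest-first with no filtering.
// Otherwise: take the N most recent, rank them with semantic search, and
// drop results scoring below the threshold.
function rankCandidates(doc, papers, opts, search) {
  const { sortByDate = false, n = 100, minScore = 0.5 } = opts || {};
  const byDate = [...papers].sort(
    (a, b) => new Date(b.pubDate) - new Date(a.pubDate)
  );
  if (sortByDate) return byDate;
  return search(doc, byDate.slice(0, n)).filter(r => r.score >= minScore);
}
```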

jvwong commented 4 years ago

Some more direct testing of the semantic-search instances, using curl and Insomnia (libcurl/7.69.1).

```
> POST / HTTP/2
> Host: semanticsearch.baderlab.org
> user-agent: insomnia/2020.2.1
> content-type: application/json
> accept: application/json
> content-length: 142247

* HTTP/2 stream 0 was not closed cleanly: Unknown error code (err 1)
* stopped the pause stream!
```

Similarly with curl: `curl: (92) HTTP/2 stream 0 was not closed cleanly: PROTOCOL_ERROR (err 1)`

jvwong commented 4 years ago

Full docker setup running @ https://master.factoid.baderlab.org

Submit: OK for > 1000 documents (~1.6s/doc)
/api/document/related-papers/:id: Gateway Timeout
maxkfranz commented 4 years ago

Is the VM overloaded? It could be that more CPU and RAM should be allocated.

On Jun 5, 2020, at 1:44 PM, Jeffrey wrote:

Full docker setup running @ https://master.factoid.baderlab.org
Submit: OK for > 1000 documents (~1.6s/doc)
/api/document/related-papers/:id: Gateway Timeout

jvwong commented 4 years ago

Yeah, this thing is CPU-bound:

| IP | Processor model name | # cores | RAM (MB) |
| --- | --- | --- | --- |
| X.X.X.174 (dev) | Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz | 4 | 32,175 |
| X.X.X.195 (prod) | Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz | 2 | 128,942 |
maxkfranz commented 4 years ago

It's worth a shot to try increasing the number of cores per VM.

jvwong commented 4 years ago

OK I can ask JL.

maxkfranz commented 4 years ago

I tried some other queries, and I just get empty results (i.e. [ ]). There may be issues other than server stress

maxkfranz commented 4 years ago

I think we should put together a small set of test cases to make sure we're all on the same page with our testing. A test case should always use the same paper and the same network topology. Two cases to start is probably enough to catch most issues:

(1) Minimum: The document has two entities and one interaction. The list of papers from Indra is small.

(2) Large: The document has a larger number of interactions, perhaps three. The list of papers from Indra is large. Maybe this is a p53 paper.

jvwong commented 4 years ago

This is the test suite I've been using. I can email you the whole google sheet.

Table 1. Molecular Cell case studies. Participants were entered as free text; grounded identifiers are shown in parentheses (NCBI Gene unless noted).

| Species | Pathway | Participant (Source) | Participant (Target) | Type | PubMed | INDRA statements | Total evidence |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Homo sapiens (9606) | Antiviral stimulator of interferon genes (STING) | SLC19A1 (6573) | 2'3'-cGAMP (ChEBI 75947) | Binding | https://www.ncbi.nlm.nih.gov/pubmed/31126740 | 0 | 0 |
| Homo sapiens | Mitochondrial metabolic adaption | SENP1 (29843) | Sirt3 (23410) | De-SUMOylation (Activation) | https://www.ncbi.nlm.nih.gov/pubmed/31302001 | 1 | 1 |
| Homo sapiens | Histone gene regulation | HAT1 (8520) | HIST1H4E (8367) | Transcription; Binding; Modification | https://www.ncbi.nlm.nih.gov/pubmed/31278053 | 0 | 0 |
| Mus musculus (10090) | MYC-E2F cell cycle | MYC (17869) | E2F1 (13555) | Transcription | https://www.ncbi.nlm.nih.gov/pubmed/21292160 | 18 | 184 |
| Homo sapiens | Interferon signaling | DAPK1 (1612) | RIG-I (23586) | Phosphorylation/inhibition | https://www.ncbi.nlm.nih.gov/pubmed/28132841 | 5 | 11 |
| Homo sapiens | DNA Repair | PRMT5 (10419) | RUVBL1 (8607) | Methylation | https://www.ncbi.nlm.nih.gov/pubmed/28238654 | 4 | 21 |
| Homo sapiens | Autophagy | PGK1 (5230) | Beclin1 (8678) | Phosphorylation/inhibition | https://www.ncbi.nlm.nih.gov/pubmed/28238651 | 10 | 65 |
| Arabidopsis | Cold response | CRPK1 (830909) | 14-3-3 lambda (838236) | Phosphorylation/activation | https://www.ncbi.nlm.nih.gov/pubmed/28344081 | 0 | 0 |
| Homo sapiens | Glycolysis | mTORC1 (2475, MTOR) | GSK3B (2932) | Phosphorylation/inhibition | https://www.ncbi.nlm.nih.gov/pubmed/29861159 | 0 | 0 |
| Homo sapiens | Anoikis and metastases | PLAG1 (5324) | GDH1 (854557) | Transcription | https://www.ncbi.nlm.nih.gov/pubmed/29249655 | 2 | 2 |
| Homo sapiens | Warburg effect | lincRNA-p21 (102800311) | HIF-1α (3091) | Stabilization | https://www.ncbi.nlm.nih.gov/pubmed/24316222 | 3 | 7 |
| Homo sapiens | Microenvironment Remodeling | CamK-A (85451) | PNCK (8536) | Binding/Activation | https://www.ncbi.nlm.nih.gov/pubmed/30220561 | 1 | 1 |
| Homo sapiens | DNA recombination | REC114 (283677) | ANKRD31 (256006) | Binding | https://www.ncbi.nlm.nih.gov/pubmed/31003867 | 0 | 0 |
| Homo sapiens | Nutrition starvation | PHF5A (84844) | KDM3A (55818) | Expression/Activation | https://www.ncbi.nlm.nih.gov/pubmed/31054974 | 1 | 1 |

Table 2. Classic pathways across model organisms. Participants were entered as free text; NCBI Gene identifiers are shown in parentheses.

| Species | Pathway | Participant (Source) | Participant (Target) | Type | PubMed | INDRA statements | Total evidence |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Homo sapiens | Transforming Growth Factor beta signalling | TGF-beta 1 (7046) | Smad2 (4087) | Phosphorylation | https://www.ncbi.nlm.nih.gov/pubmed/9311995 | 5 | 842 |
| Drosophila melanogaster | Transforming Growth Factor beta signalling | Babo (35900) | dSmad2 (31738) | Phosphorylation | https://www.ncbi.nlm.nih.gov/pubmed/9887103 | 2 | 2 |
| Saccharomyces cerevisiae | Mating-pheromone response | Ste5 (851680) | Ste11 (851076) | Complex | https://www.ncbi.nlm.nih.gov/pubmed/8062390 | 19 | 70 |
| Escherichia coli | SOS Response | lexA (948544) | recA (947170) | Repression | https://www.ncbi.nlm.nih.gov/pubmed/7027256 | 9 | 201 |
| Caenorhabditis elegans | Insulin-like signalling | SGK-1 (181697) | DAF-16 (172981) | Phosphorylation | https://www.ncbi.nlm.nih.gov/pubmed/15068796 | 5 | 31 |
| Arabidopsis thaliana | Ethylene Signaling | CTR1 (831748) | EIN2 (831889) | Phosphorylation | https://www.ncbi.nlm.nih.gov/pubmed/23132950 | 10 | 96 |
| Danio rerio | p53 apoptosis | p53 (30590) | bcl2L (114401) | Transcription | https://www.ncbi.nlm.nih.gov/pubmed/19204115 | 0 | 0 |
maxkfranz commented 4 years ago

Looks good. Do all of these cases correspond to documents with a single interaction?

It looks like Indra doesn't give good results in several cases. For example, I get a better result for the first case by searching the Pubmed API with SLC19A1 and cGAMP.

There are several of your cases for which Indra returns no results or very few results. The related papers feature wouldn't be very compelling if we show zero or one papers a lot of the time. Some of those failures may be on Indra, but I suspect that several are due to how we're using it.

Our approach is this:

  1. Get a list of papers by querying Indra: Query for papers that have at least one of the interactions within the factoid.
  2. Sort the list of papers using Semantic Search.

Is it truly reasonable to do a network-based query to get the initial set of papers? It seems as though using network overlap may be too strict. If my paper has a novel interaction, for example, I'm always going to get no related papers for my factoid.
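The current two-step flow could be sketched like this (a sketch under stated assumptions: `queryIndra` and `semanticSort` are placeholders for the real service calls, and papers are keyed by `pmid`):

```javascript
// Step 1: one Indra query per interaction; union the hits, de-duplicated.
// Step 2: semantic search ranks the combined candidate set against the doc.
function relatedPapersForDocument(doc, queryIndra, semanticSort) {
  const seen = new Map();
  for (const intn of doc.interactions) {
    for (const paper of queryIndra(intn)) {
      seen.set(paper.pmid, paper); // de-duplicate across interactions
    }
  }
  return semanticSort(doc, [...seen.values()]);
}
```

Framed this way, the critique is about step 1: if an interaction is novel, `queryIndra` returns nothing and there is nothing for step 2 to rank.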

metincansiper commented 4 years ago

Shall I go ahead and implement filtering to the N most recent papers before running the semantic search, as @maxkfranz suggested?

jvwong commented 4 years ago

It's worth a shot to try increasing the number of cores per VM.

The short story is, semanticsearch.baderlab.org works fine now with large numbers of docs (at least 1009 for p53-mdm2)...to a limit.

Long story, problems running things through the URL:

I also got JL to increase the number of cores to 4 now. This wasn't actually the problem, as you can see.

I guess you can see this poses a problem if this service gets called frequently or with large numbers of papers. The p53-mdm2 example takes around 16 minutes, or ~0.9 s per doc. Any cron job hitting this endpoint would also have to be considered.

maxkfranz commented 4 years ago

Shall I go ahead and implement filtering to the N most recent papers before running the semantic search, as @maxkfranz suggested?

That sounds good. That feature will be useful to have now, and it will continue to be useful even if we make other improvements to the system in the future.

maxkfranz commented 4 years ago

All right. Now the sorting should be more reliable. The next problem to solve is the lack of initial results from Indra: Much of the time, we get one result or no results. In order to have an effective recommendation system, there should be several (e.g. six or more) results.

While the current results from Indra have high signal, they are insufficient alone. We need to cast a wider net to get a bigger list of initial papers. Some potential approaches for this may be:

Another consideration is that we should support relatedPapers data for each entity in addition to each interaction. Otherwise, we don't have much of value to show when a user clicks an entity in the explore view. This reinforces the need to query for related papers by entity.

So the new process would be something like this, with new steps in bold:

  1. New: Query for a list of initial papers by the list of entities in the factoid -- one query per entity.
  2. Query for a list of initial papers by the list of interactions in the factoid -- one query per interaction.
  3. New: The list of relatedPapers for an entity is the set for that entity from (1) sorted by semantic search.
  4. The list of relatedPapers for an interaction is the set for that interaction from (2) sorted by semantic search.
  5. The list of relatedPapers for a document is:
    1. The set of all initial papers for all interactions, sorted by semantic search.
    2. New: The set of all initial papers for all entities, sorted by semantic search.
    3. New: (5i) and (5ii) may be done separately, with the interaction results listed before the entity results, or together, if we allow semantic search to rank all of the results rather than prioritising interaction results.

@metincansiper, do you think that this is a reasonable next implementation step? @metincansiper & @jvwong, do you have suggestions for how this process could be improved? The steps above would put more load on semantic search, so we may need to give its stability more consideration.
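A minimal sketch of the proposed steps (1) through (5), with all names hypothetical and the entity/interaction query functions injected as placeholders:

```javascript
// Steps 1-2: query initial papers per entity and per interaction.
// Steps 3-4: per-entity and per-interaction lists, each ranked.
// Step 5 (the "all together" 5iii option): one combined, ranked list.
function relatedPapersV2(doc, queryByEntity, queryByInteraction, semanticSort) {
  const dedupe = papers =>
    [...new Map(papers.map(p => [p.pmid, p])).values()];
  const entityHits = doc.entities.flatMap(queryByEntity);
  const intnHits = doc.interactions.flatMap(queryByInteraction);
  return {
    byEntity: new Map(
      doc.entities.map(e => [e, semanticSort(doc, dedupe(queryByEntity(e)))])
    ),
    byInteraction: new Map(
      doc.interactions.map(i => [i, semanticSort(doc, dedupe(queryByInteraction(i)))])
    ),
    forDocument: semanticSort(doc, dedupe([...intnHits, ...entityHits])),
  };
}
```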

jvwong commented 4 years ago

Another sort

I want to bring up an additional sort module that would help triage (possibly many more) articles.

Related paper workflows

What a 'named entity' sort entails

  1. Article representation

    • These would be a bag of normalized named entities. We could use John's suggestion, NCBI PubTator, which provides a list of entities for one or more PubMed IDs or PubMed Central IDs (full text):
    • gene (NCBI Gene ID)
    • chemical (MeSH)
    • disease (MeSH)
    • mutation (MeSH)
    • cell line (Cellosaurus)
  2. Similarity metric. The idea is to calculate the overlap in named entities.

    • Split the entities into two groups:
    • genes and chemicals, compared with the Jaccard Coefficient (JC)
    • other (disease, mutation, cell line), compared with the Overlap Coefficient (OC)

Notes: PubTator also has counts of individual mentions, which could inform weighting (e.g. an article that mentions SLC19A1 50 times).

  3. Sort / Filter. As simple as ordering or filtering based on the similarity score when a query article is compared to each candidate article. Comparisons could also be made pair-wise, effectively generating an 'enrichment-map'-like network of papers that could be traversed, etc.
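The two metrics are standard set-similarity measures and are cheap to compute over sets of entity IDs; a sketch:

```javascript
// Jaccard Coefficient: |A ∩ B| / |A ∪ B|
const jaccard = (a, b) => {
  const inter = [...a].filter(x => b.has(x)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : inter / union;
};

// Overlap Coefficient: |A ∩ B| / min(|A|, |B|)
const overlap = (a, b) => {
  const inter = [...a].filter(x => b.has(x)).length;
  const smaller = Math.min(a.size, b.size);
  return smaller === 0 ? 0 : inter / smaller;
};
```

Note that OC is more forgiving when one article has far fewer annotated entities than the other, which is why it suits the sparser groups (disease, mutation, cell line).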

Additional sources of articles

John also mentioned Semantic Scholar, which provides, for free, two other lists of articles for any given article:

maxkfranz commented 4 years ago

Those look like interesting approaches for the initial N-articles filter -- potentially replacing the chronological approach. We could experiment with them once we have the fundamental steps, (1) through (5), in place.

The additional sources may be useful for the idea of having multiple sections (or selections) of related papers, e.g.:

metincansiper commented 4 years ago

@metincansiper, do you think that this is a reasonable next implementation step?

@maxkfranz I tried running the query (https://db.indra.bio/statements/from_agents?offset=0&agent0=MDM2) and it returns some results. This query (https://db.indra.bio/statements/from_agents?offset=0&agent1=MDM2), where the agent1 parameter is set instead of agent0, also returns results.

I can ask the Indra team whether querying like that is okay.

metincansiper commented 4 years ago

I got a response from the Indra team: I can query the statements for a single entity as https://db.indra.bio/statements/from_agents?offset=0&agent=MDM2. Therefore, it looks doable. However, I guess we must decide which way to choose in 5(iii)?
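For reference, the single-agent query URL quoted above can be built like this (a trivial sketch; the helper name is hypothetical, the endpoint is the one from the thread):

```javascript
// Build an INDRA from_agents query URL for a single entity name.
const indraAgentUrl = (agent, offset = 0) =>
  `https://db.indra.bio/statements/from_agents?offset=${offset}&agent=${encodeURIComponent(agent)}`;
```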

maxkfranz commented 4 years ago

Therefore, it looks doable. However, I guess we must decide which way to choose in 5(iii)?

I think it's fine to just use the all-entities-together approach for (5iii), since it's closest to the existing approach and it's simpler.

Let's go with the simplest version of the steps, (1) through (5), and we can consider adding more sophisticated steps in a future version.

maxkfranz commented 4 years ago

Edited typo above: For (5iii), it's fine to just put all the papers from nodes and edges together in the same set.

jvwong commented 4 years ago

I feel like this one is done, and more specific issues that extend it now exist.