cannin / enhance_nlp_interaction_network_gsoc2020


Retrieve Information for Reactome Failed Searches #7

Open cannin opened 4 years ago

cannin commented 4 years ago

The table with MeSH terms was very helpful. We will continue connecting the pieces you have created.

Query Input

This table has queries for which there was no information in Reactome (i.e., failed queries): https://raw.githubusercontent.com/cannin/reach-query/master/queries.csv

Search Query

Take the first 10 QUERY terms and retrieve PubMed articles with the query string:

"QUERY" AND hasabstract

Output

Create the same table as in #6. Add additional columns from PubMed: DOI, citation information, and INDRA statement count. Please use XPath expressions in your code where you can.

XPath expression

//PubmedData/ArticleIdList/ArticleId[@IdType="doi"]
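A small sketch of applying this XPath to the efetch XML with lxml (the helper name is a placeholder):

import requests
from lxml import etree

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def get_doi(pmid):
    # Fetch the PubMed record as XML and extract the DOI with the XPath above
    resp = requests.get(EFETCH, params={"db": "pubmed", "id": pmid, "retmode": "xml"})
    resp.raise_for_status()
    root = etree.fromstring(resp.content)
    dois = root.xpath('//PubmedData/ArticleIdList/ArticleId[@IdType="doi"]/text()')
    return dois[0] if dois else None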

Citation information

Add another column with citation information (https://www.ncbi.nlm.nih.gov/pmc/tools/cites-citedby/) called PMC_CITATION_COUNT

Example EUtils Call

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&linkname=pubmed_pmc_refs&id=21876726&id=21876761&tool=my_tool&email=my_email@example.com
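PMC_CITATION_COUNT is then the number of Link entries in each LinkSet of the response; a sketch (the helper name is a placeholder):

import requests
from lxml import etree

ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"

def pmc_citation_counts(pmids):
    # One &id= parameter per PMID yields one LinkSet per PMID in the response
    params = [("dbfrom", "pubmed"), ("linkname", "pubmed_pmc_refs")]
    params += [("id", p) for p in pmids]
    resp = requests.get(ELINK, params=params)
    resp.raise_for_status()
    root = etree.fromstring(resp.content)
    counts = {}
    for linkset in root.findall("LinkSet"):
        pmid = linkset.findtext("IdList/Id")
        counts[pmid] = len(linkset.findall("LinkSetDb/Link"))
    return counts

# e.g. pmc_citation_counts(["21876726", "21876761"]) -> {"21876726": N, ...}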

INDRA Count

Add counts for INDRA column called INDRA_STATEMENT_COUNT. Sample code below:

import os
from indra.sources import indra_db_rest
from indra.assemblers.html.assembler import HtmlAssembler

# Point INDRA at the DB REST service (URL shared by email)
os.environ["INDRA_DB_REST_URL"] = "SEE_EMAIL"

# Retrieve all INDRA statements extracted from the given paper
processor = indra_db_rest.get_statements_for_paper([('pmid', '28642194')])
statements = processor.statements

# len(statements) gives the value for INDRA_STATEMENT_COUNT
print(len(statements))

# Statements can be rendered for inspection with the HTML assembler
ha = HtmlAssembler(statements)
cannin commented 4 years ago

@PritiShaw limit yourself to at most 10 PMIDs per QUERY term; make this an easily set parameter.

PritiShaw commented 4 years ago

Thanks, mentor.

https://raw.githubusercontent.com/cannin/reach-query/master/queries.csv

In queries.csv there is a term ???; is it a valid term, or should I ignore it?

PritiShaw commented 4 years ago

Hi mentor, I have added the columns in the TSV file. You can find the output and the code in the gist. The citation counts and INDRA statement counts are quite low; did you expect such output?

Thanks

cannin commented 4 years ago

I did not know what to expect; I will need to look.

cannin commented 4 years ago

@PritiShaw see Gist for comments

cannin commented 4 years ago

What does this mean? Do you have a special package to fill this in, or is it done manually?

os.environ["INDRA_DB_REST_URL"] = "***********"
cannin commented 4 years ago

@PritiShaw thanks for the update. How long does this take to run? Can you set this up to run as long as possible on Google Colab (12-hour limit)? Save it frequently to somewhere else; I've used the Python Dropbox package before, but even pushing to the Gist would be good.
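For the frequent saves, a minimal checkpointing pattern could look like this (process_term, query_terms, the paths, and the interval are all hypothetical placeholders):

import shutil

CHECKPOINT_EVERY = 10  # hypothetical interval

for i, term in enumerate(query_terms, start=1):
    process_term(term, out_path="output.tsv")  # hypothetical per-term step
    if i % CHECKPOINT_EVERY == 0:
        # copy the partial TSV somewhere durable (Dropbox, a Gist, Drive, ...)
        shutil.copy("output.tsv", "backup/output.tsv")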

PritiShaw commented 4 years ago
os.environ["INDRA_DB_REST_URL"] = "***********"

I manually removed the API URL before making the gist.

I have made a repository with the code; you can find the output file here (same as the gist, updated). For MeSH extraction, in this repository I have written the steps using the web interface for batch extraction instead of the Java code, for simplicity.

> How long does this take to run?

It took around 10 minutes to process 50 queries. The following two steps take the most time (1 > 2):

  1. MeSH extraction: the time taken depends on the NLM server's performance
  2. Extracting journal information using EUtils: a Too Many Requests error is received, hence a waiting time is added

Both can be run in parallel as they are independent, but the last step requires the output files generated by both of these steps. Hence, to run at large scale, we will have to split the input list of query terms into chunks and proceed accordingly, e.g. as sketched below. Should I start making the script for that?
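A minimal sketch of that chunking step (the chunk size and the per-chunk driver are placeholders):

def chunks(items, size=100):
    # Yield successive fixed-size slices of the query term list
    for i in range(0, len(items), size):
        yield items[i:i + size]

for batch in chunks(query_terms, size=100):
    process_batch(batch)  # hypothetical driver running both steps on one chunk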

Thanks

cannin commented 4 years ago

@PritiShaw Yes, please. I'm finding the results interesting for those that are not zero for INDRA. I'm not sure what the status of the INDRA text link PR is, but how could I use your code easily?

PritiShaw commented 4 years ago

> @PritiShaw Yes, please. I'm finding the results interesting for those that are not zero for INDRA. I'm not sure what the status of the INDRA text link PR is, but how could I use your code easily?

Sure, I will start working on it.

Regarding the "Scroll to text fragment" PR in INDRA, it was merged today :) https://github.com/sorgerlab/indra/pull/1120 (Documentation). This code should help you (you gave me this code yourself 😅).

I have started processing every term; output tsv (in progress).

I have also added the columns from issues #9 and #10.

PritiShaw commented 4 years ago

So far, 170 terms have been processed. The average time taken per 10 terms is 7.5 minutes.

There are 101,648 terms in total, therefore:

Total time = 101648 * 0.75 minutes
           = 76236 minutes
           = 1270.6 hours
           = 53 days

I have used parallel processing for MeSH term extraction and for getting metadata from EUtils.
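Roughly, the parallel part looks like this with concurrent.futures (the worker count and the two fetch functions are simplified placeholders for my actual steps; a real run also needs the EUtils rate-limit waits mentioned above):

from concurrent.futures import ThreadPoolExecutor

# The two steps are independent, so each can fan out over its own thread pool
with ThreadPoolExecutor(max_workers=4) as pool:
    mesh_results = list(pool.map(fetch_mesh_terms, pmids))          # hypothetical
    eutils_metadata = list(pool.map(fetch_eutils_metadata, pmids))  # hypothetical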

cannin commented 4 years ago

@PritiShaw Thanks. It is an excessively long list. What if we keep it to >= 10 hits for now? Also, I'm not sure if the list I gave you is properly sorted. 

cannin commented 4 years ago

@PritiShaw I quickly looked through the output it looks good. Thanks.

PritiShaw commented 4 years ago

> @PritiShaw Thanks. It is an excessively long list. What if we keep it to >= 10 hits for now? Also, I'm not sure if the list I gave you is properly sorted.

No, the CSV is not sorted. I have now sorted it manually (descending hits), and I am processing only those terms with >= 10 hits, as of commit https://github.com/PritiShaw/Reactome-Failed-Queries-Processing/commit/a8d5dbacd13769667e5a2e8c154553f9cf8bdc2c

I am appending to the previous TSV file itself; should I restart?

There are around 4,520 such terms; the average so far is 100 s per term.

Thanks

cannin commented 4 years ago

You can restart; this is all still a prototype, and the past output is still valuable. How many articles do you process per query term: everything that PubMed returns, or do you have a limit? For example, "ELK4" has 24 entries, but PubMed has 116 results. If you are not processing every paper that has an abstract, can you add another column PMID_COUNT with the result count from PubMed?

PritiShaw commented 4 years ago

> You can restart; this is all still a prototype, and the past output is still valuable. How many articles do you process per query term: everything that PubMed returns, or do you have a limit? For example, "ELK4" has 24 entries, but PubMed has 116 results. If you are not processing every paper that has an abstract, can you add another column PMID_COUNT with the result count from PubMed?

There is no limit set; all PMIDs with an abstract are processed.

I will restart and let you know.

PritiShaw commented 4 years ago

After the restart, 200 terms have been processed; the average time per term is 1.5 min. There are around 5,690 terms remaining in total; estimated additional time required: 5 days. Old output / New output

cannin commented 4 years ago

Did you add the PMID_COUNT column?

PritiShaw commented 4 years ago

Processing of terms with hits >= 10 has been completed. Output

The code has been packaged as a Docker image; I am adding tests now.