Closed lakikowolfe closed 2 years ago
medium importance with unclear timeline
Timeline for completion would need to be by the end of the funding cycle, July 29th
porTools functions
syndccutils functions
There's a lot going on in here that isn't pubmed. Most of it seems related to synapse querying / project building with some random basic python utilities
Two main Pubmed query functions:
Given a list of grant numbers with associated synapse metadata: consortium synapse ID and grant sub-type, scrapes
pubMed for each grant's publication and retrieves simple information such as publication title, year, and authors.
It also checks if any GEO data has been produced by the publication study. If so, then it saves the GEO html
links in a comma separated list. Per each publication, there will be a row in the final dataframe/synapse table
that maps back to the grant number and consortium synapse ID(i.e, the Key of this table is the PubMed column).
:param pubmedIds: List of pubmed Ids
:param consortiumGrants: List of grants
:param consortiumView: File-view or table holding projects grant annotations
:param consortiumName: Consortium name ex. csbc
Given a list of grant numbers pulled from a synapse table column, utilizes a pubmed API to generate a search query.
This query is constructed by the union ('or' logic) of all the grant numbers, which would aid in pulling down a list
of all PubMed publication id's associated with the grants. Then it will go through the PubMed id's and scrape the
publication for basic informative information.
:param args: User defined arguments
:param syn: A logged in synapse object
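For reference, the "union of all the grant numbers" step described in the docstring could look roughly like the sketch below. This is not taken from syndccutils: the helper name `build_grant_query` and the placeholder grant numbers are made up for illustration, and using PubMed's `[Grant Number]` field tag is an assumption about how the search term is formed.

```python
def build_grant_query(grant_numbers):
    """Join grant numbers into one PubMed search term using OR logic.

    Hypothetical helper, not part of syndccutils.
    """
    seen = set()
    terms = []
    for grant in grant_numbers:
        if grant is None:
            continue
        grant = str(grant).strip()
        # Skip blanks and duplicates while keeping the original order.
        if grant and grant not in seen:
            seen.add(grant)
            terms.append(f"{grant}[Grant Number]")
    return " OR ".join(terms)


# Placeholder grant numbers, not real CSBC grants.
query = build_grant_query(["CA000001", "CA000002", "", "CA000001"])
print(query)
```

The resulting string could then be handed to an Entrez search call (for example Biopython's `Entrez.esearch(db="pubmed", term=query)`) to get back the list of PubMed IDs.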
Pubmed pull is exactly the same (pubmed ID, author, title, pub date, journal).
checks for a GEO submission and saves GEO html links if they exist
@milen-sage this is what I've gathered after going through syndccutils/__main__.py
@lakikowolfe thanks!
Were you able to set up syndccutils on your machine? Once it's set up, this is a command that would gather publications for CSBC, for future reference:
syndccutils pubmed --projectId syn7080714 --tableId syn20938140 --name CSBC
You might need access to these synapse resources to test it out.
I'll be working on this today! Will let you know if I hit any Synapse blocks
~Blocker : Can't access syn20938140~
Given access by Milen this morning.
Reran the command from above.
Traceback (most recent call last):
File "/Users/lwolfe/Documents/work/ibc-fair/.misc_venv/lib/python3.9/site-packages/syndccutils-1.1.0-py3.9.egg/syndccutils/__main__.py", line 1192, in performMain
File "/Users/lwolfe/Documents/work/ibc-fair/.misc_venv/lib/python3.9/site-packages/syndccutils-1.1.0-py3.9.egg/syndccutils/__main__.py", line 433, in pubmed
File "/Users/lwolfe/Documents/work/ibc-fair/.misc_venv/lib/python3.9/site-packages/syndccutils-1.1.0-py3.9.egg/syndccutils/__main__.py", line 159, in getGrantList
File "/Users/lwolfe/Documents/work/ibc-fair/.misc_venv/lib/python3.9/site-packages/pandas-1.4.2-py3.9-macosx-10.9-x86_64.egg/pandas/core/generic.py", line 5575, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'grantNumber'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/lwolfe/Documents/work/ibc-fair/.misc_venv/bin/syndccutils", line 33, in <module>
sys.exit(load_entry_point('syndccutils==1.1.0', 'console_scripts', 'syndccutils')())
File "/Users/lwolfe/Documents/work/ibc-fair/.misc_venv/lib/python3.9/site-packages/syndccutils-1.1.0-py3.9.egg/syndccutils/__main__.py", line 1204, in main
File "/Users/lwolfe/Documents/work/ibc-fair/.misc_venv/lib/python3.9/site-packages/syndccutils-1.1.0-py3.9.egg/syndccutils/__main__.py", line 1194, in performMain
AttributeError: 'Namespace' object has no attribute 'debug'
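Side note on the second exception: it suggests the error handler reads `args.debug` on a Namespace that never had a `debug` attribute defined. I haven't checked how syndccutils builds its parser, so the sketch below is a generic argparse illustration (the `example` program name and `is_debug` helper are hypothetical) of two ways to avoid that kind of failure: define `--debug` on the top-level parser, or read the attribute defensively.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="example")
    # Defined on the top-level parser, so the Namespace produced for any
    # subcommand carries a `debug` attribute (False unless the flag is given).
    parser.add_argument("--debug", action="store_true")
    subparsers = parser.add_subparsers(dest="command")
    pubmed = subparsers.add_parser("pubmed")
    pubmed.add_argument("--name")
    return parser


def is_debug(args):
    # Defensive read: falls back to False if the attribute was never defined.
    return getattr(args, "debug", False)


args = build_parser().parse_args(["pubmed", "--name", "CSBC"])
print(is_debug(args))
print(is_debug(argparse.Namespace()))
```

Either approach keeps the original error (here, the missing `grantNumber` column) from being masked by a second AttributeError.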
Looks like the first error is occurring in getGrantList.
def getGrantList(syn, tableSynId):
    """
    Get's the column containing grant numbers, drops the empty cells if any, and returns a list of grant numbers.
    :param syn: A logged in synapse object
    :param tableSynId: File-view or table holding projects grant annotations
    :return:
    """
    consortiumGrants = syn.tableQuery("select * from %s" % tableSynId)
    consortiumGrants = consortiumGrants.asDataFrame()
    consortiumGrants = list(consortiumGrants.grantNumber.dropna())
    return consortiumGrants
line 159 - consortiumGrants = list(consortiumGrants.grantNumber.dropna())
Based on my understanding of the pubmed code chunk below (starts at line 437):
if args.grantviewId is not None:
    grantviewId = args.grantviewId
else:
    grantviewId = "syn10142562"
If grantviewId is not provided it defaults to syn10142562. This table doesn't have the grantNumber attribute that getGrantList is looking for.
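One way to make this failure easier to diagnose would be to check for the column before reading it. The sketch below is only an illustration, not a patch to syndccutils: `extract_grant_numbers` is a hypothetical helper that takes the DataFrame returned by `asDataFrame()`, and the sample values are placeholders.

```python
import pandas as pd


def extract_grant_numbers(df, column="grantNumber"):
    """Return the non-empty values of `column`, with a clear error if it is absent."""
    if column not in df.columns:
        raise KeyError(
            f"Column '{column}' not found; available columns: {list(df.columns)}"
        )
    return list(df[column].dropna())


# Placeholder data standing in for a grant view.
view = pd.DataFrame({"grantNumber": ["CA000001", None, "CA000002"]})
print(extract_grant_numbers(view))
```

With a check like this, pointing the command at a table without `grantNumber` would report the columns that are available instead of a bare AttributeError.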
@milen-sage you mentioned that the grant table might have changed/moved. Do you know if there is a grant table other than syn10142562 to point to?
@brynnz22 and @vpchung we're revisiting the pubmed crawling script used in the past for CSBC. Do you know what is the current table/view containing grant information that we can point to? See @lakikowolfe question above. Thanks!
The latest pubmed crawler currently uses this table to query for related publications.
One thing to note - our last meeting with NCI has revealed that we may be missing up to 20 or so grants at the moment 😅
Thanks @vpchung! We are just interested in the kinds of attributes the crawler returns, and particular grants/publications being omitted will be fine for now.
Would it help if I provide what a sample final output would be? I am also more than happy to do a code walk with @lakikowolfe if that will help speed things up as well.
yeah that would be helpful! @milen-sage do we need anything other than the final output?
@vpchung Can I be invited to the walkthrough? This would be helpful for me too!
Just the final output would be good for now; and I don't need to be a bottleneck for scheduling :)
Thanks @vpchung !
mc2-center/pubmed-crawler is a refactored version of the originally linked syndccutils script
porTools: does not pull information for pmids that already have been collected
porTools in that it is highly specific to group using it - would take some work to refactor
As a group we had a few questions:
Brynn and Verena please let me know if I missed anything!
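To make the "does not pull information for pmids that already have been collected" point concrete, here is a minimal sketch of that kind of filter. The thread doesn't show how pubmed-crawler actually implements it, so the `new_pmids` helper and the sample IDs are hypothetical.

```python
def new_pmids(found, already_collected):
    """Return PMIDs from `found` that are not in `already_collected`, in order."""
    # Compare as strings so "123" and 123 are treated as the same PMID.
    known = {str(p) for p in already_collected}
    fresh = []
    for pmid in found:
        pmid = str(pmid)
        if pmid not in known:
            known.add(pmid)  # also guards against duplicates within `found`
            fresh.append(pmid)
    return fresh


# Placeholder PMIDs for illustration only.
print(new_pmids(["111", "222", 333, "222"], [222]))
```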
That's the gist of it! In addition to what @lakikowolfe mentioned, the MC2 pubmed-crawler also captures the journal, DOI, keywords, and MeSH terms. For related databases, we specifically look for GSE, SRP, and dbGaP.
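As a rough illustration of looking for GSE, SRP, and dbGaP identifiers, the sketch below scans free text with regular expressions. This is an assumption about one possible approach, not how the MC2 pubmed-crawler necessarily does it; the `find_accessions` helper, the pattern details, and the sample accessions are made up.

```python
import re

# Assumed accession shapes: GSE/SRP followed by digits, and dbGaP study IDs
# of the form phs + six digits with an optional .vN.pN suffix.
ACCESSION_PATTERNS = {
    "GEO": re.compile(r"\bGSE\d+\b"),
    "SRA": re.compile(r"\bSRP\d+\b"),
    "dbGaP": re.compile(r"\bphs\d{6}(?:\.v\d+\.p\d+)?\b"),
}


def find_accessions(text):
    """Return the unique accessions of each type found in `text`, sorted."""
    return {
        name: sorted(set(pattern.findall(text)))
        for name, pattern in ACCESSION_PATTERNS.items()
    }


sample = (
    "Data are available at GEO (GSE12345) and dbGaP (phs000123.v1.p1); "
    "raw reads are in SRP000001."
)
print(find_accessions(sample))
```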
Loren and I agreed that it may seem like duplicated effort here, so maybe we could create a more generalizable pubmed crawler that could be used by both projects (+ more)?
@ychae could you capture the discussion here in a doc in one of our product planning gdrive folders? We can call it pubcrawler+ for now (or another more creative name)? :)
@vpchung and @lakikowolfe thanks for syncing up and researching this. Yes, going forward we'd need to capture user scenarios across teams that need this tool and plan which features to emphasize + standardize in the first version of a Sage-wide tool. My sense is this would be a well-used service.
In preparation for outlining how to link datasets to publications as a checkpoint when datasets are moved from pre-production to production, explore the code here that was used to crawl publication information for CSBC.
Acceptance criteria: have compared the Entrez calls in the porTools script with the calls in the Python code above and noted any additional ones in the Python script.