Familiarize with python-based version of publication information crawling

lakikowolfe commented 2 years ago

In preparation for outlining how to link datasets to publications as a checkpoint when datasets move from pre-production to production, explore the code here that was used to crawl publication information for CSBC.

Acceptance criteria: the Entrez calls in the porTools script have been compared with the calls in the Python code above, and any additional calls in the Python script have been noted.

ychae commented 2 years ago

medium importance with unclear timeline

abbywall commented 2 years ago

Timeline for completion would need to be by the end of the funding cycle, July 29th

lakikowolfe commented 2 years ago

`porTools` functions

Using grant serial numbers, query PubMed for the associated publications' PubMed IDs
Using the publications' PubMed IDs, query PubMed for metadata (i.e. author, journal, pub date, etc.)
Create an entity name (seems to be a unique identifier made up of the author name, journal name, and year)
Set up Synapse annotations
Update the Synapse table
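
For reference, a minimal sketch of the entity-name step. The exact format porTools uses isn't stated in this thread, so the function name `make_entity_name`, the underscore separator, and the character clean-up are all assumptions made for illustration.

```python
import re


def make_entity_name(first_author: str, journal: str, year: int) -> str:
    """Build an identifier-like name from author, journal, and year.

    Hypothetical helper: the separator and clean-up rules are assumptions,
    not porTools' actual behavior.
    """
    parts = [first_author, journal, str(year)]
    # Collapse every run of non-alphanumeric characters into one underscore,
    # then trim underscores left at either end of each part.
    cleaned = [re.sub(r"[^A-Za-z0-9]+", "_", part).strip("_") for part in parts]
    return "_".join(cleaned)


# Made-up example values
print(make_entity_name("Smith J", "Nat. Methods", 2020))
```

Two publications by the same first author in the same journal and year would collide under this scheme, so a real implementation would probably need a tie-breaker such as the PubMed ID.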

`syndccutils` functions

There's a lot going on in here that isn't PubMed-related. Most of it seems related to Synapse querying and project building, plus some basic Python utilities.

Two main Pubmed query functions:

getPMIDDF

    Given a list of grant numbers with associated synapse metadata: consortium synapse ID and grant sub-type, scrapes
    pubMed for each grant's publication and retrieves simple information such as publication title, year, and authors.
    It also checks if any GEO data has been produced by the publication study. If so, then it saves the GEO html
    links in a comma separated list. Per each publication, there will be a row in the final dataframe/synapse table
    that maps back to the grant number and consortium synapse ID (i.e., the Key of this table is the PubMed column).
    :param pubmedIds: List of pubmed Ids
    :param consortiumGrants: List of grants
    :param consortiumView: File-view or table holding projects grant annotations
    :param consortiumName: Consortium name ex. csbc

pubmed

    Given a list of grant numbers pulled from a synapse table column, utilizes a pubmed API to generate a search query.
    This query is constructed by the union ('or' logic) of all the grant numbers, which would aid in pulling down a list
    of all PubMed publication id's associated with the grants. Then it will go through the PubMed id's and scrape the
    publication for basic informative information.
    :param args: User defined arguments
    :param syn: A logged in synapse object
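
A rough sketch of the "union of all grant numbers" search described in that docstring, written against the public NCBI E-utilities `esearch` endpoint. This is not the syndccutils code: the helper names, the use of the `[Grant Number]` field tag, and the `retmax` value are assumptions, and the grant numbers are made up. It only builds the URL and does not call the network.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_grant_term(grant_numbers):
    """OR together one '[Grant Number]' clause per grant (hypothetical helper)."""
    clauses = [f"{grant}[Grant Number]" for grant in grant_numbers]
    return " OR ".join(clauses)


def build_esearch_url(grant_numbers, retmax=1000):
    """Return an esearch URL asking PubMed for PMIDs matching any of the grants."""
    params = {
        "db": "pubmed",
        "term": build_grant_term(grant_numbers),
        "retmode": "json",
        "retmax": retmax,  # assumed cap on the number of returned IDs
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Made-up grant numbers, for illustration only
print(build_grant_term(["CA000001", "CA000002"]))
print(build_esearch_url(["CA000001", "CA000002"]))
```

The PMIDs returned by that search would then feed a second request (e.g. `esummary` or `efetch`) for the per-publication metadata.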

Key similarities

The PubMed pull is exactly the same in both (PubMed ID, author, title, pub date, journal).

Key differences

syndccutils additionally checks for a GEO submission and saves the GEO HTML links if they exist.

lakikowolfe commented 2 years ago

@milen-sage this is what I've gathered after going through syndccutils/__main__.py

milen-sage commented 2 years ago

@lakikowolfe thanks!

Were you able to set up syndccutils on your machine? Once it's set up, this is a command that would gather publications for CSBC, for future reference:

syndccutils pubmed --projectId syn7080714 --tableId syn20938140 --name CSBC

You might need access to these synapse resources to test it out.

lakikowolfe commented 2 years ago

I'll be working on this today! Will let you know if I hit any Synapse blocks

lakikowolfe commented 2 years ago

~Blocker : Can't access syn20938140~

Given access by Milen this morning.

lakikowolfe commented 2 years ago

Reran the command from above. It:

Prompts to sign into Synapse
Completes a Synapse table query
Errors out

Traceback (most recent call last):
  File "/Users/lwolfe/Documents/work/ibc-fair/.misc_venv/lib/python3.9/site-packages/syndccutils-1.1.0-py3.9.egg/syndccutils/__main__.py", line 1192, in performMain
  File "/Users/lwolfe/Documents/work/ibc-fair/.misc_venv/lib/python3.9/site-packages/syndccutils-1.1.0-py3.9.egg/syndccutils/__main__.py", line 433, in pubmed
  File "/Users/lwolfe/Documents/work/ibc-fair/.misc_venv/lib/python3.9/site-packages/syndccutils-1.1.0-py3.9.egg/syndccutils/__main__.py", line 159, in getGrantList
  File "/Users/lwolfe/Documents/work/ibc-fair/.misc_venv/lib/python3.9/site-packages/pandas-1.4.2-py3.9-macosx-10.9-x86_64.egg/pandas/core/generic.py", line 5575, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'grantNumber'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/lwolfe/Documents/work/ibc-fair/.misc_venv/bin/syndccutils", line 33, in <module>
    sys.exit(load_entry_point('syndccutils==1.1.0', 'console_scripts', 'syndccutils')())
  File "/Users/lwolfe/Documents/work/ibc-fair/.misc_venv/lib/python3.9/site-packages/syndccutils-1.1.0-py3.9.egg/syndccutils/__main__.py", line 1204, in main
  File "/Users/lwolfe/Documents/work/ibc-fair/.misc_venv/lib/python3.9/site-packages/syndccutils-1.1.0-py3.9.egg/syndccutils/__main__.py", line 1194, in performMain
AttributeError: 'Namespace' object has no attribute 'debug'

Looks like the first error is occurring in getGrantList.

def getGrantList(syn, tableSynId):
    """
    Get's the column containing grant numbers, drops the empty cells if any, and returns a list of grant numbers.

    :param syn:  A logged in synapse object
    :param tableSynId: File-view or table holding projects grant annotations
    :return:
    """
    consortiumGrants = syn.tableQuery("select * from %s" % tableSynId)
    consortiumGrants = consortiumGrants.asDataFrame()
    consortiumGrants = list(consortiumGrants.grantNumber.dropna())
    return consortiumGrants

line 159 - consortiumGrants = list(consortiumGrants.grantNumber.dropna())
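
Side note on the second exception in the traceback (`'Namespace' object has no attribute 'debug'`): it suggests the error handler in performMain reads `args.debug` even though no such flag was registered, which hides the real error. A minimal sketch of one way to guard against that, assuming a plain argparse setup; `is_debug` is a hypothetical helper, not something in syndccutils.

```python
import argparse


def is_debug(args: argparse.Namespace) -> bool:
    """Return the debug flag, defaulting to False if it was never defined."""
    # getattr with a default avoids an AttributeError on a missing attribute.
    return bool(getattr(args, "debug", False))


# A parser that, like the failing run, has no --debug argument.
parser = argparse.ArgumentParser()
parser.add_argument("--tableId")
args = parser.parse_args(["--tableId", "syn20938140"])

print(is_debug(args))
```

Registering `--debug` with `action="store_true"` on the parser would be the other obvious fix; either way the original AttributeError from getGrantList would surface instead of being masked.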

lakikowolfe commented 2 years ago

Based on my understanding of the pubmed code chunk below (starts at line 437):

    if args.grantviewId is not None:
        grantviewId = args.grantviewId
    else:
        grantviewId = "syn10142562"

If grantviewId is not provided it defaults to syn10142562.

This table doesn't have the grantNumber attribute that getGrantList is looking for.
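
A small sketch of how the lookup could fail with a clearer message when the view lacks the expected column. `require_column` is a hypothetical helper, not part of syndccutils; it only needs an iterable of column names, so it could be handed `df.columns` from the queried dataframe.

```python
def require_column(columns, name, table_id):
    """Raise a descriptive error if `name` is not among `columns`."""
    available = list(columns)
    if name not in available:
        raise ValueError(
            f"Table {table_id} has no '{name}' column; "
            f"available columns: {available}"
        )
    return name


# Made-up column names, for illustration only
try:
    require_column(["id", "name"], "grantNumber", "syn10142562")
except ValueError as err:
    print(err)
```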

lakikowolfe commented 2 years ago

@milen-sage you mentioned that the grant table might have changed/moved. Do you know if there is a grant table other than syn10142562 to point to?

milen-sage commented 2 years ago

@brynnz22 and @vpchung we're revisiting the pubmed crawling script used in the past for CSBC. Do you know which table/view currently contains the grant information we can point to? See @lakikowolfe's question above. Thanks!

vpchung commented 2 years ago

The latest pubmed crawler currently uses this table to query for related publications.

One thing to note - our last meeting with NCI has revealed that we may be missing up to 20 or so grants at the moment 😅

milen-sage commented 2 years ago

Thanks @vpchung! We are just interested in the kinds of attributes the crawler returns, so it's fine for now if particular grants/publications are omitted.

vpchung commented 2 years ago

Would it help if I provide what a sample final output would be? I am also more than happy to do a code walk with @lakikowolfe if that will help speed things up as well.

lakikowolfe commented 2 years ago

yeah that would be helpful! @milen-sage do we need anything other than the final output?

brynnz22 commented 2 years ago

@vpchung Can I be invited to the walkthrough? This would be helpful for me too!

milen-sage commented 2 years ago

Just the final output would be good for now; and I don't need to be a bottleneck for scheduling :)

Thanks @vpchung !

lakikowolfe commented 2 years ago

Notes from code walkthrough with @brynnz22 and @vpchung

mc2-center/pubmed-crawler is a refactored version of the originally linked syndccutils script
Collects title, author, PMID, publication date, and datasets related to the publication
Outputs a manifest in Excel, which is then checked for bugs and has some metadata added by hand
More efficient than porTools: does not pull information for PMIDs that have already been collected
Similar to porTools in that it is highly specific to the group using it - would take some work to refactor
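
The "skip PMIDs that have already been collected" idea can be sketched roughly like this. It is not the pubmed-crawler code; the function name and the choice to normalize IDs to strings are assumptions.

```python
def pmids_to_fetch(found_pmids, known_pmids):
    """Return PMIDs from `found_pmids` that are not already in `known_pmids`.

    Order is preserved and duplicates are dropped. IDs are compared as
    strings so that 123 and "123" count as the same PMID.
    """
    seen = {str(pmid) for pmid in known_pmids}
    fresh = []
    for pmid in found_pmids:
        key = str(pmid)
        if key not in seen:
            seen.add(key)
            fresh.append(key)
    return fresh


# Made-up PMIDs, for illustration only
print(pmids_to_fetch(["111", "222", "333", "222"], [222]))
```

Only the returned IDs would then need a metadata request, which is where the savings over re-querying everything would come from.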

As a group we had a few questions:

What are the different use cases for pub crawler output?
Is there a timeline on this?
Any other pub crawlers at Sage?
R vs python implementation?

Brynn and Verena please let me know if I missed anything!

vpchung commented 2 years ago

That's the gist of it! In addition to what @lakikowolfe mentioned, the MC2 pubmed-crawler also captures the journal, DOI, keywords, and MeSH terms. For related databases, we specifically look for GSE, SRP, and dbGaP.
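
To illustrate the related-database lookup, here is a rough sketch of pulling GSE, SRP, and dbGaP-style accessions out of free text with regular expressions. This is not the pubmed-crawler implementation: the patterns (including the assumption that dbGaP study IDs look like `phs` plus six digits), the dictionary keys, and the sample text are all illustrative.

```python
import re

# Assumed accession shapes; the real crawler may match differently.
ACCESSION_PATTERNS = {
    "GEO": re.compile(r"\bGSE\d+\b"),
    "SRA": re.compile(r"\bSRP\d+\b"),
    "dbGaP": re.compile(r"\bphs\d{6}\b"),
}


def find_accessions(text):
    """Return the unique accessions found in `text`, sorted, per database."""
    return {
        database: sorted(set(pattern.findall(text)))
        for database, pattern in ACCESSION_PATTERNS.items()
    }


# Made-up text and accession numbers, for illustration only
sample = (
    "Data are in GEO (GSE12345) and dbGaP phs000178.v1.p1; "
    "reads at SRP000001 and GSE12345."
)
print(find_accessions(sample))
```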

Loren and I agreed that it may seem like duplicated effort here, so maybe we could create a more generalizable pubmed crawler that could be used by both projects (+ more)?

milen-sage commented 2 years ago

@ychae could you capture the discussion here in a doc in one of our product planning gdrive folders? We can call it pubcrawler+ for now (or another more creative name)? :)

@vpchung and @lakikowolfe thanks for syncing up and researching this. Yes, going forward we'd need to capture user scenarios across teams that need this tool and plan which features to emphasize + standardize in the first version of a Sage-wide tool. My sense is this would be a well-used service.

Sage-Bionetworks / porTools