RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

Improve fastNGD hit rate #729

Closed amykglen closed 4 years ago

amykglen commented 4 years ago

Our pickleDB fastNGD system seems to work well, except that it only covers about 10% of KG1 nodes, which means the slower back-up NGD method often has to be used.

The coverage is low because we lack arbitrary curie -> MeSH mappings, and such mappings aren't exactly easy to find (except via NCBI eUtils, which is very slow and for which we haven't found a data dump). (Description of the problem here.)
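(For context: NGD here is the Normalized Google Distance computed over PubMed occurrence counts. A minimal sketch of the standard formula; the ARAX implementation may differ in details:)

    from math import log

    # Standard Normalized Google Distance over a corpus of n documents:
    # fx / fy = number of PubMed articles mentioning each concept,
    # fxy = number mentioning both. Smaller values mean more related.
    def normalized_google_distance(fx: int, fy: int, fxy: int, n: int) -> float:
        return (max(log(fx), log(fy)) - log(fxy)) / (log(n) - min(log(fx), log(fy)))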

Our initial attempt used OxO to grab as many curie->MESH mappings as we could for KG1 nodes, but we've since thought of some other paths to explore/pursue:

  1. Can we use equivalent_to/close_match relationships in KG2 to point curies to the same MeSH term as their equivalent_to/close_match partners? (explored some here - seems it could add some mappings for KG1 nodes, but not a lot)
  2. Can we use the SRI team's node normalizer to point curies to the same MeSH term as their equivalent_curies? (explored some here - seems it could increase coverage of KG1 nodes to 18%; see the sketch after this list)
  3. Can we use KGNodeIndex.get_equivalent_curies to point curies to the same MeSH term as their equivalent curies? (not yet explored, but should work better for us than the node normalizer(?))
  4. Can we use SemMedDB to get CUI -> PMID mappings, and then map all our curies to CUIs? (so MeSH terms would not be in the picture here) (discussed here) (note: KGNodeIndex.get_equivalent_curies may help here too, for finding curie -> CUI mappings)
  5. Can we ETL/utilize the Text Mining Provider (PI: Bill Baumgartner) as a larger(?)/better(?) alternative to SemMedDB/NCBI eUtils?
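As an illustration of option 2, a rough sketch of asking the SRI node normalizer for a curie's equivalent identifiers and keeping any MESH one; the endpoint and response shape are assumptions based on the public SRI service and may have changed:

    import requests
    from typing import Optional

    def get_mesh_curie(curie: str) -> Optional[str]:
        # Endpoint and response shape per the public SRI node normalizer
        # service (assumptions, not confirmed by this thread).
        response = requests.get(
            "https://nodenormalization-sri.renci.org/get_normalized_nodes",
            params={"curie": curie},
        )
        entry = response.json().get(curie) or {}
        for equivalent in entry.get("equivalent_identifiers", []):
            if equivalent["identifier"].startswith("MESH:"):
                return equivalent["identifier"]
        return None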

Also note: The mapping work done so far has only been for KG1 nodes, but we will also need mappings for KG2 and BTE curies, since we now use them as KPs as well...


saramsey commented 4 years ago

> Can we use equivalent_to/close_match relationships in KG2 to point curies to the same MeSH term as their equivalent_to/close_match partners? (explored some here - seems it could add some mappings for KG1 nodes, but not a lot)

yes

> Can we use the SRI team's node normalizer to point curies to the same MeSH term as their equivalent_curies? (explored some here - seems it could increase coverage of KG1 nodes to 18%)

yes; assume we would use CacheControl caching

> Can we use KGNodeIndex.get_equivalent_curies to point curies to the same MeSH term as their equivalent curies? (not yet explored, but should work better for us than the node normalizer(?))

yes, and we might consider doing this before calling out to the SRI normalizer since one is a local DB lookup and the other involves calling a REST service

> Can we use SemMedDB to get CUI -> PMID mappings, and then map all our curies to CUIs? (so MeSH terms would not be in the picture here) (discussed here) (note: KGNodeIndex.get_equivalent_curies may help here too, for finding curie -> CUI mappings)

I like the idea of having this fourth on the list, since it is not really curated by humans

> Can we ETL/utilize the Text Mining Provider (PI: Bill Baumgartner) as a larger(?)/better(?) alternative to SemMedDB/NCBI eUtils?

Seems worth reaching out to Text Mining Provider to see if they have an example code fragment or documentation for accessing their KP.

dkoslicki commented 4 years ago

Idea from Eric: use the NodeSynonymizer to help map arbitrary curies to MeSH terms, thus increasing the fastNGD hit rate

@amykglen volunteered to check the coverage to see if this idea would lead to a large/medium/small improvement to fastNGD

edeutsch commented 4 years ago

So one part that leaves me puzzled is when I look at PubMed XML records, I see this:

      <KeywordList Owner="NOTNLM">
        <Keyword MajorTopicYN="N">Melatonin</Keyword>
        <Keyword MajorTopicYN="N">OPA1 and AMPK signaling pathway</Keyword>
        <Keyword MajorTopicYN="N">mitochondrial fusion</Keyword>
        <Keyword MajorTopicYN="N">mitophagy</Keyword>
        <Keyword MajorTopicYN="N">myocardial ischemia reperfusion injury</Keyword>
      </KeywordList>
    </MedlineCitation>

I don't see MeSH IDs anywhere, which is why I'm puzzled. It seems like a straightforward process to scrape out these keywords, search the NodeSynonymizer for them, and associate PubMed IDs with KG2 nodes and their curies.

So what am I missing?

amykglen commented 4 years ago

this is an example of some MeSH IDs in a PubMed record (e.g., "D000070" for the first DescriptorName):

      <MeshHeadingList>
        <MeshHeading>
          <DescriptorName UI="D000070" MajorTopicYN="N">Acebutolol</DescriptorName>
          <QualifierName UI="Q000627" MajorTopicYN="N">therapeutic use</QualifierName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName UI="D000319" MajorTopicYN="N">Adrenergic beta-Antagonists</DescriptorName>
          <QualifierName UI="Q000627" MajorTopicYN="Y">therapeutic use</QualifierName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName UI="D005260" MajorTopicYN="N">Female</DescriptorName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName UI="D006801" MajorTopicYN="N">Humans</DescriptorName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName UI="D006973" MajorTopicYN="N">Hypertension</DescriptorName>
          <QualifierName UI="Q000188" MajorTopicYN="Y">drug therapy</QualifierName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName UI="D010869" MajorTopicYN="N">Pindolol</DescriptorName>
          <QualifierName UI="Q000627" MajorTopicYN="N">therapeutic use</QualifierName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName UI="D011247" MajorTopicYN="N">Pregnancy</DescriptorName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName UI="D011249" MajorTopicYN="N">Pregnancy Complications, Cardiovascular</DescriptorName>
          <QualifierName UI="Q000188" MajorTopicYN="Y">drug therapy</QualifierName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName UI="D011433" MajorTopicYN="N">Propranolol</DescriptorName>
          <QualifierName UI="Q000627" MajorTopicYN="N">therapeutic use</QualifierName>
        </MeshHeading>
      </MeshHeadingList>

but since the NodeSynonymizer creates its synonym 'groupings' based on names anyway, perhaps bypassing the actual MeSH IDs and plugging the text name into the NodeSynonymizer is effectively the same thing?

amykglen commented 4 years ago

though after #888, I suppose the NodeSynonymizer won't strictly be based on concept names. so perhaps it's still better to go through MeSH terms?

edeutsch commented 4 years ago

Either way!

    python node_synonymizer.py --lookup=MESH:D011433
    python node_synonymizer.py --lookup=Propranolol

yield the same extensive result: everything you ever wanted to know about the concept.

You could try both. If you don't get an answer from either one, then it's not a node in KG2, so is irrelevant for our purposes anyway! #888 will only increase our equivalent curies even further, it won't detract from name lookups or anything else.

For speed we'll want to do lookups in batches, so let's optimize the method you'd want to use before setting it loose on billions of PubMed IDs. It may even make sense to transfer the on-disk SQLite database to an in-memory one to speed things up even more.
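(For example, Python's sqlite3 backup API can copy the on-disk database into RAM before a big lookup run; a minimal sketch, with the filename as a placeholder:)

    import sqlite3

    # Copy the on-disk NodeSynonymizer database into an in-memory one so
    # that millions of lookups hit RAM instead of disk. The filename here
    # is a placeholder, not the actual database name.
    disk_conn = sqlite3.connect("node_synonymizer.sqlite")
    mem_conn = sqlite3.connect(":memory:")
    disk_conn.backup(mem_conn)  # sqlite3.Connection.backup: Python 3.7+
    disk_conn.close()
    cursor = mem_conn.cursor()  # subsequent queries are served from memory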

For papers that don't have MeshHeading elements, maybe we can also use the KeywordList that I mentioned above. The XML file I'm looking at doesn't have any MeshHeading elements, only KeywordList. Strange. Or maybe we should automatically use both.

Anyway, some optimization will likely be needed, but I think we should totally be able to precompute everything we need and have an UltraFastNGD that never consults eutils and we can call it every time without slowing things down.

amykglen commented 4 years ago

ah, nice, that all makes sense. awesome! yes, I'll do a little preliminary exploration, report back, and then certainly will consult with you about optimizing the synonymizer method before I kick off any sort of full build process...

re:

> then it's not a node in KG2, so is irrelevant for our purposes anyway

well, technically we also have BTE as a KP... :) but definitely think first priority is to get an UltraFastNGD that works with KG2.

edeutsch commented 4 years ago

yes, true about BTE, but I think KG2 already surpasses BTE or soon will, and while BTE may have some edges that KG2 doesn't, do you think BTE has concepts in it that KG2 doesn't have at this point? It would be interesting to find examples. You found some mouse proteins, but I think we probably actively don't want those in KG2?

It might be fun to keep a dump file of all PubMed MeSH concepts that are NOT found in the NodeSynonymizer, so that we can eyeball them, ponder why they were not found, and perhaps improve our system: either KG2 or the NodeSynonymizer or both. For that matter, maybe Expander could keep a dump file of curies that it wanted to find in the NodeSynonymizer but didn't, and we can learn from that.

amykglen commented 4 years ago

ok, started by estimating the percentage of KG2/KG1 nodes that are mappable to a MeSH curie using the NodeSynonymizer - here are the results (averages over 20 batches of 4000 randomly selected nodes from each KG):

Estimated percentage of KG1 nodes mappable to a MeSH curie using NodeSynonymizer: 20%
Estimated percentage of KG2 nodes mappable to a MeSH curie using NodeSynonymizer: 12%

so that would still double our current coverage for KG1 nodes, but I'm guessing the text-based method will be even more successful, since it bypasses the need for a MeSH curie entirely. going to investigate that next.
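(For reference, the sampling estimate described above amounts to something like this sketch, where all_node_ids and maps_to_mesh() are hypothetical stand-ins:)

    import random

    # Average the fraction of randomly sampled nodes that map to a MeSH
    # curie over 20 batches of 4000, per the estimates above.
    # `all_node_ids` and `maps_to_mesh()` are hypothetical stand-ins.
    fractions = []
    for _ in range(20):
        batch = random.sample(all_node_ids, 4000)
        fractions.append(sum(maps_to_mesh(node) for node in batch) / len(batch))
    print(f"Estimated coverage: {sum(fractions) / len(fractions):.0%}")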

amykglen commented 4 years ago

status update: after doing some thinking about how the MeSH term name/keyword-based method will work, I can't think of a way to programmatically estimate the coverage it would give us without basically starting to build it... (it's essentially the prior approach run backwards: we need to scrape all the MeSH term names/keywords out of all PubMed articles, feed those into the NodeSynonymizer, see if it returns any matching curies for each term/keyword, record those connections, and then in the end see what percentage of nodes we were able to link to terms/keywords.)

fortunately, I don't think this will be a huge task... at least, the scraping-PubMed-articles portion of the problem should only take a few hours, I think... and if we have an optimized NodeSynonymizer method, that half of the problem shouldn't be too bad either. so @edeutsch - I think a method that takes in a list of names and returns a dictionary mapping each name to its corresponding curie synonyms (whether from KG2, KG1, or SRI) would be optimal. does that already exist or seem reasonable?

edeutsch commented 4 years ago

> status update: after doing some thinking about how the MeSH term name/keyword-based method will work, I can't think of a way to programmatically estimate the coverage it would give us without basically starting to build it... (it's essentially the prior approach run backwards: we need to scrape all the MeSH term names/keywords out of all PubMed articles, feed those into the NodeSynonymizer, see if it returns any matching curies for each term/keyword, record those connections, and then in the end see what percentage of nodes we were able to link to terms/keywords.)

I suggest: no estimate needed. just go ahead and build it! It will be grand!

> fortunately, I don't think this will be a huge task... at least, the scraping-PubMed-articles portion of the problem should only take a few hours, I think... and if we have an optimized NodeSynonymizer method, that half of the problem shouldn't be too bad either. so @edeutsch - I think a method that takes in a list of names and returns a dictionary mapping each name to its corresponding curie synonyms (whether from KG2, KG1, or SRI) would be optimal. does that already exist or seem reasonable?

So the current method get_normalizer_results() can do what you envision, but it would definitely need to be optimized because it is too slow. Since we need this to be fast, a bespoke method might be warranted.

But I'm also wondering: might it not be better to use get_canonical_curies() to fetch the canonical curie of each concept, and associate papers with canonical curies rather than with all nodes in the KGs? This makes the coverage calculation a little harder, but that's not a big problem. And it requires a two-step process when computing NGD: instead of querying directly with a curie, one first determines the canonical curie and then queries by that. One advantage is that the database can be much smaller. Maybe even the biggest table could be just two integers (pubmed_id and concept_id, an integer PK for each concept you can map), which would be super fast to index and query.

aha, maybe an even cooler idea: imagine a concept table where each concept has a PK id, a PubMed (MeSH) name, and a canonical curie. You could even create concept rows for everything in PubMed, independently of whether we can map them to curies today. Then we can have a separate process that gets progressively smarter about mapping MeSH names to canonical curies without having to rescan all of PubMed. So: two processes, one that associates pubmed_ids with keywords (each keyword gets a concept row and concept_id PK, and the linking table is just pubmed_id and concept_id), and a second, independent process that associates those keywords with our curies - either all node curies or canonical curies only, depending on what we want to do. We can then easily ask: which concepts can we not find curies for? Maybe some simple text munging can resolve a lot more (like code that associates "diabetes, type II" with "type 2 diabetes", etc.) much faster than starting from the beginning again (i.e., just improving the concept table).
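A hypothetical SQLite rendering of that two-table idea (all table and column names are illustrative, not the actual build script's schema):

    import sqlite3

    conn = sqlite3.connect("curie_to_pmids.sqlite")  # placeholder filename
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS concept (
            concept_id INTEGER PRIMARY KEY,
            name TEXT UNIQUE,       -- MeSH name / keyword scraped from PubMed
            canonical_curie TEXT    -- NULL until the mapping pass resolves it
        );
        CREATE TABLE IF NOT EXISTS pubmed_concept (
            pubmed_id  INTEGER,     -- linking table: just two integers per row
            concept_id INTEGER REFERENCES concept(concept_id)
        );
        CREATE INDEX IF NOT EXISTS idx_pc_concept ON pubmed_concept(concept_id);
    """)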

One possible snag is that two concept_ids (with different names) might map to the same canonical curie. Awkward, but we could do something sensible with that.

Another idea: writing the process to inherently work in batches may be a speed bonus. The NodeSynonymizer works faster in batches, and SQL INSERTs work much faster in batches. I INSERT records in batches of 5000 while building the NodeSynonymizer database and that seems to work reasonably speedily; doing them individually is slower. Maybe process 10000 publications in a batch, keep concept_ids in an in-memory hash during the whole step-1 process, and write out pubmed_id-concept_id mappings in batches of 5000. Do NodeSynonymizer lookups in batches of 10000, too.
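(A minimal sketch of the batched-INSERT pattern, assuming the hypothetical schema above and a hypothetical generator pubmed_concept_pairs() produced by the scraping step:)

    import sqlite3

    conn = sqlite3.connect("curie_to_pmids.sqlite")  # placeholder filename
    BATCH_SIZE = 5000
    batch = []
    for pair in pubmed_concept_pairs():  # hypothetical (pubmed_id, concept_id) source
        batch.append(pair)
        if len(batch) == BATCH_SIZE:
            conn.executemany("INSERT INTO pubmed_concept VALUES (?, ?)", batch)
            batch = []
    if batch:  # flush the final partial batch
        conn.executemany("INSERT INTO pubmed_concept VALUES (?, ?)", batch)
    conn.commit()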

Two other ideas that are still bubbling in my head:

  1. Can we do better than just MeSH keywords? a) Can we also use the KeywordList (above) to create and associate concepts? b) Can we build a (separate) simple text miner that looks for NodeSynonymizer node names (especially protein names/accession numbers) in PubMed abstracts, and store that in the same way?
  2. Can we follow up on Noel's idea and potentially create edges between concepts that have none? Maybe something like:

I don't advocate these latter bubbly ideas for the first pass, but maybe we can build a database that could facilitate them later?

Apologies for the lengthy brain dump, but maybe some ideas will be useful!

amykglen commented 4 years ago

nice! yes, great thoughts... some responses:

and I like your 1b) and 2) ideas, though I agree it makes sense to save them for another pass.

edeutsch commented 4 years ago

great!

> so get_canonical_curies() accepts names as input?

not at the moment. But I can add it right now.

The only design decision is: should I do this:

    result = node_synonymizer.get_canonical_curies(curies=curie_list)
    result = node_synonymizer.get_canonical_curies(names=name_list)

or this:

    result = node_synonymizer.get_canonical_curies(potentially_mixed_names_and_or_curies_list)

?

The latter is potentially easier for the user, but probably slower (I'd need to do two queries) and more complicated to code.

amykglen commented 4 years ago

I'm fine with the former (curies=curie_list / names=name_list). (And I think I'm the only one who uses get_canonical_curies at the moment, so it's easy for me to update my call to it if needed.)

edeutsch commented 4 years ago

okay, committed to master:

    result = node_synonymizer.get_canonical_curies(curies=curie_list, names=name_list)

unlimited batch size. minimally tested. please report issues.
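(Usage looks roughly like this; the import path and the exact shape of the returned dict are assumptions, not confirmed by this thread:)

    from node_synonymizer import NodeSynonymizer  # assumed path within the RTX repo

    synonymizer = NodeSynonymizer()
    result = synonymizer.get_canonical_curies(
        curies=["MESH:D011433"], names=["Propranolol"]
    )
    # Assumed result shape: a dict keyed by each input curie/name, with
    # canonical-curie info for hits and None (or similar) for misses.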

amykglen commented 4 years ago

awesome! I'll start some testing of that method soon..

fyi, Steve is upping the memory on pubmed.rtx.ai for me, and then I'll get the scraping-all-of-pubmed portion of the build kicked off on there.. :)

amykglen commented 4 years ago

for the record and for clarity's sake: I did a bit more digging into the different elements in PubMed XML files, and the ones I'm planning to use (i.e., extract their text content and use it as a key) are:

MeSH DescriptorNames and QualifierNames, as seen here:

        <MeshHeading>
          <DescriptorName UI="D004798" MajorTopicYN="N">Enzymes</DescriptorName>
          <QualifierName UI="Q000652" MajorTopicYN="Y">urine</QualifierName>
        </MeshHeading>

NameOfSubstances from the ChemicalList, as seen here:

      <ChemicalList>
        <Chemical>
          <RegistryNumber>0</RegistryNumber>
          <NameOfSubstance UI="D008055">Lipids</NameOfSubstance>
        </Chemical>
        <Chemical>
          <RegistryNumber>0</RegistryNumber>
          <NameOfSubstance UI="C015518">sulfolipids</NameOfSubstance>
        </Chemical>
      </ChemicalList>

Keywords, as seen here:

      <KeywordList Owner="NOTNLM">
        <Keyword MajorTopicYN="N">GLV (green leaf volatile)</Keyword>
        <Keyword MajorTopicYN="N">HIPV (herbivory-induced plant volatile)</Keyword>
        <Keyword MajorTopicYN="N">Nicotiana attenuata</Keyword>
      </KeywordList>

and GeneSymbols, as seen here:

      <GeneSymbolList>
        <GeneSymbol>bcl-2</GeneSymbol>
        <GeneSymbol>myc</GeneSymbol>
      </GeneSymbolList>

let me know if any of those elements seem like a bad idea... (some documentation on what they capture is here: https://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html)

(fyi, the original pickleDB fastNGD system only extracts DescriptorName UIs.)
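(For illustration, a sketch of pulling those four element types out of a PubMed XML file with the standard library; the filename is a placeholder:)

    import xml.etree.ElementTree as ET

    # Collect the text of the four element types listed above from one
    # PubMed XML file ("pubmed_sample.xml" is a placeholder name).
    root = ET.parse("pubmed_sample.xml").getroot()
    names = set()
    for citation in root.iter("MedlineCitation"):
        for tag in ("DescriptorName", "QualifierName", "NameOfSubstance",
                    "Keyword", "GeneSymbol"):
            for element in citation.iter(tag):
                if element.text:
                    names.add(element.text.strip())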

edeutsch commented 4 years ago

Seems like a good idea to record all those different kinds of keywords, I think.

I wonder if it might be worth storing and using the identifiers as well, when provided. The NodeSynonymizer certainly has those terms above:

    python node_synonymizer.py --lookup MESH:D004798 --kg_name KG2

But it also has the names, so the likelihood of it having the identifier but not the name is probably small, if not zero. If you did keep them, we could then ask: in which cases does lookup by identifier yield a different answer than lookup by name? That might shed light on a problem.

One other random thought/concern that I didn't mention earlier: I wonder if we can/should try to determine which species each paper is about. If a paper can be identified as being about Arabidopsis, say, maybe it would be better to just discard it rather than potentially pollute the index with unrelated information. It's unclear whether the danger of unjustified exclusion outweighs the benefit of justified exclusion, though. Probably too hard to tackle now, but I'll toss it out there.

amykglen commented 4 years ago

hmm, I'm seeing this error when I send a large batch of names to get_canonical_curies(), @edeutsch:

  Sending NodeSynonymizer.get_canonical_curies() a list of 26120 concept names..
Traceback (most recent call last):
  File "build_ngd_database.py", line 169, in <module>
    main()
  File "build_ngd_database.py", line 165, in main
    database_builder.build_curie_to_pmids_db()
  File "build_ngd_database.py", line 90, in build_curie_to_pmids_db
    canonical_curies_dict = synonymizer.get_canonical_curies(names=concept_names)
  File "/Users/aglen/translator/RTX_new/RTX/code/ARAX/ARAXQuery/Overlay/ngd/../../../NodeSynonymizer/node_synonymizer.py", line 1222, in get_canonical_curies
    cursor.execute( sql )
sqlite3.OperationalError: near "association": syntax error

(no error when I send a batch of only 79 names.)

the input list causing the error is in this file: input_names_causing_error.txt

amykglen commented 4 years ago

@edeutsch - found a much smaller list for which that kind of error occurs:

['Campylobacter Infections', 'lomefloxacin', "Practice Patterns, Physicians'", 'Drug Resistance, Microbial']

Traceback (most recent call last):
  File "build_ngd_database.py", line 173, in <module>
    main()
  File "build_ngd_database.py", line 169, in main
    database_builder.build_curie_to_pmids_db()
  File "build_ngd_database.py", line 94, in build_curie_to_pmids_db
    canonical_curies_dict = synonymizer.get_canonical_curies(names=['Campylobacter Infections', 'lomefloxacin', "Practice Patterns, Physicians'", 'Drug Resistance, Microbial'])
  File "/Users/aglen/translator/RTX_new/RTX/code/ARAX/ARAXQuery/Overlay/ngd/../../../NodeSynonymizer/node_synonymizer.py", line 1222, in get_canonical_curies
    cursor.execute( sql )
sqlite3.OperationalError: near "drug": syntax error

looks like perhaps it's related to the quote within a quote in "Practice Patterns, Physicians'"?

edeutsch commented 4 years ago

Sorry, yes, a quoting problem. I have just fixed it in master.
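(For reference, SQLite parameter substitution sidesteps this whole class of quoting bug, since the driver handles embedded quotes itself; a sketch with hypothetical file and table names:)

    import sqlite3

    conn = sqlite3.connect("node_synonymizer.sqlite")  # placeholder filename
    names = ["Campylobacter Infections", "Practice Patterns, Physicians'"]
    placeholders = ",".join("?" * len(names))
    # Values are passed separately from the SQL string, so no manual escaping:
    rows = conn.execute(
        f"SELECT * FROM nodes WHERE name IN ({placeholders})", names
    ).fetchall()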

amykglen commented 4 years ago

awesome - confirmed it's working for me now. thanks!

amykglen commented 4 years ago

ok, so the build is all done - an update:

I started trying to estimate the 'back-up' NGD method's coverage (a little difficult to do because of slowness/API errors)... it seems to be somewhere around 40-50% for KG2 nodes (or at least, that's approximately the fraction of nodes NormGoogleDistance.get_pmids_for_all() returns PMIDs for)

it seems like it'd be useful to know more about which types of nodes we're missing with the new system, so I think I'll do a little analysis of that..

edeutsch commented 4 years ago

outstanding!

I agree that finding some examples where eutils NGD differs substantially from local NGD and trying to understand that would be instructive.

amykglen commented 4 years ago

fyi - some estimates of how well different node types are covered in the new system...

Estimated coverage of KG1 nodes: 34%.
  chemical_role: 100%
  activity: 100%
  continuant: 100%
  organism_taxon: 100%
  biological_molecular_complex: 100%
  occupation: 100%
  behavior: 100%
  inheritance: 100%
  concept: 86%
  anatomical_entity: 68%
  protein: 66%
  chemical_substance: 61%
  disease: 50%
  product: 50%
  unknown_category: 48%
  RNA: 23%
  phenotypic_feature: 20%
  cellular_component: 18%
  pathway: 12%
  gene: 8%
  finding: 7%
  biological_process: 6%
  disease_susceptibility: 6%
  metabolite: 5%
  named_thing: 4%
  molecular_function: 2%
  gene_set: 0%
  deprecated_node: 0%
  substance: 0%
Estimated coverage of KG2 nodes: 17%.
  chemical_role: 100%
  semantic_type: 100%
  diagnostic_or_prognostic_factor: 100%
  environmental_feature: 100%
  occupation: 69%
  disease_characteristic: 50%
  application: 50%
  unknown_category: 46%
  behavior: 42%
  protein: 34%
  phenotypic_quality: 31%
  disease: 30%
  group_of_people: 29%
  organism_taxon: 28%
  phenotypic_feature: 27%
  organization: 27%
  language: 23%
  information_entity_type: 21%
  biological_molecular_complex: 20%
  chemical_substance: 17%
  named_thing: 16%
  cellular_component: 16%
  biological_substance: 14%
  concept: 13%
  product: 12%
  substance: 12%
  property: 12%
  disease_susceptibility: 11%
  biological_process: 10%
  anatomical_entity: 9%
  mechanism_of_action: 9%
  biological_role: 9%
  gene: 8%
  conceptual_entity: 8%
  continuant: 7%
  pathway: 6%
  device: 6%
  metabolite: 6%
  activity: 3%
  molecular_function: 3%
  finding: 2%
  phenomenon_or_process: 2%
  gene_set: 2%
  attribute: 1%
  drug: 1%
  deprecated_node: 0%
  cell_type: 0%
  relationship_type: 0%
  hazardous_substance: 0%
  therapy: 0%
  method: 0%
  prevalence: 0%
  RNA: 0%
  data_source: 0%
  protein_family: 0%

(note: the 'preferred_type' for each node was used to calculate these data.)

amykglen commented 4 years ago

so I still need to do some comparisons to eUtils to start to see what it finds results for that we don't, but because it was easy to do, I first tried extracting PMIDs from the publications property on nodes/edges in KG2 and adding those to our big curie->PMID database.

with those in the picture, estimated coverage for the local NGD system is 58% for KG1 nodes and 34% for KG2 nodes.

this does take us into the realm of using non-human-curated data (the majority of the edges with publications are from SemMedDB), so it's perhaps not as ideal. but the coverage boost seems pretty significant.

the final database is 5.4 GB with this additional data.
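(The extraction step amounts to something like this sketch; the edge property names and shapes are assumptions, not the actual build code:)

    from collections import defaultdict

    # Fold edge-attached publications into the curie->PMIDs map. Assumes
    # each edge dict carries a `publications` list of strings like
    # "PMID:12345" plus `subject`/`object` curies (property names assumed).
    curie_to_pmids = defaultdict(set)
    for edge in kg2_edges:  # hypothetical iterable of KG2 edge dicts
        pmids = {p for p in edge.get("publications", []) if p.startswith("PMID:")}
        for curie in (edge["subject"], edge["object"]):
            curie_to_pmids[curie] |= pmids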

amykglen commented 4 years ago

ok, per the plan discussed at today's mini-hackathon, I've now:

so now if you run the pytest suite after pulling (and running pip install -r requirements.txt), the first test that uses overlay(action=compute_ngd) will take a few extra minutes as it downloads the database. but after that, NGD should be much faster. (@dkoslicki)

(there are still improvements to be made, but it is at least much faster than the current system - hence the interim deployment.)

dkoslicki commented 4 years ago

Awesome, thanks @amykglen! Trying it out now...

amykglen commented 4 years ago

fyi - I just found a bug that was preventing the fastNGD database from being used (the download step was completing successfully, but the path used to load the DB was incorrect) - just tested and pushed a fix to master.

(so if you tried it before, I think it would've been using only the slow back-up eUtils method. but in the queries I've tested since, I'm seeing fastNGD usage rates around 90-100%.)

amykglen commented 4 years ago

ok, I think I'm going to finally close this issue - I'm seeing an ~95%+ utilization rate of the fastNGD system for ARAX queries on test/production. some examples of 'misses' are written up in #1046 and #1047 (as @edeutsch requested), and I think any discussion of whether we want to pursue even greater coverage can happen over there.

edeutsch commented 4 years ago

I think 95% is a sensational utilization rate! This has moved NGD from a slow, painful process that I was reluctant to run to something we can always do. Many thanks for getting this to work!