wkumler closed this 1 year ago
The PubChem script doesn't catch 100% of the compounds, but it was a good first automation. Laura did say she was going to do some of the missed ChEBI stuff by hand, so I think we can probably apply the same process to the PubChem stuff that got missed.
Ok yeah I see a good chunk is missing. I'll review the script and make sure it's not missing anything obvious before we move to manual fixing.
I reviewed the PubChem script and reminded myself of where we stood on it during the last update. The script attempts to automate what it can from our primary Compound_Name column, with about a 60% success rate as of the last PubChem download. I recall you mentioning that PubChem is harder to access now (or was that a different database?). The script works by taking our primary compound name, running it through PubChem's RESTful API, and returning whatever that produces. One thing I can try to increase output is checking all the names we have for a compound, rather than just our primary name, to see if that returns more. Otherwise, I remember speaking to Laura about this, and I think we decided that making manual changes as people need those compound names is probably the best use of our time.
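To make the "check all names" idea concrete, here is a rough Python sketch. It is not our actual script. It assumes the lookup goes through PubChem's public PUG REST name-to-CID endpoint, and the function names (`fetch_cids`, `first_hit`) are made up for illustration.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# PubChem PUG REST name-to-CID endpoint (assumed to be what we'd query).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{}/cids/JSON"


def fetch_cids(name, timeout=10):
    """Return the list of PubChem CIDs matching one name, or [] if none match."""
    url = PUG_REST.format(urllib.parse.quote(name, safe=""))
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:  # PubChem answers 404 when the name is unknown
            return []
        raise  # anything else (rate limiting, outage) should not look like a miss
    return payload.get("IdentifierList", {}).get("CID", [])


def first_hit(names, lookup=fetch_cids):
    """Try each candidate name in order; return (name, cid) for the first match.

    `lookup` is injectable so the logic can be tested without network access.
    Returns (None, None) when no name matches.
    """
    for name in names:
        if not name:  # skip blank cells in the alternate-name columns
            continue
        cids = lookup(name)
        if cids:
            return name, cids[0]
    return None, None
```

Something like `first_hit(["Glycine betaine", "Betaine"])` would then fall through to the second name if the first returns nothing. Taking the first CID is a simplification; a name that matches several CIDs would probably still need a manual look.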
What do you think? I'm totally open to suggestions.
PubChem should be the same as it always is; Metlin is the one that changed recently. I guess I'd be curious to see why GBT fails - does PubChem not have that as a synonym? We got our Compound_Name column from somewhere to standardize them all, right? Maybe there's an issue there too.
The Compound_Name column is what we decided as a lab to use as the primary name, following agreed-upon conventions for capitalization, symbols, etc. That column may not be connected to any online database. It was chosen in consideration of the many ways we often refer to a compound, with the main downside being that we prioritized ease of use over standardization for third-party use. So yeah, it probably does make sense to run PubChem with more than just the Compound_Name column!
As for GBT, I just ran it and noticed that the issue is that we call it glycine betaine, and PubChem calls it Betaine.
Reviewing this issue: because different databases use different names for the same compound, fully automating all of the external names is probably impractical. For example, glycine betaine works as a name for KEGG and ChEBI but not PubChem, while betaine glycine works for PubChem but not the other two.
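If we do go the manual route, one lightweight way to record each fix is a small hand-curated override table that says which name to send to which database. This is only a sketch: the table, the function name, and the exact spelling of our Compound_Name entry are assumptions, and the one entry shown comes from the GBT case above.

```python
# Hypothetical hand-curated overrides: (lab Compound_Name, database) -> name to query.
# Entries get added one at a time as mismatches are found.
NAME_OVERRIDES = {
    ("Glycine betaine", "pubchem"): "Betaine",
}


def query_name(compound_name, database, overrides=NAME_OVERRIDES):
    """Return the name to send to `database`, falling back to our own name."""
    return overrides.get((compound_name, database.lower()), compound_name)
```

With this, `query_name("Glycine betaine", "PubChem")` gives the PubChem-specific name, while KEGG and ChEBI lookups keep using our own name unchanged.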
@LTCarlson, I will review the Add_New_Standard script with you this week to make sure it's useful, but after that I am tempted to lean towards fixing the problems individually as they arise in the future.
@wkumler, do you have objections or suggestions?
We seem to be missing a few PubChem identifier numbers for common compounds that really should have them, GBT being the biggest among them. Not sure whether this is something we want to script or do manually, but it'd be nice to have more. The use case right now is uploading to Metabolomics Workbench and using this standards sheet to supply KEGG and PubChem IDs for meta-analysis.