MassBank / RMassBank

Playground for experiments on the official http://bioconductor.org/packages/devel/bioc/html/RMassBank.html

CTS retrieving non-live CIDs #225

Closed by schymane 11 months ago

schymane commented 4 years ago

We should make PubChem, not CTS, our default source of CIDs. Too many discrepancies/errors are cropping up. @meier-rene we may have to check the "status" of CIDs during validation, to catch and fix non-live ones.

Example from freshly-created infolist: https://pubchem.ncbi.nlm.nih.gov/compound/4644

tsufz commented 4 years ago

Hi Emma, this is a great enhancement. Using the original data source is always the best option.

Have a nice weekend, Tobi

schymane commented 4 years ago

getPcId will also return non-live CIDs for non-standard tautomeric forms. We can fix this using a function in RChemMass, but we should make sure we upgrade the getPcId function to automatically do this. On the "todo" list ...

> getPcId("NPZTUJOABDZTLV-UHFFFAOYSA-N")
[1] 2759291
> getPCIDs.CIDtype(getPcId("NPZTUJOABDZTLV-UHFFFAOYSA-N"),type="preferred")
[1] 135399369
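
To illustrate the lookup shown above, here is a minimal sketch of how a "preferred" CID might be requested from PubChem directly. It is not the code in RMassBank or RChemMass: the helper names preferredCidUrl and parsePreferredCid are hypothetical, and the use of the PUG REST cids operation with cids_type=preferred is an assumption that should be checked against the PubChem documentation. The example parses a canned response, so it runs offline.

```r
# Hypothetical helper (not part of RMassBank or RChemMass): builds a PUG REST
# URL asking PubChem for the preferred ("live") CID of a given CID.
# Assumption: the "cids" operation accepts cids_type=preferred.
preferredCidUrl <- function(cid) {
  paste0("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/",
         as.character(as.integer(cid)),  # integer avoids "1e+05"-style output
         "/cids/TXT?cids_type=preferred")
}

# Hypothetical helper: parses a plain-text response (one CID per line) and
# returns the first CID as an integer, or NA if the response holds no CID.
parsePreferredCid <- function(txt) {
  lines <- trimws(unlist(strsplit(txt, "\n", fixed = TRUE)))
  lines <- lines[grepl("^[0-9]+$", lines)]
  if (length(lines) == 0) return(NA_integer_)
  as.integer(lines[1])
}

# Offline usage with a canned response string. A real call would fetch the
# URL first, e.g. with readLines(preferredCidUrl(2759291)).
url <- preferredCidUrl(2759291)
cid <- parsePreferredCid("135399369\n")
```

Returning NA for an empty or non-numeric response lets a caller keep the original CID and flag the entry for manual review instead of failing.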

schymane commented 4 years ago

@meowcat I will need to upgrade getPcId to make sure it returns "live" CIDs, this is fine - but where can I check whether we grab PubChem CIDs from PubChem vs CTS? If we already use getPcId (not CTS), then all I will need to do is fix that function, then this issue is solved. Thanks.

meowcat commented 4 years ago

So: in createMassBank.R, line 577 is where we call gatherPubChem to get data off PubChem. https://github.com/MassBank/RMassBank/blob/611b78578b54156119080b57569c09586a18fe84/R/createMassBank.R#L577

Lines 602-608 are where we get CTS data. To do: what do we still need from CTS at this point?

https://github.com/MassBank/RMassBank/blob/611b78578b54156119080b57569c09586a18fe84/R/createMassBank.R#L602-L608

Lines 775-786 are where we decide which PubChem ID to use. I guess you want to drop CTS completely as an option? https://github.com/MassBank/RMassBank/blob/611b78578b54156119080b57569c09586a18fe84/R/createMassBank.R#L775-L786

Then the actual data retrieval from PubChem is in gatherPubChem, where getPcId is called: https://github.com/MassBank/RMassBank/blob/611b78578b54156119080b57569c09586a18fe84/R/createMassBank.R#L454-L466

getPcId is then the function in webAccess.R: https://github.com/MassBank/RMassBank/blob/611b78578b54156119080b57569c09586a18fe84/R/webAccess.R#L109-L144
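
For discussion, the decision step at lines 775-786 could be reduced to something like the sketch below. This is not the existing code: chooseCid and its arguments are hypothetical names, and the CID values are arbitrary placeholders. It only illustrates "PubChem first, CTS as an optional fallback", with the fallback switched off by default.

```r
# Hypothetical helper (not the code at createMassBank.R#L775-L786): prefer the
# CID retrieved from PubChem; use the CTS value only if explicitly allowed.
chooseCid <- function(pubchemCid, ctsCid = NA, allowCtsFallback = FALSE) {
  if (!is.na(pubchemCid)) return(pubchemCid)
  if (allowCtsFallback && !is.na(ctsCid)) return(ctsCid)
  NA
}

# Placeholder CIDs, for illustration only.
fromPubChem  <- chooseCid(111, ctsCid = 222)                          # 111
noFallback   <- chooseCid(NA, ctsCid = 222)                           # NA
withFallback <- chooseCid(NA, ctsCid = 222, allowCtsFallback = TRUE)  # 222
```

Dropping CTS completely would then amount to never setting allowCtsFallback, or removing that argument altogether.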

schymane commented 4 years ago

@meowcat I've created a new branch: https://github.com/MassBank/RMassBank/tree/preferredPCIDs

I've added getPCIDs.CIDtype, adjusted getPcId and createMassBank.R (and updated my old email address). I'm stuck on the documentation - see emails.

schymane commented 4 years ago

Just pushed https://github.com/MassBank/RMassBank/commit/445b43243036b18aad3c5343260652eb4945f9cf thanks to @MaliRemorker for docs tips. @meowcat pls let me know if I should do a pull request (this results in a lot of changes), or if you want me to change anything?

tsufz commented 11 months ago

The default source of CIDs is PubChem, so this issue could be closed.