Open meowcat opened 5 months ago
Do we have any idea what the longest possible InChI is? Today's InChI is defined for a maximum of 1000 atoms, and I read somewhere about an extension to 65K. The longest InChI in https://zenodo.org/record/6503754/files/PubChemLite_exposomics_20220429.csv has a length of 3593, for a DNA snippet with 873 atoms: DFYPFJSPLUVPFJ-QJEDTDQSSA-N
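For anyone who wants to repeat the "longest InChI in a file" check, here is a minimal sketch using only the standard library. The column names `InChI` and `Identifier` are assumptions about the CSV header, and the two rows are small illustrative stand-ins rather than data from the PubChemLite file.

```python
import csv
import io


def longest_inchi(rows, column="InChI"):
    """Return (length, row) for the row with the longest value in `column`.

    `rows` is any iterable of dicts, e.g. a csv.DictReader.
    The default column name is an assumption; adjust it to the real header.
    """
    best_len, best_row = -1, None
    for row in rows:
        value = row.get(column) or ""
        if len(value) > best_len:
            best_len, best_row = len(value), row
    return best_len, best_row


# In-memory stand-in for the real file. InChI strings contain commas,
# so the field has to be quoted in CSV.
sample = io.StringIO(
    'Identifier,InChI\n'
    '1,"InChI=1S/CH4/h1H4"\n'
    '2,"InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"\n'
)
length, row = longest_inchi(csv.DictReader(sample))
print(length, row["Identifier"])
```

For a real file, replace `sample` with `open(path, newline="")`.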
How is that InChIKey valid? It has too many sections? (copy paste issue?)
The URL redirects OK tho (DFYPFJSPLUVPFJ-QJEDTDQSSA-N)
I thought we trimmed PCL to ~2000 but it seems that's sneaking through (MW 8000)? It is only in PCL due to this small bit of annotation: https://pubchem.ncbi.nlm.nih.gov/compound/DFYPFJSPLUVPFJ-QJEDTDQSSA-N#section=Drug-and-Medication-Information
@PaulThiessen might be able to answer the InChI length question for you, I am not sure ...
I'm not actually sure about atom limits in regular InChI, but PubChem has a limit of 999 atoms (including H) for compounds (historically because that's the limit of the MOL/SDF V2000 format).
I don't think there's any particular length limit for the full InChI string. The longest one in PubChem is 4789 characters (CID 160332983).
Indeed, the visible InChIKey was a cut-and-paste leftover. Fixed now. The InChI specs mention a limit of 1024 atoms on p. 18: https://www.inchi-trust.org/download/104/InChI_UserGuide.pdf Yours, Steffen
That number is surely not coincidental ... @PaulThiessen do you know if that changed in more recent versions (that documentation was for 1.04; you're now on 1.06 or 1.07, right?). I never get those log files when generating InChIs ...
We're using 1.06, although 1.07 is in the works and will be out soon. I'll ask the InChI folks directly what the current atom limit is.
OK, yes: standard InChI in current versions still has a limit of 1024 atoms.
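A rough way to see how close a structure is to these limits is to count atoms from the formula layer of its InChI. The sketch below is an assumption-laden helper, not part of any InChI library: `atom_counts` is a hypothetical name, it reads only the formula layer (so it ignores `/p` proton adjustments), and it returns both the all-atom and the heavy-atom count because the thread does not say whether the 1024 limit counts hydrogens (PubChem's 999 limit does, per the comment above).

```python
import re


def atom_counts(inchi):
    """Return (all_atoms, heavy_atoms) from the formula layer of an InChI.

    Hypothetical helper: handles dot-separated components with an optional
    leading multiplier (e.g. "Ca.2ClH"). InChIs without a formula layer
    (e.g. a bare proton) raise ValueError.
    """
    layers = inchi.split("/")
    if len(layers) < 2:
        raise ValueError("no formula layer found")
    total = heavy = 0
    for component in layers[1].split("."):
        m = re.fullmatch(r"(\d*)((?:[A-Z][a-z]?\d*)+)", component)
        if m is None:
            raise ValueError(f"unexpected formula component: {component!r}")
        multiplier = int(m.group(1) or 1)
        for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", m.group(2)):
            n = multiplier * int(count or 1)
            total += n
            if element != "H":
                heavy += n
    return total, heavy


# Ethanol, C2H6O
print(atom_counts("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"))
# A multi-component formula layer with a multiplier
print(atom_counts("InChI=1S/Ca.2ClH/h;2*1H/q+2;;/p-2"))
```

Comparing either number against 999 or 1024 is left to the caller, since which count applies depends on the tool.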
Thanks Paul!
For records with very long InChI codes, the importer doesn't fail gracefully. No validation problems are encountered, but the import crashes while trying to write the InChI code to the DB. As a result, zero records end up in the DB. CH_IUPAC is a VARCHAR(1200).
Expected behaviour: 1) the validator should catch the problem (though this is, strictly speaking, debatable, because the MassBank record spec itself doesn't specify a maximum length); 2) the database import should skip the problematic records.
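To illustrate the second expectation, here is a minimal sketch of a pre-import length check that sets aside over-long records instead of letting the whole import fail. It is not the MassBank importer's code: representing records as dicts, the `ACCESSION` and `CH_IUPAC` keys, and the function name are all assumptions for illustration. Only the 1200-character limit comes from the report above.

```python
MAX_CH_IUPAC_LEN = 1200  # width of the CH_IUPAC VARCHAR column, per the report


def partition_by_inchi_length(records, limit=MAX_CH_IUPAC_LEN):
    """Split records into (importable, skipped) by InChI string length.

    Hypothetical helper: `records` is a list of dicts with a "CH_IUPAC" key.
    """
    importable, skipped = [], []
    for record in records:
        inchi = record.get("CH_IUPAC") or ""
        if len(inchi) <= limit:
            importable.append(record)
        else:
            skipped.append(record)
    return importable, skipped


records = [
    {"ACCESSION": "REC-0001", "CH_IUPAC": "InChI=1S/CH4/h1H4"},
    # Artificially long placeholder string, not a real InChI.
    {"ACCESSION": "REC-0002", "CH_IUPAC": "InChI=1S/" + "C" * 1300},
]
ok, bad = partition_by_inchi_length(records)
for record in bad:
    print(
        f"skipping {record['ACCESSION']}: CH_IUPAC has "
        f"{len(record['CH_IUPAC'])} characters (limit {MAX_CH_IUPAC_LEN})"
    )
```

The same comparison could equally live in the validator as a warning, which would cover the first expectation without changing the record spec.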
Find attached a record set of five records where one causes this problem. Note: This is a work in progress dataset used in-house and derived from Florian Huber's dataset https://zenodo.org/records/10160791 (I hope this note and the CC BY in the records fulfill the CC BY requirements...) records.tar.gz