MassBank / MassBank-web

The web server application and directly connected components for a MassBank web server

Long InChI codes crash the database refresh #401

Open meowcat opened 5 months ago

meowcat commented 5 months ago

For records with very long InChI strings, the importer doesn't fail gracefully: validation reports no problems, but the import crashes while writing the InChI to the database, and as a result zero records end up in the DB. CH_IUPAC is a VARCHAR(1200).

Expected behaviour:

1. The validator should catch the problem (strictly speaking debatable, since the MassBank record spec doesn't specify a maximum length).
2. The database import should skip the problematic records instead of aborting.
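As a rough illustration of point 2, a length check against the current VARCHAR(1200) limit before the insert would let the remaining records land. This is only a sketch: the class, accession names and toy data below are invented for the example and are not the actual MassBank API.

    import java.util.Map;

    // Sketch only: skip records whose InChI exceeds the CH_IUPAC column size
    // (VARCHAR(1200) at the time of this issue) instead of letting the whole
    // refresh abort. Class and accession names are invented for the example.
    public class InchiLengthGuard {

        static final int CH_IUPAC_MAX = 1200;

        /** true if the InChI fits the column, false if the record should be skipped. */
        static boolean fitsColumn(String accession, String inchi) {
            if (inchi != null && inchi.length() > CH_IUPAC_MAX) {
                System.err.println(accession + ": InChI is " + inchi.length()
                        + " characters, longer than CH_IUPAC (" + CH_IUPAC_MAX + "); skipping.");
                return false;
            }
            return true;
        }

        public static void main(String[] args) {
            // toy data: one short InChI and one string far too long for the column
            Map<String, String> records = Map.of(
                    "MSBNK-TEST-0001", "InChI=1S/CH4/h1H4",
                    "MSBNK-TEST-0002", "InChI=1S/" + "C2H6O/c1-2-3/h3H,2H2,1H3".repeat(60));
            records.forEach((accession, inchi) -> {
                if (fitsColumn(accession, inchi)) {
                    System.out.println(accession + ": would be written to the database");
                }
            });
        }
    }

The same check could be reused by the validator for point 1, and the real import loop would additionally want a try/catch around the persist call so that an unexpected SQL error skips only the offending record.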

[+] Creating 1/0
 ✔ Container 5-mariadb-1  Running                                                                                                                                                                         0.0s 
RefreshDatabase version: 2.2.6-SNAPSHOT
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
5 records to send to database. 80% Done.java.sql.SQLSyntaxErrorException: (conn=17159) Data too long for column 'CH_IUPAC' at row 1
        at org.mariadb.jdbc.export.ExceptionFactory.createException(ExceptionFactory.java:282)
        at org.mariadb.jdbc.export.ExceptionFactory.create(ExceptionFactory.java:370)
        at org.mariadb.jdbc.message.ClientMessage.readPacket(ClientMessage.java:134)
        at org.mariadb.jdbc.client.impl.StandardClient.readPacket(StandardClient.java:883)
        at org.mariadb.jdbc.client.impl.StandardClient.readResults(StandardClient.java:822)
        at org.mariadb.jdbc.client.impl.StandardClient.readResponse(StandardClient.java:741)
        at org.mariadb.jdbc.client.impl.StandardClient.execute(StandardClient.java:665)
        at org.mariadb.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:92)
        at org.mariadb.jdbc.ClientPreparedStatement.executeLargeUpdate(ClientPreparedStatement.java:337)
        at org.mariadb.jdbc.ClientPreparedStatement.executeUpdate(ClientPreparedStatement.java:314)
        at com.zaxxer.hikari.pool.ProxyPreparedStatement.executeUpdate(ProxyPreparedStatement.java:61)
        at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeUpdate(HikariProxyPreparedStatement.java)
        at massbank.db.DatabaseManager.persistAccessionFile(DatabaseManager.java:326)
        at massbank.cli.RefreshDatabase.lambda$main$0(RefreshDatabase.java:67)
        at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
        at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179)
        at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
        at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1708)
        at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
        at java.base/java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291)
        at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:754)
        at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387)
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
        at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)

Attached is a set of five records, one of which causes this problem. Note: this is a work-in-progress dataset used in-house, derived from Florian Huber's dataset https://zenodo.org/records/10160791 (I hope this note and the CC BY in the records fulfill the CC BY requirements...) records.tar.gz
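For reference, a throwaway scan of the CH$IUPAC lines can show which of the attached records trips the limit without running the full import. This assumes the records use the usual CH$IUPAC: tag and a .txt extension; the directory argument is wherever records.tar.gz was unpacked, and the 1200-character limit is the current column size.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.stream.Stream;

    // Throwaway diagnostic: list records whose CH$IUPAC line holds an InChI longer
    // than the current CH_IUPAC column (VARCHAR(1200)). Not part of the MassBank
    // tooling; pass the directory the record files were unpacked into.
    public class FindLongInchis {
        static final int CH_IUPAC_MAX = 1200;

        public static void main(String[] args) throws IOException {
            List<Path> files;
            try (Stream<Path> walk = Files.walk(Path.of(args[0]))) {
                files = walk.filter(p -> p.toString().endsWith(".txt")).toList();
            }
            for (Path p : files) {
                for (String line : Files.readAllLines(p)) {
                    if (line.startsWith("CH$IUPAC: ")) {
                        String inchi = line.substring("CH$IUPAC: ".length());
                        if (inchi.length() > CH_IUPAC_MAX) {
                            System.out.println(p.getFileName() + ": InChI has " + inchi.length()
                                    + " characters (limit " + CH_IUPAC_MAX + ")");
                        }
                    }
                }
            }
        }
    }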

sneumann commented 5 months ago

Do we have any idea what the longest possible InChI is? Today's InChI is defined for a maximum of 1000 atoms; I read somewhere about an extension to 65K. The longest InChI in https://zenodo.org/record/6503754/files/PubChemLite_exposomics_20220429.csv is 3593 characters, for a DNA snippet with 873 atoms: DFYPFJSPLUVPFJ-QJEDTDQSSA-N

schymane commented 5 months ago

How is that InChIKey valid? It has too many sections? (copy paste issue?)

The URL redirects OK tho (DFYPFJSPLUVPFJ-QJEDTDQSSA-N)

I thought we trimmed PCL to ~2000 but it seems that's sneaking through (MW 8000)? It is only in PCL due to this small bit of annotation: https://pubchem.ncbi.nlm.nih.gov/compound/DFYPFJSPLUVPFJ-QJEDTDQSSA-N#section=Drug-and-Medication-Information

@PaulThiessen might be able to answer the InChI length question for you, I am not sure ...

PaulThiessen commented 5 months ago

I'm not actually sure about atom limits in regular InChI, but PubChem has a limit of 999 atoms (including H) for compounds (historically because that's the limit of the MOL/SDF V2000 format).

I don't think there's any particular length limit for the full InChI string. The longest one in PubChem is 4789 characters (CID 160332983).
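As a quick way to double-check a number like that, the PubChem PUG REST property endpoint returns the InChI as plain text, so its length can be measured directly. A minimal sketch (the reported length may shift between InChI versions):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Quick sanity check of an InChI length via the PubChem PUG REST property
    // endpoint. CID 160332983 is the record mentioned above; any CID works.
    public class InchiLengthCheck {
        public static void main(String[] args) throws Exception {
            String cid = "160332983";
            URI uri = URI.create("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/"
                    + cid + "/property/InChI/TXT");
            HttpResponse<String> resp = HttpClient.newHttpClient()
                    .send(HttpRequest.newBuilder(uri).build(), HttpResponse.BodyHandlers.ofString());
            String inchi = resp.body().trim();
            System.out.println("CID " + cid + ": InChI length = " + inchi.length());
        }
    }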

sneumann commented 5 months ago

Indeed, the visible InChIKey was a cut-and-paste leftover; fixed now. The InChI specs mention a limit of 1024 atoms on p. 18: https://www.inchi-trust.org/download/104/InChI_UserGuide.pdf Yours, Steffen

schymane commented 5 months ago

That number is surely not coincidental ... @PaulThiessen do you know if that changed in more recent versions (that documentation was 1.04, you're now on 1.06 or 1.07 right?). I never get those log files when generating InChIs ...

[image attachment]

PaulThiessen commented 5 months ago

We're using 1.06, although 1.07 is in the works and will be out soon. I'll ask the InChI folks directly what the current atom limit is.

PaulThiessen commented 5 months ago

Ok yes standard InChI in current versions still has a limit of 1024 atoms.

schymane commented 5 months ago

Thanks Paul!