PacificBiosciences / pb-metagenomics-tools

Tools and pipelines tailored to using PacBio HiFi Reads for metagenomics
BSD 3-Clause Clear License

SQLite_TOOBIG error SAM2RMA #77

Closed MicroSeq closed 2 months ago

MicroSeq commented 4 months ago


Name the workflow: Taxonomic-Profiling-Diamond-Megan

Describe the bug

The workflow is crashing at the MakeRMAUnfiltered/Filtered step with what appears to be a SQLite error. I tried increasing the memory available to MEGAN (~400 GB), but that did not resolve the issue. Possibly I need to reduce the number of top hits at the diamond blastx step (currently `--top 10`)? Or perhaps I can tweak the existing SAM file to make this manageable, and avoid having to re-do the alignments?
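If DIAMOND's SAM output groups each query's alignments together, best hit first (an assumption worth verifying on the actual file), the top-hit reduction could in principle be applied to the existing SAM file instead of re-running blastx. A minimal sketch (the function name and the grouped/sorted assumption are mine, not part of the pipeline):

```python
from collections import defaultdict

def filter_top_hits(sam_lines, n=5):
    """Keep all SAM header lines plus at most n alignment records per query.

    Assumes alignments for each query are grouped together and sorted
    best-first, as DIAMOND's SAM output typically is -- verify on your file.
    """
    counts = defaultdict(int)
    out = []
    for line in sam_lines:
        if line.startswith("@"):            # header lines pass through unchanged
            out.append(line)
            continue
        qname = line.split("\t", 1)[0]      # first SAM column is the query name
        if counts[qname] < n:
            counts[qname] += 1
            out.append(line)
    return out
```

Streaming this over the file line by line (rather than loading it into memory) would keep it practical for large SAM files.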

Relevant section of the log file:

```
Input domination filter: MinPercentCoverToStronglyDominate=90.0 and TopPercentScoreToStronglyDominate=90.0
10% 20% 30% 40%
Caught:
org.sqlite.SQLiteException: [SQLITE_TOOBIG] String or BLOB exceeds size limit (statement too long)
    at org.xerial.sqlitejdbc@3.39.3.0/org.sqlite.core.DB.newSQLException(DB.java:1135)
    at org.xerial.sqlitejdbc@3.39.3.0/org.sqlite.core.DB.newSQLException(DB.java:1146)
    at org.xerial.sqlitejdbc@3.39.3.0/org.sqlite.core.DB.throwex(DB.java:1106)
    at org.xerial.sqlitejdbc@3.39.3.0/org.sqlite.core.NativeDB.prepare_utf8(Native Method)
    at org.xerial.sqlitejdbc@3.39.3.0/org.sqlite.core.NativeDB.prepare(NativeDB.java:122)
    at org.xerial.sqlitejdbc@3.39.3.0/org.sqlite.core.DB.prepare(DB.java:264)
    at org.xerial.sqlitejdbc@3.39.3.0/org.sqlite.jdbc3.JDBC3Statement.lambda$executeQuery$1(JDBC3Statement.java:75)
    at org.xerial.sqlitejdbc@3.39.3.0/org.sqlite.jdbc3.JDBC3Statement.withConnectionTimeout(JDBC3Statement.java:429)
    at org.xerial.sqlitejdbc@3.39.3.0/org.sqlite.jdbc3.JDBC3Statement.executeQuery(JDBC3Statement.java:73)
    at megan/megan.accessiondb.AccessAccessionMappingDatabase.getValues(AccessAccessionMappingDatabase.java:222)
    at megan/megan.rma6.RMA6FromBlastCreator.parseFiles(RMA6FromBlastCreator.java:257)
    at megan/megan.tools.SAM2RMA6.createRMA6FileFromSAM(SAM2RMA6.java:340)
    at megan/megan.tools.SAM2RMA6.run(SAM2RMA6.java:307)
    at megan/megan.tools.SAM2RMA6.main(SAM2RMA6.java:69)
```
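The failing frame (`AccessAccessionMappingDatabase.getValues`) suggests MEGAN is assembling one very large SQL statement for its accession lookups, which trips SQLite's statement-size limit rather than a memory limit. The standard workaround pattern for that class of error is to chunk the lookup into fixed-size batches. A sketch against a hypothetical `mappings(accession, taxid)` table (not MEGAN's actual schema), for illustration only:

```python
import sqlite3

def lookup_in_batches(conn, accessions, batch_size=999):
    """Query a large list of IDs in fixed-size chunks so that no single
    SQL statement grows past SQLite's size limit (the SQLITE_TOOBIG case).
    """
    results = []
    for i in range(0, len(accessions), batch_size):
        chunk = accessions[i:i + batch_size]
        placeholders = ",".join("?" * len(chunk))   # one '?' per ID in this chunk
        sql = (
            "SELECT accession, taxid FROM mappings "
            f"WHERE accession IN ({placeholders})"
        )
        results.extend(conn.execute(sql, chunk).fetchall())
    return results
```

This is only to show why reducing `--top` helped: fewer hits per read means fewer accessions per lookup, so MEGAN's generated statements stay under the limit.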

Expected behavior

Screenshots: (image attached)

Log files

NitrifyingCombined.MakeRMA.filtered.readCount.log

MicroSeq commented 4 months ago

I am wondering if it is a cache size issue, so I am re-running with the sam2rma cache size set to 5000 instead of the default of 10000.

https://megan.cs.uni-tuebingen.de/t/sqlite-toobig-error-from-daa2rma/2212/5

Update: crashed at the same point with the decreased cache size.

dportik commented 3 months ago

Hi @MicroSeq, I think I recall seeing this issue when I was running the pipeline on MAGs rather than reads. Because MEGAN writes the nucleotide sequences into the database, it ran into size limitations. Can you confirm the size distribution of the sequences you are using as input?
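To answer the size-distribution question without extra dependencies, one quick way is to tally sequence lengths directly from the FASTA input (helper names here are illustrative, not part of the pipeline):

```python
def fasta_lengths(path):
    """Collect sequence lengths from a plain-text FASTA file."""
    lengths = []
    current = 0
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):        # new record: flush previous length
                if current:
                    lengths.append(current)
                current = 0
            else:
                current += len(line.strip())
    if current:                             # flush the final record
        lengths.append(current)
    return lengths

def summarize(lengths):
    """Simple summary stats: count, min, median, max."""
    lengths = sorted(lengths)
    n = len(lengths)
    return {"n": n, "min": lengths[0], "median": lengths[n // 2], "max": lengths[-1]}
```

A very long tail (e.g. MAG contigs rather than HiFi reads) would be consistent with MEGAN's database hitting its size limit.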

MicroSeq commented 3 months ago

Hey @dportik, I was able to get the pipeline to complete using `--top 5` instead of `--top 10` for the hits. If you need the size distribution for troubleshooting purposes, I can go back and check, but the reads may be a bit shorter than typical since the DNA quality was not great. My understanding is that the libraries were size-selected per the SMRTbell 3.0 metagenomics protocol.

dportik commented 3 months ago

Hi @MicroSeq, Sounds frustrating, but thanks for the update. MEGAN has quite a few limitations and wasn't designed with large volumes of long-read data in mind.

Can I ask whether you were using the workflow primarily for taxonomic annotation or for functional profiling?