Closed aswarren closed 4 years ago
Hi Andrew,
thanks for reporting this issue, I had never come across this problem before...
If I remember it correctly then Cutadapt was writing some things to STDOUT, and other things (such as the trimming histogram stats) to STDERR (at least this used to be the case in 2010 or so...). Do you think we should keep it in mind and change it later if needed (maybe via a PR?) and you keep your working version for the moment?
Hi Felix, The command information from trimgalore/cutadapt is:
trim_galore --gzip -j 6 -o ./fqutil_trim_test/stage ./barcode.fastq.gz
Path to Cutadapt set as: 'cutadapt' (default)
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt version: 2.2
So my version isn't quite so old (April 2019). Using trimgalore "cb1b287c4ad2c09b7769be2ba072ec7fa6f38d8e" prior to modifications.
To my knowledge the changes I made don't change the behavior of trimgalore aside from the generation of an additional log file for the stderr of cutadapt. I'm happy to generate a PR but thought I would wait for your interpretation.
The input is single end 4.6Gb fastq compressed (I can provide you the data through another medium). We are using your tool here https://www.patricbrc.org/app/FastqUtil
All of the output from cutadapt seems to be generated it just never moves on from somewhere here I suspect https://github.com/FelixKrueger/TrimGalore/blob/master/trim_galore#L1284-L1356
With my previously linked commits it does move on.
Other trimgalore output:
Found perfect matches for the following adapter sequences:
Adapter type Count Sequence Sequences analysed Percentage
Nextera 42 CTGTCTCTTATA 351198 0.01
smallRNA 27 TGGAATTCTCGG 351198 0.01
Illumina 3 AGATCGGAAGAGC 351198 0.00
Using Nextera adapter for trimming (count: 42). Second best hit was smallRNA (count: 27)
Hi Andrew,
Hmm, I somehow get the feeling that something weird is going on your side, but I can't really say for sure... I've never heard of the process stalling so far, and I have processed a fair amount of files in my time, some of which being several hundred GB big... Would you mind sharing the file that is causing this for you? I can send you details for an FTP if required.
Just to ask again quickly, does it work with the latest version of Trim Galore (0.6.5 or the dev version?), and are you using Python3 and do you have pigz
installed? If not, could you drop the -j 6
and see if that helps? (I think -j 4
is a sweet spot by the way, -j 6
doesn't add much extra).
Cheers, Felix
Thanks for the note on the cores. The file should be accessible with:
curl 'https://p3.theseed.org/services/WorkspaceDownload/CraFkOjH6hG_Y0fqaC4GdA/trimgalore_test_block_barcode.fastq.gz' --compressed -H 'Connection: keep-alive' -H 'Referer: https://www.patricbrc.org/workspace/public/anwarren@patricbrc.org/AndrewWarrenPublic/TestFolder' -H 'Upgrade-Insecure-Requests: 1' --output ./test.fq.gz
If that doesn't work you can get it here with an account:
https://www.patricbrc.org/workspace/public/anwarren@patricbrc.org/AndrewWarrenPublic/TestFolder
I moved from a CentOS box to a Ubuntu box and have confirmed the behavior with this file. I was a bit cryptic before, the commit hash I gave is the latest on the git commit history for this repo "cb1b287c4ad2c09b7769be2ba072ec7fa6f38d8e" In my latest test on Ubuntu the information is as follows (no pigz):
python --version
Python 3.7.6
cutadapt --version
2.10
./trim_galore --version
Quality-/Adapter-/RRBS-/Speciality-Trimming
[powered by Cutadapt]
version 0.6.5dev
Last update: 05 05 2020
lsb_release -a
Distributor ID: Ubuntu
Description: Ubuntu 19.10
Release: 19.10
Codename: eoan
Hi Andrew,
Sorry for the long response time, I was stuck in talks and training sessions all day.
While I don't think I've got the perfect solution for you, I learned a few interesting facts:
FastQC
:fastqc trimgalore_test_block_barcode.fastq
Started analysis of trimgalore_test_block_barcode.fastq
Approx 5% complete for trimgalore_test_block_barcode.fastq
Approx 10% complete for trimgalore_test_block_barcode.fastq
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
at uk.ac.babraham.FastQC.Utilities.QualityCount.<init>(QualityCount.java:33)
at uk.ac.babraham.FastQC.Modules.PerBaseQualityScores.processSequence(PerBaseQualityScores.java:141)
at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:89)
at java.base/java.lang.Thread.run(Unknown Source)
Interestingly, FastQC only stalls and never completes, but does not terminate. This happens regardless of whether the file is compressed with gzip
or not.
Specifying more threads (e.g. fastqc -t 2
or more) allows FastQC to run to completion!
-t --threads Specifies the number of files which can be processed
simultaneously. Each thread will be allocated 250MB of
memory so you shouldn't run more threads than your
available memory will cope with, and not more than
6 threads on a 32 bit machine
This would suggest that the reads are so long that you the standard 250MB per thread is not sufficient for this data.
-- the read lengths are indeed very long: 91-91655. I am pretty sure that this is causing the issues just like in FastQC
-- as this seems to be ONT data, a LOT of the qualities are fairly low. This is probably OK, and does not need to be trimmed. I am sure, ONT downstream programs take this into account somehow
-- ONT dat does neither contain the Illumina, Nextera or smallRNA adapters (as it is ONT data). Therefore, I would argue that you really don't to use Trim Galore on this data at all, as it does all the wrong things!
pigz
is required for multi-core trimming. If you don't have it installed, please don't use -j
> 1So bottom line: There is a good chance that open3
is indeed causing some kind of blocking or memory issues that prevent Trim Galore from trimming this particular data set. I would however argue that ONT data should not at all be trimmed in this way anyway. Does that make sense?
Interesting! It does make sense. I believe it is ONT but I'm still seeking confirmation. Would it make sense to use the functionality of FastQC (assuming a fix to the current issues) to identify over-represented sequences as an input for trimming in cutadapt? I believe there is some barcoding that could be removed in this case. If FastQC isn't returning I'm not sure why the change I made solves the problem, though it does seem to prevent the command from hanging indefinitely. Do you think a downstream bug is warranted for FastQC? Thanks for looking into this and for creating such a useful tool.
Hi Andrew,
I'm afraid I don't know all that much about Nanopore data, and how it is trimmed and used further downstream. I suppose some googling or literature search should quickly give you an idea about the current steps to go about this data. My gut feeling would be though that a standard trimming for short reads is probably not warranted, especially giving the error profiles of the platform.
Regarding FastQC, I am not sure what you were thinking of as a downstream bug? The issue is probably simply that reads of 100kb just cost more memory than the default of 250MB that is used per thread, so using a higher value of -t
seems to solve the issue.
All the best, Felix
The app seems to hang due to blocking issues when trying to read from stdout and stderr of cutadapt (my guess is for sufficiently large inputs). Blocking issues mentioned here https://perldoc.perl.org/IPC/Open3.html
Since stderr doesn't need to be streamed I switched to open2 in a workaround here. https://github.com/aswarren/TrimGalore/commit/fbee08d6692af877fd968c1696dfd262449a6a79