FelixKrueger / TrimGalore

A wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data
GNU General Public License v3.0

Stdout stderr blocking open3 #96

Closed — aswarren closed this issue 4 years ago

aswarren commented 4 years ago

The app seems to hang due to blocking issues when trying to read from stdout and stderr of cutadapt (my guess is for sufficiently large inputs). Blocking issues mentioned here https://perldoc.perl.org/IPC/Open3.html
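The failure mode here is the classic pipe-buffer deadlock: if the parent drains only one of the two pipes, the child eventually blocks writing to the other once its OS buffer (typically ~64 KB on Linux) fills, and both processes wait on each other forever. A minimal illustration of the mechanism in Python (the same class of problem the IPC::Open3 docs warn about; the child process here is a made-up stand-in, not cutadapt):

```python
import subprocess
import sys

# The child writes more to stderr than a pipe buffer holds, then
# tries to print to stdout. If the parent reads only stdout, the
# child blocks in write(2) on stderr and the parent blocks in
# read(2) on stdout: a deadlock.
child = subprocess.Popen(
    [sys.executable, "-c",
     "import sys; sys.stderr.write('x' * 200_000); print('done')"],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# child.stdout.read()   # <- would hang forever at this point

# communicate() drains both pipes concurrently, avoiding the deadlock.
out, err = child.communicate()
```

The same rule applies to Perl's open3: whoever holds both pipe handles must service both of them.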

Since stderr doesn't need to be streamed I switched to open2 in a workaround here. https://github.com/aswarren/TrimGalore/commit/fbee08d6692af877fd968c1696dfd262449a6a79
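The idea behind the workaround, sketched here in Python rather than the script's actual Perl (the child command is a hypothetical stand-in for cutadapt): redirect stderr to a log file so that only stdout goes through a pipe. With a single pipe to drain, there is no second buffer that can fill up while the parent is busy reading the other.

```python
import subprocess
import sys
import tempfile

# Send stderr to a log file instead of a pipe; only stdout needs to
# be streamed, so nothing can back up behind an unread pipe buffer.
log = tempfile.NamedTemporaryFile(
    mode="wb", prefix="cutadapt_stderr_", suffix=".log", delete=False)

# Hypothetical stand-in for the real cutadapt invocation.
child = subprocess.Popen(
    [sys.executable, "-c",
     "import sys; sys.stderr.write('x' * 200_000); print('done')"],
    stdout=subprocess.PIPE, stderr=log)

# Safe to stream line by line: this is the only pipe we must read.
lines = [line.decode().rstrip() for line in child.stdout]
child.wait()
log.close()
```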

FelixKrueger commented 4 years ago

Hi Andrew,

thanks for reporting this issue, I had never come across this problem before...

If I remember correctly, Cutadapt was writing some things to STDOUT and other things (such as the trimming histogram stats) to STDERR (at least this used to be the case in 2010 or so...). How about we keep this in mind and change it later if needed (maybe via a PR?), and you keep your working version for the moment?

aswarren commented 4 years ago

Hi Felix, The command information from trimgalore/cutadapt is:

trim_galore --gzip -j 6 -o ./fqutil_trim_test/stage ./barcode.fastq.gz
Path to Cutadapt set as: 'cutadapt' (default)
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt version: 2.2 

So my version isn't quite so old (April 2019). I was using trimgalore at commit "cb1b287c4ad2c09b7769be2ba072ec7fa6f38d8e", prior to my modifications.

To my knowledge the changes I made don't change the behavior of trimgalore aside from the generation of an additional log file for the stderr of cutadapt. I'm happy to generate a PR but thought I would wait for your interpretation.

The input is a single-end, 4.6 GB gzip-compressed FastQ file (I can provide you the data through another medium). We are using your tool here: https://www.patricbrc.org/app/FastqUtil

All of the output from cutadapt seems to be generated; the script just never moves on from somewhere around here, I suspect: https://github.com/FelixKrueger/TrimGalore/blob/master/trim_galore#L1284-L1356

With my previously linked commits it does move on.

Other trimgalore output:

Found perfect matches for the following adapter sequences:
Adapter type    Count   Sequence        Sequences analysed      Percentage
Nextera 42      CTGTCTCTTATA    351198  0.01
smallRNA        27      TGGAATTCTCGG    351198  0.01
Illumina        3       AGATCGGAAGAGC   351198  0.00
Using Nextera adapter for trimming (count: 42). Second best hit was smallRNA (count: 27)
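For reference, the auto-detection shown in that log boils down to counting perfect matches of each adapter's first 12-13 bp in the reads and picking the winner. A rough, illustrative sketch of that idea in Python (not Trim Galore's actual Perl code; the function name and read limit are made up):

```python
import gzip
from itertools import islice

# First 12-13 bp of each adapter, as in the log above.
ADAPTERS = {
    "Illumina": "AGATCGGAAGAGC",
    "Nextera":  "CTGTCTCTTATA",
    "smallRNA": "TGGAATTCTCGG",
}

def autodetect(fastq_gz, n_reads=1_000_000):
    """Count perfect adapter matches in the first n_reads and
    return (winning adapter name, per-adapter counts)."""
    counts = dict.fromkeys(ADAPTERS, 0)
    with gzip.open(fastq_gz, "rt") as fh:
        # In a FASTQ record, every 4th line (offset 1) is the sequence.
        for i, line in enumerate(islice(fh, n_reads * 4)):
            if i % 4 == 1:
                for name, seq in ADAPTERS.items():
                    if seq in line:
                        counts[name] += 1
    return max(counts, key=counts.get), counts
```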

FelixKrueger commented 4 years ago

Hi Andrew,

Hmm, I somehow get the feeling that something weird is going on on your side, but I can't really say for sure... I've never heard of the process stalling so far, and I have processed a fair number of files in my time, some of which were several hundred GB in size... Would you mind sharing the file that is causing this for you? I can send you details for an FTP if required.

Just to ask again quickly: does it also happen with the latest version of Trim Galore (0.6.5, or the dev version)? Are you using Python 3, and do you have pigz installed? If not, could you drop the -j 6 and see if that helps? (I think -j 4 is the sweet spot, by the way; -j 6 doesn't add much extra.)

Cheers, Felix

aswarren commented 4 years ago

Thanks for the note on the cores. The file should be accessible with:

curl 'https://p3.theseed.org/services/WorkspaceDownload/CraFkOjH6hG_Y0fqaC4GdA/trimgalore_test_block_barcode.fastq.gz' --compressed -H 'Connection: keep-alive' -H 'Referer: https://www.patricbrc.org/workspace/public/anwarren@patricbrc.org/AndrewWarrenPublic/TestFolder' -H 'Upgrade-Insecure-Requests: 1' --output ./test.fq.gz

If that doesn't work you can get it here with an account:

https://www.patricbrc.org/workspace/public/anwarren@patricbrc.org/AndrewWarrenPublic/TestFolder

I moved from a CentOS box to an Ubuntu box and have confirmed the behavior with this file. I was a bit cryptic before: the commit hash I gave, "cb1b287c4ad2c09b7769be2ba072ec7fa6f38d8e", is the latest in this repo's commit history. In my latest test on Ubuntu the information is as follows (no pigz):

python --version
Python 3.7.6
cutadapt --version
2.10
./trim_galore --version

                        Quality-/Adapter-/RRBS-/Speciality-Trimming
                                [powered by Cutadapt]
                                  version 0.6.5dev

                               Last update: 05 05 2020

lsb_release -a
Distributor ID: Ubuntu
Description:    Ubuntu 19.10
Release:        19.10
Codename:       eoan

FelixKrueger commented 4 years ago

Hi Andrew,

Sorry for the long response time, I was stuck in talks and training sessions all day.

While I don't think I've got the perfect solution for you, I learned a few interesting facts:

fastqc trimgalore_test_block_barcode.fastq 
Started analysis of trimgalore_test_block_barcode.fastq
Approx 5% complete for trimgalore_test_block_barcode.fastq
Approx 10% complete for trimgalore_test_block_barcode.fastq
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
    at uk.ac.babraham.FastQC.Utilities.QualityCount.<init>(QualityCount.java:33)
    at uk.ac.babraham.FastQC.Modules.PerBaseQualityScores.processSequence(PerBaseQualityScores.java:141)
    at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:89)
    at java.base/java.lang.Thread.run(Unknown Source)

Interestingly, FastQC then stalls and never completes, but it does not terminate either. This happens regardless of whether the file is compressed with gzip or not.

Specifying more threads (e.g. fastqc -t 2 or more) allows FastQC to run to completion!

  -t --threads    Specifies the number of files which can be processed
                    simultaneously.  Each thread will be allocated 250MB of
                    memory so you shouldn't run more threads than your
                    available memory will cope with, and not more than
                    6 threads on a 32 bit machine

This would suggest that the reads are so long that the standard 250MB per thread is not sufficient for this data.

-- the read lengths are indeed very long: 91-91,655 bp. I am pretty sure that this is causing the issues, just like in FastQC

-- as this seems to be ONT data, a LOT of the qualities are fairly low. This is probably OK and does not need to be trimmed; I am sure ONT downstream programs take this into account somehow

-- ONT data contains neither the Illumina, Nextera nor smallRNA adapters (as it is ONT data). Therefore, I would argue that you really don't need to use Trim Galore on this data at all, as it does all the wrong things!
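The read-length range quoted above can be confirmed with a few lines of code; an illustrative helper (not part of either tool) that scans a gzipped FASTQ:

```python
import gzip

def length_range(fastq_gz):
    """Return (min, max) read length of a gzipped FASTQ file.
    Every 4th line, starting at the 2nd, is the sequence line."""
    lo, hi = None, None
    with gzip.open(fastq_gz, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:
                n = len(line.rstrip("\n"))
                lo = n if lo is None else min(lo, n)
                hi = n if hi is None else max(hi, n)
    return lo, hi
```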

So bottom line: There is a good chance that open3 is indeed causing some kind of blocking or memory issues that prevent Trim Galore from trimming this particular data set. I would however argue that ONT data should not at all be trimmed in this way anyway. Does that make sense?

aswarren commented 4 years ago

Interesting! It does make sense. I believe it is ONT data, but I'm still seeking confirmation. Would it make sense to use the functionality of FastQC (assuming a fix to the current issues) to identify over-represented sequences as an input for trimming in cutadapt? I believe there is some barcoding that could be removed in this case. If FastQC isn't returning, I'm not sure why the change I made solves the problem, though it does seem to prevent the command from hanging indefinitely. Do you think a downstream bug report for FastQC is warranted? Thanks for looking into this and for creating such a useful tool.

FelixKrueger commented 4 years ago

Hi Andrew,

I'm afraid I don't know all that much about Nanopore data and how it is trimmed and used further downstream. I suppose some googling or a literature search should quickly give you an idea of the current way to go about this data. My gut feeling would be, though, that standard short-read trimming is probably not warranted, especially given the error profiles of the platform.

Regarding FastQC, I am not sure what you mean by a downstream bug? The issue is probably simply that reads of 100 kb cost more memory than the default 250 MB used per thread, so using a higher value of -t seems to solve the issue.

All the best, Felix