MikkelSchubert / adapterremoval

AdapterRemoval v2 - rapid adapter trimming, identification, and read merging
http://adapterremoval.readthedocs.io/
GNU General Public License v3.0

Add option for singular 'combined' output FASTQ files #52

Open jfy133 opened 3 years ago

jfy133 commented 3 years ago

Currently, AdapterRemoval2 offers many different output 'streams' for each way a set of input files can be processed. In particular, quality trimmed vs non-quality trimmed get their own output flags.

However, this makes it difficult for downstream pipeline developers to define exactly which output files should be expected and used for subsequent analysis. AR2 has many useful options, but each combination of options produces a different combination of output files, and it can be hard to work out which ones to use. This also leads to a lot of code duplication in pipeline processes, e.g. nf-core/eager has 9 separate command statements rather than a single one using dynamic variable input: https://github.com/nf-core/eager/blob/de38b07149d3dabdfa38b0014c4126b2fe17ca12/main.nf#L855-L971

It would be helpful to have an option that produces a single FASTQ file with all valid output (i.e. everything not discarded), based on the parameters set by the user.

For example:

(etc.)

One addition

This would greatly simplify a lot of the manual processing that currently has to be done by pipelines/users.

jfy133 commented 3 years ago

Ping @apeltzer for tracking

MikkelSchubert commented 3 years ago

A first implementation of this request is now available in the master branch (v3 alpha), but see below.

The TL;DR is that you can accomplish what you want simply by specifying the same output filename for multiple output types (/dev/stdout if you want to pipe it):

AdapterRemoval --gzip \
    --file1 input_1.fq --file2 input_2.fq \
    --output1 output_interleaved_pe.fq.gz --output2 output_interleaved_pe.fq.gz \
    --outputmerged output_kinda_se.fq.gz --singleton output_kinda_se.fq.gz

This will result in an interleaved file containing mate 1/2 reads and a file containing merged and singleton reads. You could also throw everything in one file, as I see you doing in the linked pipeline.
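For illustration, the "everything in one file" variant could be sketched roughly as follows. This is a dry run that only prints the command. The flag names are taken from the command above and belong to the v3 alpha, so they may change; `--file2` for the mate 2 input and the output name `combined.fq.gz` are assumptions.

```shell
# Dry-run sketch: route every retained output type to a single file.
# Flags are copied from the example in this thread (v3 alpha, subject to change).
# --file2 and the output name are assumptions made for this illustration.
combined="combined.fq.gz"

set -- AdapterRemoval --gzip \
    --file1 input_1.fq --file2 input_2.fq \
    --output1 "$combined" --output2 "$combined" \
    --outputmerged "$combined" --singleton "$combined"

# Print the command instead of running it; replace the next two lines
# with a bare "$@" to actually execute it.
cmd_line="$*"
echo "$cmd_line"
```

Because every output type points at the same path, a pipeline would only need to track one output file for this step.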

Another feature that might also make your life easier is the new gzip compression, which defaults to block-based compression using libdeflate to achieve much higher throughput in single-threaded and (especially) multi-threaded mode. Assuming that your downstream tools are compatible with bgzip-like gzip files (most are, in my experience), using that instead of a separate pigz step could save both time and complexity in your pipeline.
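As a rough illustration of why block-based gzip files usually work with ordinary tools: the gzip format allows several independently compressed members to be concatenated, and standard decompressors read them back as one stream. The sketch below uses plain `gzip` and made-up reads; it does not reproduce how AdapterRemoval itself writes its blocks.

```shell
# Each gzip invocation below writes one independent gzip member ("block").
tmpdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' | gzip -c >  "$tmpdir/blocks.fq.gz"
printf '@read2\nTTGA\n+\nIIII\n' | gzip -c >> "$tmpdir/blocks.fq.gz"

# Standard gzip decompresses the concatenated members as a single stream.
decompressed=$(gzip -dc "$tmpdir/blocks.fq.gz")
printf '%s\n' "$decompressed"

rm -r "$tmpdir"
```

A downstream tool that reads only the first member and then stops would truncate such a file, which is the kind of incompatibility to check for.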

But as I said, this is part of the v3 alpha, and there are a number of breaking changes in already and more planned, so you won't be able to use it in your pipeline straight away. However, I would appreciate it if you could take a look and see if there are any problems I haven't thought of or obvious additions that would benefit you. I can write up a list of current and planned breaking changes, if that is helpful, which is something I have to do at some point anyway.

The name-mangling step you carry out as part of your pipeline (AdapterRemovalFixPrefix) could probably be added, for example.

jfy133 commented 3 years ago

This is fantastic @MikkelSchubert, and also perfect timing as we are about to initiate a re-write of the eager pipeline as well, so thank you very much!

I will schedule time early next week to test this for you 👍

Edit: For the name-mangling thing, I personally wouldn't worry about that for now. We only use that for a single downstream tool, which we don't really recommend very often anymore anyway, so it wouldn't be worth the effort (I think anyway, up to you on that one of course ;))

jfy133 commented 3 years ago

@MikkelSchubert Sorry, I just looked through your commit history and it looks like there has been a lot of activity in the last few days. A list (or general summary) of changes would be good to have, so I know what to expect if I do a comparison with the latest stable release.

MikkelSchubert commented 3 years ago

I've added a summary of current breaking/major changes to the changelog:

https://github.com/MikkelSchubert/adapterremoval/blob/master/CHANGES.md