OpenGene / fastp

An ultra-fast all-in-one FASTQ preprocessor (QC/adapters/trimming/filtering/splitting/merging...)
MIT License

Distinct output file for merged PE reads #147

Closed oschwengers closed 4 years ago

oschwengers commented 5 years ago

Hi, thanks for recently adding the merging feature of PE reads #139 !

Would it be possible to implement a 3rd output parameter in order to separate the merged/unmerged reads in an intuitive way like assemblers (e.g. SPAdes) ask for them?

So, we could immediately end up with cleaned R1, R2 and merged read files.

For example:

fastp -i R1.fastq.gz -I R2.fastq.gz -o R1.clean.fastq.gz -O R2.clean.fastq.gz --out_merged R12.clean.fastq.gz

Maybe worth a distinct issue, but somewhat related to this: What happens to R1 or R2 reads that pass the filtering procedures but lack a related mate? Could these also be written to a distinct file, like SE reads? Trimmomatic is able to behave that way, and thus those valid reads can be fed into downstream analyses.

sfchen commented 5 years ago

What happens with either R1 or R2 reads resulting from successful filtering procedures but lacking a related mate?

If we output R1.clean.fq and R2.clean.fq separately, the unpaired reads should be discarded to keep the read name consistent.

oschwengers commented 5 years ago

Maybe my question was not precise enough. Currently, after passing all quality filters, only valid R1/R2 pairs are written to separate R1.clean.fq and R2.clean.fq files in an ordered manner with consistent read names.

But if for instance R1 passes the quality filters, but R2 of this read pair does not... fastp currently discards the R1 as well. Couldn't you output the passing mate (R1 in this case) to a distinct file only containing valid but unpaired reads? Thus, more high-quality reads can be used in subsequent assemblies, for instance. Of course, the passing unpaired reads are not ordered anymore, but this would be completely acceptable anyways.

What are your thoughts regarding the initial question?

oschwengers commented 5 years ago

Hi, a follow up...hopefully this clarifies what I am trying to ask/suggest. Below you find an extract from the SPAdes manual of how to provide PE data:

Paired-end libraries
--pe<#>-1 <file_name> 
    File with left reads for paired-end library number <#> (<#> = 1,2,..,9).

--pe<#>-2 <file_name> 
    File with right reads for paired-end library number <#> (<#> = 1,2,..,9).

--pe<#>-m <file_name> 
    File with merged reads from paired-end library number <#> (<#> = 1,2,..,9) 
    If the properties of the library permit, paired reads can be merged using special software. Non-empty files with (remaining) unmerged left/right reads (separate or interlaced) must be provided for the same library for SPAdes to correctly detect the original read length.

--pe<#>-s <file_name> 
    File with unpaired reads from paired-end library number <#> (<#> = 1,2,..,9) 
    For example, paired reads can become unpaired during the error correction procedure.

Currently, fastp can serve the R1 (pe-1) and R2 (pe-2) reads after filtering and recently, also merged R12 (pe-m).

In order to utilize as much genome information as possible, it would be great if fastp could:

  1. apply the various quality filters and correction
  2. try to merge R1 and R2
  3. write merged R12 (pe-m) reads to one file, e.g. m.fastq
  4. write unmerged paired reads to R1 (pe-1) and R2 (pe-2) files, e.g. R1.fastq, R2.fastq
  5. write unmerged unpaired reads (pe-s) to another file, e.g. s.fastq

If I got it right, in step 5 R2 reads would have to be reverse complemented before writing them to s.fastq along with residual R1 reads.
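The five steps above can be sketched roughly as follows. This is only an illustration of the proposed routing, not fastp's code: the function names `route_pair` and `reverse_complement` are hypothetical, the file names are the examples from the list, and whether R2 actually needs to be reverse complemented before going into s.fastq is left open here.

```python
from typing import Dict, Optional

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def route_pair(r1_ok: bool, r2_ok: bool, merged: Optional[str]) -> Dict[str, str]:
    """Decide which (hypothetical) output files a read pair contributes to.

    r1_ok / r2_ok: whether each mate passed the quality filters (step 1).
    merged: the merged sequence if merging succeeded (step 2), else None.
    Returns a mapping from example file name to the record written there.
    """
    if r1_ok and r2_ok:
        if merged is not None:
            return {"m.fastq": "merged"}             # step 3
        return {"R1.fastq": "R1", "R2.fastq": "R2"}  # step 4
    if r1_ok:
        return {"s.fastq": "R1"}                     # step 5, R2 was filtered out
    if r2_ok:
        return {"s.fastq": "R2"}                     # step 5, R1 was filtered out
    return {}                                        # both mates failed the filters
```

With this layout, the three kinds of files correspond to the SPAdes `-m`, `-1`/`-2` and `-s` inputs quoted above.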

Unfortunately, I don't have much knowledge about fastp internals, but with this procedure, fastp could provide merged, paired and unpaired reads in a single step. This would definitely be very beneficial for assemblies.

What do you think? Thanks a lot for considering and best regards!

sfchen commented 5 years ago

Hi,

Thanks for your suggestion. I completely understand what you want. But it will make fastp too complicated if we implement all output. People hate to use complicated tools.

I may first consider adding options to output unpaired R1 and R2.

oschwengers commented 5 years ago

Hi, sry for bothering ;-) that's a solid point and I totally agree with you that tools should not be too complicated!

But I guess there isn't that much additional complexity involved here...

Fastp knows when it deals with PE data due to the -i <input-1-file> / -I <input-2-file> parameters

So everything that needs to be changed would be a new argument to the -m parameter...

...and in addition, a new optional (and distinct from the merged read functionality) parameter/argument -s <residual-output-file> could be added.

I see that at least the second one can't be implemented right away and might take a while. But I would love to see this on the agenda because, by fulfilling these requirements, fastp would make a giant leap toward your "ultra-fast all-in-one FASTQ preprocessor" credo. Otherwise one would have to use several tools or rerun fastp with different parameters, which is clearly not the way to go... Best regards!

dborgesr commented 5 years ago

Just to jump in.

The issue with the unmated reads being included in the clean fastq is one I just bumped into and I definitely agree with you.

So I'm thinking that the unmated reads should go in a separate junk file, essentially R1 no pass.fastq and R2 no pass.fastq. That can include the reads that don't pass other filters as well, I think.

Should resolve a lot of pain (at least for me).

Also, very nice tool, especially because it's clean and simple to use, which is key.

sfchen commented 5 years ago

Just a quick update:

I am implementing the new features, and the new fastp will support 7 separate output files:

  1. merged.fq
  2. unmerged.R1.fq
  3. unmerged.R2.fq
  4. unpaired.R1.fq
  5. unpaired.R2.fq
  6. failed.fq

And a --include_unmerged option will be provided to redirect 2~5 to 1.

4 and 5 can be the same file, as well as 6 and 7.
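To make the redirection concrete, here is a minimal sketch of how a read category could be mapped to an output file, assuming the categories are keyed by the names in the list above. The `destination` function and the `outputs` dictionary are hypothetical and only model the behaviour described in this comment; they are not fastp's implementation.

```python
from typing import Dict, Optional

# Categories 2 to 5 from the list: reads that were not merged.
UNMERGED_CATEGORIES = {"unmerged.R1", "unmerged.R2", "unpaired.R1", "unpaired.R2"}


def destination(category: str, outputs: Dict[str, str],
                include_unmerged: bool = False) -> Optional[str]:
    """Return the file a read of the given category would be written to.

    outputs maps category names to file paths; a category without an entry
    is simply not written (None). With include_unmerged=True, categories
    2 to 5 are redirected to the merged file.
    """
    if include_unmerged and category in UNMERGED_CATEGORIES:
        return outputs.get("merged")
    return outputs.get(category)


# Example configuration in which 4 and 5 share one file.
outputs = {
    "merged": "merged.fq",
    "unmerged.R1": "unmerged.R1.fq",
    "unmerged.R2": "unmerged.R2.fq",
    "unpaired.R1": "unpaired.fq",
    "unpaired.R2": "unpaired.fq",
    "failed": "failed.fq",
}
```

Sharing a file is then just a matter of giving two categories the same path, which is why no extra option is needed for it in this sketch.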

dborgesr commented 5 years ago

Awesome! thank you so much, this will definitely be super useful

sfchen commented 5 years ago

Hi guys, this feature is implemented in fastp v0.19.9 (will be released soon), see the update here:

merge paired-end reads

For paired-end (PE) input, fastp supports stitching them by specifying the -m/--merge option. In this merging mode:

--failed_out can still be given to store the reads (either merged or unmerged) that failed to pass the filters.

In the output file, a tag like merged_xxx_yyy will be added to each read name to indicate how many base pairs are from read1 and from read2, respectively. For example, @NB551106:9:H5Y5GBGX2:1:22306:18653:13119 1:N:0:GATCAG merged_150_15 means that 150bp are from read1, and 15bp are from read2. fastp prefers the bases in read1 since they usually have higher quality than read2.
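For downstream scripts, the tag can be pulled out of the read name with a small helper. This is a sketch that assumes the tag is the last whitespace-separated field of the name, as in the example above; `parse_merged_tag` is a hypothetical name, not part of fastp.

```python
import re
from typing import Optional, Tuple

# Matches a trailing "merged_<bases from read1>_<bases from read2>" tag.
_MERGED_TAG = re.compile(r"merged_(\d+)_(\d+)$")


def parse_merged_tag(read_name: str) -> Optional[Tuple[int, int]]:
    """Return (bases_from_read1, bases_from_read2), or None if no tag is present."""
    match = _MERGED_TAG.search(read_name.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


name = "@NB551106:9:H5Y5GBGX2:1:22306:18653:13119 1:N:0:GATCAG merged_150_15"
print(parse_merged_tag(name))  # (150, 15)
```

The two numbers add up to the length of the merged read, 165bp in this example.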

This function is also based on overlap detection, which has the adjustable parameters overlap_len_require (default 30) and overlap_diff_limit (default 5).
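To illustrate how the two parameters interact, here is a deliberately simplified overlap search. It is a sketch under stated assumptions, not fastp's algorithm: it assumes read2 has already been reverse complemented, it only considers the case where read2 starts inside read1 and extends to its end or beyond, and `find_overlap` is a hypothetical name. Only the parameter names and defaults are taken from the text above.

```python
from typing import Optional, Tuple


def find_overlap(read1: str, read2_rc: str,
                 overlap_len_require: int = 30,
                 overlap_diff_limit: int = 5) -> Optional[Tuple[int, int]]:
    """Find where read2_rc (reverse complement of read2) starts within read1.

    Returns (offset, overlap_len) for the first offset whose overlap is at
    least overlap_len_require bases long and contains at most
    overlap_diff_limit mismatches, or None if no such offset exists.
    """
    for offset in range(len(read1) - overlap_len_require + 1):
        overlap_len = min(len(read1) - offset, len(read2_rc))
        if overlap_len < overlap_len_require:
            continue  # read2_rc is too short to give a long enough overlap
        mismatches = 0
        for a, b in zip(read1[offset:offset + overlap_len], read2_rc[:overlap_len]):
            if a != b:
                mismatches += 1
                if mismatches > overlap_diff_limit:
                    break  # too many differences at this offset
        if mismatches <= overlap_diff_limit:
            return offset, overlap_len
    return None


# Toy example with small thresholds so the result is easy to follow by hand.
print(find_overlap("AACCGGTT", "GGTTACGA", overlap_len_require=4, overlap_diff_limit=0))  # (4, 4)
```

In the toy example the merged read would be read1 followed by the part of read2_rc after the overlap, which matches the read1-first preference described above.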

oschwengers commented 5 years ago

Thank you so much! Also for the super fast implementation, Awesome job! The new features and command line UI open new possibilities...

oschwengers commented 5 years ago

Hi, just found a small typo in the usage:

      --unpaired2                      for PE input, if read2 passed QC but read1 not, it will be written to unpaired2. If --unpaired2 is same as --umpaired1 (default mode), both unpaired reads will be written to this same file. (string [=])

-> ...same as --umpaired1 (default mode)...

oschwengers commented 4 years ago

I just saw that this is still open. I'll create a pull request with a typo fix and close this one. Again, thanks a lot for this!