Closed oschwengers closed 4 years ago
What happens with either R1 or R2 reads resulting from successful filtering procedures but lacking a related mate?
If we output R1.clean.fq and R2.clean.fq separately, the unpaired reads should be discarded to keep the read names consistent.
Maybe my question was not precise enough. Currently, after passing all quality filters, only valid R1/R2 pairs are written to separate R1.clean.fq and R2.clean.fq files in an ordered manner with consistent read names.
But if for instance R1 passes the quality filters, but R2 of this read pair does not... fastp currently discards the R1 as well. Couldn't you output the passing mate (R1 in this case) to a distinct file only containing valid but unpaired reads? Thus, more high-quality reads can be used in subsequent assemblies, for instance. Of course, the passing unpaired reads are not ordered anymore, but this would be completely acceptable anyways.
What are your thoughts regarding the initial question?
Hi, a follow up...hopefully this clarifies what I am trying to ask/suggest. Below you find an extract from the SPAdes manual of how to provide PE data:
Paired-end libraries
--pe<#>-1 <file_name>
File with left reads for paired-end library number <#> (<#> = 1,2,..,9).
--pe<#>-2 <file_name>
File with right reads for paired-end library number <#> (<#> = 1,2,..,9).
--pe<#>-m <file_name>
File with merged reads from paired-end library number <#> (<#> = 1,2,..,9)
If the properties of the library permit, paired reads can be merged using special software. Non-empty files with (remaining) unmerged left/right reads (separate or interlaced) must be provided for the same library for SPAdes to correctly detect the original read length.
--pe<#>-s <file_name>
File with unpaired reads from paired-end library number <#> (<#> = 1,2,..,9)
For example, paired reads can become unpaired during the error correction procedure.
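To make the mapping concrete, here is a minimal Python sketch that assembles a SPAdes argument list from the four kinds of input described in the excerpt above. The --pe<#>-1/-2/-m/-s flags are taken from the excerpt; the function name, the file names and the default output directory are hypothetical and only serve as an illustration.

```python
def spades_pe_args(left, right, merged=None, unpaired=None, lib=1, outdir="spades_out"):
    """Build a SPAdes argument list for one paired-end library.

    Flag names follow the manual excerpt; everything else is illustrative.
    """
    if not 1 <= lib <= 9:
        raise ValueError("library number must be between 1 and 9")
    args = ["spades.py", f"--pe{lib}-1", left, f"--pe{lib}-2", right]
    if merged is not None:
        # Merged reads from the same library.
        args += [f"--pe{lib}-m", merged]
    if unpaired is not None:
        # Reads that lost their mate, e.g. during filtering.
        args += [f"--pe{lib}-s", unpaired]
    args += ["-o", outdir]
    return args


print(" ".join(spades_pe_args("R1.clean.fq", "R2.clean.fq",
                              merged="merged.fq", unpaired="unpaired.fq")))
```

The list form can be handed to subprocess.run without shell quoting issues; the optional arguments are simply omitted when no merged or unpaired file exists.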
Currently, fastp can serve the R1 (pe-1) and R2 (pe-2) reads after filtering and recently, also merged R12 (pe-m).
In order to utilize as much genome information as possible, it would be great if fastp could:
If I got it right, in step 5 R2 reads would have to be reverse complemented before writing them to s.fastq along with residual R1 reads.
Unfortunately, I don't have much knowledge about fastp internals, but by this procedure, fastp could provide merged, paired and unpaired reads in a single step. This would definitely be very beneficial for assemblies.
What do you think? Thanks a lot for considering and best regards!
Hi,
Thanks for your suggestion. I completely understand what you want. But it will make fastp too complicated if we implement all output. People hate to use complicated tools.
I may first consider adding options to output unpaired R1 and R2.
Hi, sry for bothering ;-) that's a solid point and I totally agree with you that tools should not be too complicated!
But I guess there isn't much additional complexity involved here...
Fastp knows when it deals with PE data due to the -i <input-1-file> / -I <input-2-file> parameters
-o <output-1-file> / -O <output-2-file>
-m. All you need to add is a related argument, so it would become -m <merged-output-file>
-s <residual-output-file>
So everything that needs to be changed would be a new argument to the -m parameter...
...and in addition, a new optional (and distinct from the merged read functionality) parameter/argument -s <residual-output-file> could be added.
I see that at least the second one can't be implemented right away and might take a while. But I would love to see this on the agenda because, by fulfilling these requirements, fastp would make a giant leap in terms of your "ultra-fast all-in-one FASTQ preprocessor" credo. Otherwise, one would have to use several tools or rerun fastp with different parameters, which is clearly not the way to go... Best regards!
Just to jump in.
The issue with the unmated reads being included in the clean fastq is one I just bumped into and I definitely agree with you.
So I'm thinking that the unmated reads should go in a separate junk file, essentially R1 no pass.fastq and R2 no pass.fastq. That can include the reads that don't pass other filters as well, I think.
Should resolve a lot of pain (at least for me).
Also, very nice tool, especially because it's clean and simple to use, which is key.
Just a quick update:
I am implementing the new features, and the new fastp will support 7 separate output files:
1 merged.fq
2 unmerged.R1.fq
3 unmerged.R2.fq
4 unpaired.R1.fq
5 unpaired.R2.fq
6 failed.fq
And a --include_unmerged option will be provided to redirect 2~5 to 1.
4 and 5 can be the same file, as well as 6 and 7.
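The routing implied by this list can be sketched in a few lines of Python. This is only a simplified model of the behavior described in this comment, not fastp's actual code: the function name is hypothetical, and sending the failing mate of a broken pair (and a merged read that fails filtering) to failed.fq is an assumption.

```python
def route_pair(mergeable, r1_ok, r2_ok, merged_ok=True, include_unmerged=False):
    """Return a dict mapping each emitted record to its output file.

    mergeable: the pair overlaps and can be merged into one read.
    r1_ok / r2_ok: the individual reads pass the filters.
    merged_ok: the merged read passes the filters (only used if mergeable).
    """
    if mergeable:
        return {"merged": "merged.fq" if merged_ok else "failed.fq"}

    def dest(name):
        # --include_unmerged redirects files 2~5 to the merged file.
        return "merged.fq" if include_unmerged else name

    if r1_ok and r2_ok:
        return {"read1": dest("unmerged.R1.fq"), "read2": dest("unmerged.R2.fq")}
    if r1_ok:
        return {"read1": dest("unpaired.R1.fq"), "read2": "failed.fq"}
    if r2_ok:
        return {"read1": "failed.fq", "read2": dest("unpaired.R2.fq")}
    return {"read1": "failed.fq", "read2": "failed.fq"}


print(route_pair(mergeable=False, r1_ok=True, r2_ok=False))
```

Failed reads are never redirected, so with the redirect enabled the merged file would still contain only reads that passed the filters.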
Awesome! thank you so much, this will definitely be super useful
Hi guys, this feature is implemented in fastp v0.19.9 (will be released soon), see the update here:
For paired-end (PE) input, fastp supports stitching them by specifying the -m/--merge option. In this merging mode:
--merged_out should be given to specify the file to store merged reads; otherwise you should enable --stdout to stream the merged reads to STDOUT. The merged reads are also filtered.
--out1 and --out2 will be the reads that cannot be merged successfully, but both pass all the filters.
--unpaired1 will be the reads that cannot be merged, read1 passes filters but read2 doesn't.
--unpaired2 will be the reads that cannot be merged, read2 passes filters but read1 doesn't.
--include_unmerged can be enabled to make reads of --out1, --out2, --unpaired1 and --unpaired2 redirected to --merged_out. So you will get a single output file. This option is disabled by default.
--failed_out can still be given to store the reads (either merged or unmerged) that failed to pass the filters.
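As an illustration of how these options fit together, the following Python sketch builds a fastp argument list for merging mode. It only uses options named in this thread (-i, -I, --merge, --merged_out, --out1, --out2, --unpaired1, --unpaired2, --failed_out, --include_unmerged); the function name and all file names are hypothetical.

```python
def fastp_merge_args(in1, in2, merged_out, out1=None, out2=None,
                     unpaired1=None, unpaired2=None, failed_out=None,
                     include_unmerged=False):
    """Build a fastp argument list for PE merging mode."""
    args = ["fastp", "-i", in1, "-I", in2, "--merge", "--merged_out", merged_out]
    optional = [
        ("--out1", out1),            # unmerged pairs, read1
        ("--out2", out2),            # unmerged pairs, read2
        ("--unpaired1", unpaired1),  # read1 passed, read2 did not
        ("--unpaired2", unpaired2),  # read2 passed, read1 did not
        ("--failed_out", failed_out),
    ]
    for flag, path in optional:
        if path is not None:
            args += [flag, path]
    if include_unmerged:
        args.append("--include_unmerged")
    return args


print(" ".join(fastp_merge_args(
    "R1.fq", "R2.fq", "merged.fq",
    out1="unmerged.R1.fq", out2="unmerged.R2.fq",
    unpaired1="unpaired.fq", unpaired2="unpaired.fq",
    failed_out="failed.fq")))
```

Passing the same path for both unpaired options mirrors the note earlier in the thread that the two unpaired outputs can be the same file.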
In the output file, a tag like merged_xxx_yyy will be added to each read name to indicate how many base pairs are from read1 and from read2, respectively. For example, @NB551106:9:H5Y5GBGX2:1:22306:18653:13119 1:N:0:GATCAG merged_150_15 means that 150bp are from read1, and 15bp are from read2. fastp prefers the bases in read1 since they usually have higher quality than read2.
This function is also based on overlapping detection, which has adjustable parameters overlap_len_require (default 30) and overlap_diff_limit (default 5).
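Downstream scripts may want to read the merged_xxx_yyy tag back out of the read names. A minimal Python sketch follows, assuming the tag is the last whitespace-separated field of the header line, as in the example above; the function name is hypothetical.

```python
import re

# Matches a trailing tag such as "merged_150_15".
_MERGED_TAG = re.compile(r"merged_(\d+)_(\d+)$")


def parse_merged_tag(header):
    """Return (bases_from_read1, bases_from_read2), or None if no tag is present."""
    match = _MERGED_TAG.search(header.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


header = "@NB551106:9:H5Y5GBGX2:1:22306:18653:13119 1:N:0:GATCAG merged_150_15"
print(parse_merged_tag(header))  # (150, 15)
```

Headers without the tag, for example unmerged reads redirected by --include_unmerged, yield None under this assumption.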
Thank you so much! Also for the super fast implementation, Awesome job! The new features and command line UI open new possibilities...
Hi, just found a small typo in the usage:
--unpaired2 for PE input, if read2 passed QC but read1 not, it will be written to unpaired2. If --unpaired2 is same as --umpaired1 (default mode), both unpaired reads will be written to this same file. (string [=])
-> ...same as --umpaired1 (default mode)...
I just saw that this is still open. I'll create a pull request with a typo fix and close this one. Again, thanks a lot for this!
Hi, thanks for recently adding the merging feature of PE reads #139 !
Would it be possible to implement a 3rd output parameter in order to separate the merged/unmerged reads in an intuitive way like assemblers (e.g. SPAdes) ask for them?
So, we could immediately end up with cleaned R1, R2 and merged read files.
For example:
Maybe worth a distinct issue but somehow related to this: What happens with either R1 or R2 reads resulting from successful filtering procedures but lacking a related mate? Could these also be written to a distinct file, like SE reads? Trimmomatic is able to behave that way and thus, those valid reads can be fed into downstream analyses.