OpenGene / fastp

An ultra-fast all-in-one FASTQ preprocessor (QC/adapters/trimming/filtering/splitting/merging...)
MIT License

PhiX spike-in #225

Open ignadb opened 4 years ago

ignadb commented 4 years ago

Hi,

Thanks for developing fastp! I was wondering whether it detects and removes PhiX spike-in reads by default?

Thanks in advance!

mschilli87 commented 4 years ago

@ignadb: How did you obtain the FASTQ files? If you use bcl2fastq to convert your raw Illumina data, PhiX reads should end up in the 'undetermined' fraction as they don't have a sample barcode used for demultiplexing. Thus, none of the actual samples should contain any PhiX. Or do you run the undetermined reads through fastp?

chatchai-kosawang commented 4 years ago

@mschilli87 Thanks for your comment. Perhaps I am overcautious, but I always check for PhiX contamination.

sfchen commented 4 years ago

That's right. Don't worry about PhiX reads, which are removed by bcl2fastq.

nvpatin commented 4 years ago

This is an important comment that should not be ignored. PhiX can end up in the pre-processed reads for a variety of reasons. It would be great if a PhiX decontamination feature were added.

davised commented 4 years ago

Removing PhiX is especially important for reads that are used in de novo assemblies, which is also when one will likely be using a trimming/QC tool like fastp.

I gather the design philosophy of fastp currently is "Set good defaults so users don't have to." Removing PhiX without users having to think about it is a good idea.

In a perfect world, bcl2fastq should remove all PhiX, but a small fraction of PhiX reads get assigned to samples. In my testing, it's usually between 0 and 100 reads per multiplexed sample, but I have had a few examples of several thousand reads mapping to PhiX. It's more of a 'better safe than sorry' situation.

I'll also stress that NCBI will not take assemblies that have detected PhiX contamination - those contigs/scaffolds that have PhiX must be removed prior to acceptance into the NCBI assembly database.

Edit - Also, in preps that use a single index (i7) rather than dual indexes (i5 + i7), PhiX contamination is much more of a problem, especially with low-cost sequencing like the SeqWell platform, which can use only i7 to reduce cost. I get thousands of PhiX reads per sample in that instance.
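
For anyone who wants a rough idea of how many PhiX reads ended up in a sample, here is a minimal sketch of a k-mer based count in plain Python. It is not how fastp or bcl2fastq work, and every name in it (`kmer_set`, `count_reference_hits`, the `k` and `min_hits` values) is a hypothetical choice for illustration. It assumes uncompressed four-line FASTQ records and a reference sequence already loaded as a string; in practice one would load the PhiX genome from a FASTA file and would more likely use an aligner or a dedicated k-mer filter.

```python
from typing import Iterable, Iterator, Set

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_set(reference: str, k: int) -> Set[str]:
    """Collect all k-mers from both strands of the reference."""
    ref = reference.upper()
    kmers: Set[str] = set()
    for strand in (ref, reverse_complement(ref)):
        for i in range(len(strand) - k + 1):
            kmers.add(strand[i:i + k])
    return kmers


def fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line of each record (assumes 4 lines per record)."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip().upper()


def count_reference_hits(
    lines: Iterable[str], ref_kmers: Set[str], k: int, min_hits: int = 2
) -> int:
    """Count reads that share at least `min_hits` k-mers with the reference.

    `min_hits` is an arbitrary illustrative threshold, not a validated one.
    """
    flagged = 0
    for seq in fastq_sequences(lines):
        hits = sum(
            1 for i in range(len(seq) - k + 1) if seq[i:i + k] in ref_kmers
        )
        if hits >= min_hits:
            flagged += 1
    return flagged
```

With a real file, this could be called as `count_reference_hits(open("sample.fastq"), kmer_set(phix_sequence, 31), 31)`, where `sample.fastq` and `phix_sequence` are placeholders. Exact k-mer matching will miss reads with sequencing errors in every k-mer window, so treat the count as a lower-bound estimate.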