CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

Does concatenating fastq files cause any issues #565

Closed by Joseph-Waldron 2 years ago

Joseph-Waldron commented 2 years ago

Hi there

I have just started using this tool and am finding it very useful so thanks very much.

I just wanted to check one thing before I proceed. I have sequenced my samples across 3 different runs, and I am concatenating the resulting fastq files with a command like the following:

$ cat sample1a.fastq sample1b.fastq sample1c.fastq > sample1.fastq

Will this concatenated fastq file be OK as input to this pipeline, or will there be an issue with the header lines being slightly different depending on which run the reads came from? If so, is there a solution?

Many thanks in advance, Joe

TomSmithCGAT commented 2 years ago

No problem with that. It's common to merge samples from e.g. multiple lanes of sequencing, in which case the fastq header lines differ in the part indicating the lane the read originated from. Ultimately, the fastq header line information (machine, lane, flowcell coordinates) is a unique identifier for the read, and none of the downstream tools require any part of it to be the same across reads (at least the ones I've used to date).
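If you want to confirm for your own files that the read names stay unique after merging, a quick check along these lines should do it. This is only a sketch: the three toy reads and their header fields are made up for illustration, and it assumes standard 4-line fastq records (no wrapped sequence lines).

```shell
# Hypothetical toy inputs: one read per run, headers differing in the run/flowcell fields.
tmpdir=$(mktemp -d)
printf '@M1:1:FC1:1:1101:1000:2000 1:N:0:1\nACGT\n+\nIIII\n' > "$tmpdir/sample1a.fastq"
printf '@M1:2:FC2:1:1101:1000:2000 1:N:0:1\nTTGA\n+\nIIII\n' > "$tmpdir/sample1b.fastq"
printf '@M1:3:FC3:1:1101:1000:2000 1:N:0:1\nGGCA\n+\nIIII\n' > "$tmpdir/sample1c.fastq"

# Merge the runs exactly as in the question.
cat "$tmpdir/sample1a.fastq" "$tmpdir/sample1b.fastq" "$tmpdir/sample1c.fastq" > "$tmpdir/sample1.fastq"

# Header lines are lines 1, 5, 9, ... of the file.
# Keep the read name (text before the first space) and list any name seen more than once.
duplicates=$(awk 'NR % 4 == 1 { print $1 }' "$tmpdir/sample1.fastq" | sort | uniq -d)

# Count the reads in the merged file.
total=$(awk 'NR % 4 == 1 { n++ } END { print n + 0 }' "$tmpdir/sample1.fastq")

if [ -z "$duplicates" ]; then
    echo "all $total read names are unique"
else
    echo "duplicated read names:"
    echo "$duplicates"
fi
```

An empty `duplicates` list means every read name in the merged file occurs exactly once.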

You should also be able to use process substitution, e.g. umi_tools extract --stdin=<(cat sample1a.fastq sample1b.fastq sample1c.fastq), to avoid creating the intermediate concatenated fastq file.
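To illustrate that the two approaches hand the same data to the consumer, here is a small sketch. It needs bash (process substitution is not available in plain POSIX sh), the toy fastq records are invented, and `count_lines` is a hypothetical stand-in for any tool that takes an input file path, not part of umi_tools.

```shell
#!/usr/bin/env bash
# Hypothetical toy inputs, one 4-line fastq record per run.
tmpdir=$(mktemp -d)
printf '@M1:1:FC1:1:1101:1000:2000 1:N:0:1\nACGT\n+\nIIII\n' > "$tmpdir/sample1a.fastq"
printf '@M1:2:FC2:1:1101:1000:2000 1:N:0:1\nTTGA\n+\nIIII\n' > "$tmpdir/sample1b.fastq"
printf '@M1:3:FC3:1:1101:1000:2000 1:N:0:1\nGGCA\n+\nIIII\n' > "$tmpdir/sample1c.fastq"

# Stand-in for a tool that reads from a file path given as an argument.
count_lines() {
    awk 'END { print NR }' "$1"
}

# Option 1: write an intermediate concatenated file, then read it.
cat "$tmpdir/sample1a.fastq" "$tmpdir/sample1b.fastq" "$tmpdir/sample1c.fastq" > "$tmpdir/sample1.fastq"
lines_from_file=$(count_lines "$tmpdir/sample1.fastq")

# Option 2: process substitution. bash replaces <(...) with a path such as
# /dev/fd/63 that streams the output of cat, so nothing is written to disk.
lines_from_procsub=$(count_lines <(cat "$tmpdir/sample1a.fastq" "$tmpdir/sample1b.fastq" "$tmpdir/sample1c.fastq"))

# Byte-for-byte comparison of the intermediate file against the stream.
if cmp -s "$tmpdir/sample1.fastq" <(cat "$tmpdir/sample1a.fastq" "$tmpdir/sample1b.fastq" "$tmpdir/sample1c.fastq"); then
    streams_match=yes
else
    streams_match=no
fi

echo "file: $lines_from_file lines, process substitution: $lines_from_procsub lines, identical: $streams_match"
```

One thing to keep in mind is that the stream from `<(...)` can only be read once, front to back, so this only suits tools that read their input sequentially.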

Joseph-Waldron commented 2 years ago

OK, that's great, and a very useful suggestion. Thanks for getting back to me so soon.