CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

Could I keep the I1&R1&R2.fastq.gz as their original format after the extraction? #526

Closed yuw444 closed 2 years ago

yuw444 commented 2 years ago

It seems that the format of the FASTQ files has changed after extraction. Is there any way to keep the original format?

Before: R1:

@ST-K00126:308:HFLYFBBXX:1:1101:31345:1261 1:N:0:NACCACCA
AACTCTTGTTCTGAACCGGTTGGGAT
+
AAFFFJJJJJJJJJJJJJJJJJJJJJ

R2:

@ST-K00126:308:HFLYFBBXX:1:1101:31345:1261 2:N:0:NACCACCA
CCTTTTTGGAACCAACAATAGCAGCTCCATTTCTGGAGTCTGGGTCTTCCGAGGCCAGGAGCTCGCCTTTCCGCCGAGCCCAGATTGGCAGGTGGACT
+
A<<AFAFFFJJJJFJJJFA<7<FJF-AAJF7-FFF<FA7AFFFJ-77<JJFFFJJJJFAJFJ7-7AJ-7-FJJJ--)7-77F-F--AAAJAA-7-7F7

After extraction:

R1:

@ST-K00126:308:HFLYFBBXX:1:1101:31345:1261_AACTCTTGTTCTGAAC_CGGTTGGGAT 1:N:0:NACCACCA

+

R2:

@ST-K00126:308:HFLYFBBXX:1:1101:31345:1261_AACTCTTGTTCTGAAC_CGGTTGGGAT 2:N:0:NACCACCA
CCTTTTTGGAACCAACAATAGCAGCTCCATTTCTGGAGTCTGGGTCTTCCGAGGCCAGGAGCTCGCCTTTCCGCCGAGCCCAGATTGGCAGGTGGACT
+
A<<AFAFFFJJJJFJJJFA<7<FJF-AAJF7-FFF<FA7AFFFJ-77<JJFFFJJJJFAJFJ7-7AJ-7-FJJJ--)7-77F-F--AAAJAA-7-7F7

I see that the CBs have been added to the header line of R2.filtered.fastq.gz after extraction, so I could manually put them back into the R1.filtered.fastq.gz file. However, the quality line has been deleted from R1.filtered.fastq.gz after extraction.

In my setup, I have to feed the filtered.fastq.gz files back into the cellranger pipeline.

Please help.

Thanks so much.

TomSmithCGAT commented 2 years ago

Hi @yuw444 - I'm a bit confused by that example. The two reads shown are not the same: 25834:1173 vs 31345:1261 in the FASTQ 'header' lines.

yuw444 commented 2 years ago

@TomSmithCGAT, sorry for the confusion. I have updated the question above.

The header format of R1 and R2 after extraction may be useful for the pipeline, but cellranger complains if I feed these files into it. So I was thinking of putting the filtered R1 and R2 back into the cellranger pipeline to keep my project consistent, since I used cellranger for the UMI counts.

IanSudbery commented 2 years ago

It is probably easiest not to put the sequence back into R1 at all: discard the processed R1 and rebuild it, with the sequence intact, from the unprocessed file.
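
A minimal sketch of that rebuild in Python, not part of UMI-tools itself. It assumes the extraction appended exactly `_<cell barcode>_<UMI>` to the read name, as in the example above, and that the original read names are otherwise unchanged. The function names and the file names in the comments are hypothetical. It collects the read names that survive in the filtered R2, then copies the matching records unmodified from the original R1.

```python
import gzip
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a text handle."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield tuple(record)


def original_read_id(header, n_fields=2):
    """Strip the fields that extraction appended to the read name.

    n_fields=2 assumes '_<cell barcode>_<UMI>' was appended (an assumption
    based on the example in this thread); use 1 if only a UMI was added.
    """
    read_id = header.split()[0]
    return read_id.rsplit("_", n_fields)[0]


def rebuild_from_original(original, filtered, out, n_fields=2):
    """Copy records from `original` whose read names survive in `filtered`.

    All three arguments are open text handles. Returns the number of
    records written. The kept read names are held in memory as a set.
    """
    kept = {original_read_id(rec[0], n_fields) for rec in read_fastq(filtered)}
    written = 0
    for record in read_fastq(original):
        if record[0].split()[0] in kept:
            out.write("\n".join(record) + "\n")
            written += 1
    return written


def rebuild_files(original_path, filtered_path, out_path, n_fields=2):
    """Same as rebuild_from_original, but for gzipped FASTQ files on disk."""
    with gzip.open(original_path, "rt") as original, \
         gzip.open(filtered_path, "rt") as filtered, \
         gzip.open(out_path, "wt") as out:
        return rebuild_from_original(original, filtered, out, n_fields)


# Example call (placeholder file names):
# rebuild_files("R1.fastq.gz", "R2.filtered.fastq.gz", "R1.rebuilt.fastq.gz")
```

Passing the original R2 as the first argument should rebuild an R2 with its original header lines in the same way, if cellranger also rejects the modified R2 headers.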

yuw444 commented 2 years ago

Thanks for your suggestion.