Daniel-Liu-c0deb0t / UMICollapse

Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.
MIT License

fastq file command #9

Closed abracarambar closed 3 years ago

abracarambar commented 3 years ago

Hi Daniel, I am trying to run UMICollapse on a fastq file in which the UMI codes have already been extracted from the reads and incorporated in the read headers after a hash sign:

@CL100155237L2C001R001_12#ACAGAGGGTTC/1
AGTTGAACTCACCACGTGTCGGTA
+
DCCDDBCDDEDEDCDBE:DDBCBB
@CL100155237L2C001R001_43#AAGTCTGTGTG/1
CAATGTGATTTCTGCCCAGTG
+
DBCCBECDBBDDDCDDECBBD
@CL100155237L2C001R001_44#AGCCCATCGTC/1
TTGCCAAGGATGTTTTCATTAAT
+
CACEEDBBD@CCCCDCDCA@DDC

I have run the following command and it does collapse duplicates but ignores the barcodes in the read headers:

umicollapse fastq -i ${FASTQ_FILE} -o ${DEDUPED_FASTQ_FILE} --umi-sep "#" --tag

Arguments [fastq, -i, ${FASTQ_FILE} -o, /${DEDUPED_FASTQ_FILE}, --umi-sep, #, --tag]
Done reading input file into memory!
Done with the first pass for tracking clusters!
Number of input reads 44385266
Number of unique reads 4384633
Number of groups of reads 3587489
UMI collapsing finished in 618.248 seconds!

Any help would be greatly appreciated. M

Daniel-Liu-c0deb0t commented 3 years ago

Ok so UMICollapse was designed to work similarly to UMI-tools, where you would align to some reference first and then collapse reads at the same alignment coordinate. The fastq mode was a shortcut to not calculate the alignment and instead directly collapse based on the entire read sequence (including the UMI). Your use case of collapsing reads from a fastq directly based on the UMI in the header without alignment is not currently supported (although it shouldn't be too difficult of a feature to add; an alternative is to convert the fastq into a bam file and collapse that). Is there a specific reason why you are doing things this way? Usually, the number of duplicates would be overestimated if only the UMI is taken into account without the rest of the read sequence.

abracarambar commented 3 years ago

Thanks Daniel. I have just been provided with these fastq.gz files. The libraries are sRNAs and have been sequenced by BGI. As we are interested in plant host and virus sequences, we actually do not have available genomes to map to. It has been confirmed to me that there is no fragmentation step after UMI incorporation, so duplicate reads sharing a UMI should be of the same length. I am not suggesting relying on the UMIs alone, but using both the UMI currently in the header and the read sequence. Is this possible? Or do you recommend I modify the fastq files so the UMIs are back at the 5' end of the reads so your tool can work?

abracarambar commented 3 years ago

I should clarify: we have genomes available, but we need to do a de novo assembly, then extract the contigs that map to virus genomes, and then map reads back to these. So down the track I think we will also be able to use your tool with bam files. I just wanted to have a preliminary look at the fastq files to get a feel for the level of duplication we are getting.

Daniel-Liu-c0deb0t commented 3 years ago

Okay, I see. If you just want to get a preliminary look at the duplication level, you should concatenate the read sequence with the UMI and collapse those modified reads with fastq mode. If this doesn't work for some reason, let me know and I'll see about implementing a flag to concatenate the read sequence and the UMI.
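The workaround suggested above (moving the UMI from the header back onto the read before running fastq mode) could be scripted roughly as follows. This is only a minimal sketch, not part of UMICollapse: the function name `prepend_umi` is hypothetical, the header layout `@name#UMI/1` is taken from the example records in this thread, and the placeholder quality character `D` for the UMI bases is an arbitrary assumption, since the original UMI qualities are not available in the header.

```python
from typing import Iterable, Iterator, List


def prepend_umi(fastq_lines: Iterable[str], sep: str = "#",
                placeholder_qual: str = "D") -> Iterator[str]:
    """Move a UMI stored in the FASTQ header to the 5' end of the read.

    Assumes headers shaped like '@name#UMI/1' (as in this thread) and
    plain 4-line FASTQ records. Both are assumptions, not a general parser.
    """
    record: List[str] = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) < 4:
            continue
        header, seq, plus, qual = record
        record = []
        # Split '@name#UMI/1' into '@name' and 'UMI/1', then drop the '/1'.
        name, _, rest = header.partition(sep)
        umi = rest.split("/", 1)[0]
        yield name
        yield umi + seq
        yield plus
        # The UMI has no recorded qualities here, so pad with a placeholder.
        yield placeholder_qual * len(umi) + qual


if __name__ == "__main__":
    example = [
        "@CL100155237L2C001R001_35#TCTCCCCTTGG/1",
        "TTCTTACCTATGCCACCCATTCCTT",
        "+",
        "CCDD@CDDCC@CDDDDEDDCBDDC?",
    ]
    print("\n".join(prepend_umi(example)))
```

Records are split by line rather than by the `@` character because `@` is also a valid quality symbol, as the example qualities show. A header without the separator is passed through unchanged.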

abracarambar commented 3 years ago

Thanks Daniel, so I managed to get some outputs. Can I ask what the difference is between these 2 reads:

@CL100155237L2C001R001_477 cluster_id=4684868 cluster_size=1 same_umi=1
TTTGGATTGAAGGGAGCTCTC
+
D;BDB69AC<=CA8BBDAECD
@CL100155237L2C001R001_497 cluster_id=3483863
TCGGACCAGGCTTCATTCCCC
+
EDDCCDED@CECDECBCDEED

Was @CL100155237L2C001R001_497 not assigned to any cluster, and is it therefore unique? And was @CL100155237L2C001R001_477 assigned to a cluster in which there is only one read? Thank you. M

Daniel-Liu-c0deb0t commented 3 years ago

Both reads are assigned to different clusters. That's why they both have a cluster_id, which is a unique ID that all reads of the same cluster share. Only the consensus read of each cluster has the size of the cluster in the header. Note that the consensus read is the only read of a cluster that is kept when collapsing without the --tag option. So yes, your first read is part of a cluster (ID = 4684868) with only one read. This is a unique read, because it is part of a cluster of size = 1. Your second read is one of many reads in the cluster with ID = 3483863. It is not the consensus read. If you search the cluster ID in the fastq file, you will get all the other reads that are part of the same cluster. If you see a read with the cluster_size tag, you know it is the consensus read. For most analysis cases, you can probably ignore same_umi. Sorry about the confusion and let me know if you have any other questions.
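To make the tag semantics described above concrete, here is a small sketch that reads `--tag` style headers, groups read names by `cluster_id`, and records which read carries `cluster_size` (the consensus read). The helper names `parse_tags` and `summarize_clusters` are hypothetical, and the space-separated `key=value` header layout is assumed from the examples in this thread rather than from a format specification.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def parse_tags(header: str) -> Tuple[str, Dict[str, int]]:
    """Split '@name key=value ...' into the read name and its integer tags."""
    fields = header.lstrip("@").split()
    name = fields[0]
    tags: Dict[str, int] = {}
    for field in fields[1:]:
        key, sep, value = field.partition("=")
        if sep and value.isdigit():
            tags[key] = int(value)
    return name, tags


def summarize_clusters(headers: Iterable[str]) -> Dict[int, dict]:
    """Group read names by cluster_id and note each cluster's consensus read."""
    clusters: Dict[int, dict] = defaultdict(
        lambda: {"reads": [], "consensus": None, "size": None}
    )
    for header in headers:
        name, tags = parse_tags(header)
        if "cluster_id" not in tags:
            continue
        entry = clusters[tags["cluster_id"]]
        entry["reads"].append(name)
        # Only the consensus read of a cluster carries cluster_size.
        if "cluster_size" in tags:
            entry["consensus"] = name
            entry["size"] = tags["cluster_size"]
    return dict(clusters)


if __name__ == "__main__":
    headers = [
        "@CL100155237L2C001R001_477 cluster_id=4684868 cluster_size=1 same_umi=1",
        "@CL100155237L2C001R001_497 cluster_id=3483863",
    ]
    for cluster_id, info in summarize_clusters(headers).items():
        print(cluster_id, info)
```

With only these two headers as input, cluster 3483863 has no consensus recorded, because its consensus read (the one carrying `cluster_size`) sits elsewhere in the file.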

Daniel-Liu-c0deb0t commented 3 years ago

I edited the readme to explain this a bit more, as it is confusing.

abracarambar commented 3 years ago

Thanks Daniel, I think I am doing something wrong then, as my umicollapse output has the same number of reads as my starting fastq file. Here is what I tried to do. Original fastq file:

@CL100155237L2C001R001_35#TCTCCCCTTGG/1
TTCTTACCTATGCCACCCATTCCTT
+
CCDD@CDDCC@CDDDDEDDCBDDC?
@CL100155237L2C001R001_53#CGGCGGGGTAA/1
ATAGAGTAGTGGTAACGAGGTCGA
+
BAC>ADBBCCCCCCBDC?BDADD8
@CL100155237L2C001R001_84#ACCGGATGGGG/1
TCCCTACTCCACCCATGCCATA
+
DEDEDCECEDCEDE?CCEEBCB
@CL100155237L2C001R001_85#ATAGAGACTGC/1
TCCCTACTCCACCCATGCCATA
+
CEDECCECEECEDEBCCEDDCC

Fake fastq file with the UMI incorporated back at the start and fake qual values (D strings) also added:

@CL100155237L2C001R001_35
TCTCCCCTTGGTTCTTACCTATGCCACCCATTCCTT
+
DDDDDDDDDDDCCDD@CDDCC@CDDDDEDDCBDDC?
@CL100155237L2C001R001_53
CGGCGGGGTAAATAGAGTAGTGGTAACGAGGTCGA
+
DDDDDDDDDDDBAC>ADBBCCCCCCBDC?BDADD8
@CL100155237L2C001R001_84
ACCGGATGGGGTCCCTACTCCACCCATGCCATA
+
DDDDDDDDDDDDEDEDCECEDCEDE?CCEEBCB
@CL100155237L2C001R001_85
ATAGAGACTGCTCCCTACTCCACCCATGCCATA
+
DDDDDDDDDDDCEDECCECEECEDEBCCEDDCC

I then ran the command umicollapse within a PBS script:

fastq -i $PBS_O_WORKDIR/${SAMPLE}.raw.fq -o $PBS_O_WORKDIR/${SAMPLE}_dedup_umi.fq -u 11 --tag '#'

I get this log:

Working directory is /work/hia_mt18005/diagnostics/2020/BGI_sRNA
Arguments [fastq, -i, MT004_BGI_sRNA.raw.fq, -o, MT004_BGI_sRNA_dedup_umi.fq, -u, 11, --tag, #]
Done reading input file into memory!
Done with the first pass for tracking clusters!
Number of input reads 49994547
Number of unique reads 33545828
Number of groups of reads 25208001
UMI collapsing finished in 4284.754 seconds!
PBS Job 8900315.pbs
CPU time : 01:57:03
Wall time : 01:11:28
Mem usage : 40975172kb

But I get no tags in the header and the resulting fastq file contains all the starting reads

@CL100155237L2C001R001_35 cluster_id=23319273 cluster_size=1 same_umi=1
TTCTTACCTATGCCACCCATTCCTT
+
CCDD@CDDCC@CDDDDEDDCBDDC?
@CL100155237L2C001R001_53 cluster_id=17356567 cluster_size=1 same_umi=1
ATAGAGTAGTGGTAACGAGGTCGA
+
BAC>ADBBCCCCCCBDC?BDADD8
@CL100155237L2C001R001_84 cluster_id=9979574
TCCCTACTCCACCCATGCCATA
+
DEDEDCECEDCEDE?CCEEBCB
@CL100155237L2C001R001_85 cluster_id=9968428
TCCCTACTCCACCCATGCCATA
+
CEDECCECEECEDEBCCEDDCC

Daniel-Liu-c0deb0t commented 3 years ago

Hm, what do you mean when you say that there are no tags in the headers? The cluster_id, cluster_size, and same_umi strings are the tags. They aren't in the SAM/BAM tags format, though. With the --tag option, no reads are removed. Instead, those tag strings are added to the end of the FASTQ headers. Btw, --tag does not take any arguments (like the '#' you have). Based on how you are running the code and the inputs/outputs, UMICollapse seems to be working fine, but you seem to be expecting something else. What kind of output are you looking for?

abracarambar commented 3 years ago

Oh I see sorry, I thought it would insert the UMI in the header, my mistake.

abracarambar commented 3 years ago

It all makes sense now, thanks. So I ran: grep cluster_id=3483863 MT004_BGI_sRNA_dedup_umi.fq

Cluster 3483863 has 305 reads. The read that was selected as the consensus shares its UMI with 29 other sequences, and within this cluster there are other reads that share a UMI with others, which is specified in their headers. One more question: what is the difference between same_umi=1 and no same_umi tag at all? Does same_umi=1 mean the read shares its UMI with one other sequence?

@CL100155237L2C001R001_497 cluster_id=3483863 @CL100155237L2C001R005_471 cluster_id=3483863 @CL100155237L2C001R012_203631 cluster_id=3483863 @CL100155237L2C001R017_548336 cluster_id=3483863 @CL100155237L2C001R028_300077 cluster_id=3483863 @CL100155237L2C001R035_434617 cluster_id=3483863 @CL100155237L2C001R037_250079 cluster_id=3483863 same_umi=2 @CL100155237L2C001R045_24446 cluster_id=3483863 same_umi=1 @CL100155237L2C001R050_346532 cluster_id=3483863 @CL100155237L2C001R053_125054 cluster_id=3483863 @CL100155237L2C001R066_290650 cluster_id=3483863 @CL100155237L2C001R066_556844 cluster_id=3483863 @CL100155237L2C001R067_351018 cluster_id=3483863 @CL100155237L2C001R076_344553 cluster_id=3483863 @CL100155237L2C002R007_263988 cluster_id=3483863 @CL100155237L2C002R012_38140 cluster_id=3483863 @CL100155237L2C002R015_404627 cluster_id=3483863 @CL100155237L2C002R034_216767 cluster_id=3483863 @CL100155237L2C002R035_295550 cluster_id=3483863 @CL100155237L2C002R037_187084 cluster_id=3483863 @CL100155237L2C002R042_448211 cluster_id=3483863 @CL100155237L2C002R048_522434 cluster_id=3483863 @CL100155237L2C002R061_191747 cluster_id=3483863 same_umi=1 @CL100155237L2C002R063_437815 cluster_id=3483863 @CL100155237L2C002R070_118131 cluster_id=3483863 @CL100155237L2C002R074_334566 cluster_id=3483863 @CL100155237L2C002R077_91061 cluster_id=3483863 @CL100155237L2C002R077_423217 cluster_id=3483863 @CL100155237L2C002R081_311844 cluster_id=3483863 @CL100155237L2C003R009_153788 cluster_id=3483863 @CL100155237L2C003R012_321064 cluster_id=3483863 @CL100155237L2C003R022_244293 cluster_id=3483863 @CL100155237L2C003R023_445170 cluster_id=3483863 @CL100155237L2C003R025_205816 cluster_id=3483863 @CL100155237L2C003R025_230819 cluster_id=3483863 @CL100155237L2C003R029_232227 cluster_id=3483863 @CL100155237L2C003R032_515531 cluster_id=3483863 @CL100155237L2C003R033_235421 cluster_id=3483863 @CL100155237L2C003R033_503529 cluster_id=3483863 @CL100155237L2C003R034_238587 cluster_id=3483863 
@CL100155237L2C003R035_39407 cluster_id=3483863 @CL100155237L2C003R038_135686 cluster_id=3483863 @CL100155237L2C003R039_139871 cluster_id=3483863 @CL100155237L2C003R043_329097 cluster_id=3483863 @CL100155237L2C003R045_286290 cluster_id=3483863 @CL100155237L2C003R054_256353 cluster_id=3483863 @CL100155237L2C003R055_221503 cluster_id=3483863 @CL100155237L2C003R061_229195 cluster_id=3483863 @CL100155237L2C003R061_282702 cluster_id=3483863 @CL100155237L2C003R064_149419 cluster_id=3483863 @CL100155237L2C003R066_104932 cluster_id=3483863 @CL100155237L2C003R066_330033 cluster_id=3483863 same_umi=2 @CL100155237L2C003R070_296490 cluster_id=3483863 @CL100155237L2C003R072_107101 cluster_id=3483863 @CL100155237L2C003R073_53375 cluster_id=3483863 @CL100155237L2C003R078_248123 cluster_id=3483863 @CL100155237L2C003R085_525227 cluster_id=3483863 @CL100155237L2C003R096_352940 cluster_id=3483863 same_umi=3 @CL100155237L2C004R004_427536 cluster_id=3483863 @CL100155237L2C004R005_21232 cluster_id=3483863 @CL100155237L2C004R005_356024 cluster_id=3483863 @CL100155237L2C004R011_456598 cluster_id=3483863 @CL100155237L2C004R016_568079 cluster_id=3483863 @CL100155237L2C004R020_225528 cluster_id=3483863 @CL100155237L2C004R026_476344 cluster_id=3483863 @CL100155237L2C004R031_9617 cluster_id=3483863 @CL100155237L2C004R048_74345 cluster_id=3483863 @CL100155237L2C004R065_279379 cluster_id=3483863 @CL100155237L2C004R069_226695 cluster_id=3483863 @CL100155237L2C004R071_249982 cluster_id=3483863 @CL100155237L2C004R087_542435 cluster_id=3483863 @CL100155237L2C004R089_343971 cluster_id=3483863 @CL100155237L2C004R094_349055 cluster_id=3483863 @CL100155237L2C005R013_421610 cluster_id=3483863 @CL100155237L2C005R019_417735 cluster_id=3483863 @CL100155237L2C005R021_482392 cluster_id=3483863 @CL100155237L2C005R027_525205 cluster_id=3483863 @CL100155237L2C005R035_407855 cluster_id=3483863 @CL100155237L2C005R037_182654 cluster_id=3483863 @CL100155237L2C005R038_363171 cluster_id=3483863 
@CL100155237L2C005R039_14452 cluster_id=3483863 @CL100155237L2C005R049_359630 cluster_id=3483863 @CL100155237L2C005R053_186719 cluster_id=3483863 @CL100155237L2C005R057_82782 cluster_id=3483863 @CL100155237L2C005R061_127219 cluster_id=3483863 @CL100155237L2C005R073_11365 cluster_id=3483863 @CL100155237L2C005R090_460445 cluster_id=3483863 same_umi=2 @CL100155237L2C005R091_82318 cluster_id=3483863 @CL100155237L2C005R091_276513 cluster_id=3483863 @CL100155237L2C006R006_210866 cluster_id=3483863 @CL100155237L2C006R022_529461 cluster_id=3483863 @CL100155237L2C006R025_402990 cluster_id=3483863 @CL100155237L2C006R037_83921 cluster_id=3483863 @CL100155237L2C006R040_408508 cluster_id=3483863 @CL100155237L2C006R041_320154 cluster_id=3483863 @CL100155237L2C006R047_83799 cluster_id=3483863 @CL100155237L2C006R053_52016 cluster_id=3483863 @CL100155237L2C006R053_211453 cluster_id=3483863 @CL100155237L2C006R057_502551 cluster_id=3483863 @CL100155237L2C006R058_516773 cluster_id=3483863 @CL100155237L2C006R064_429409 cluster_id=3483863 same_umi=3 @CL100155237L2C006R065_288785 cluster_id=3483863 @CL100155237L2C006R070_465888 cluster_id=3483863 @CL100155237L2C006R091_436278 cluster_id=3483863 @CL100155237L2C006R094_185315 cluster_id=3483863 @CL100155237L2C006R094_288297 cluster_id=3483863 same_umi=2 @CL100155237L2C007R010_410532 cluster_id=3483863 @CL100155237L2C007R015_163687 cluster_id=3483863 @CL100155237L2C007R023_395408 cluster_id=3483863 @CL100155237L2C007R024_16103 cluster_id=3483863 @CL100155237L2C007R024_134041 cluster_id=3483863 @CL100155237L2C007R027_229869 cluster_id=3483863 @CL100155237L2C007R029_271840 cluster_id=3483863 @CL100155237L2C007R033_358438 cluster_id=3483863 @CL100155237L2C007R040_35242 cluster_id=3483863 @CL100155237L2C007R043_45398 cluster_id=3483863 @CL100155237L2C007R053_147180 cluster_id=3483863 @CL100155237L2C007R054_429052 cluster_id=3483863 @CL100155237L2C007R055_254472 cluster_id=3483863 @CL100155237L2C007R057_29859 cluster_id=3483863 
@CL100155237L2C007R062_93698 cluster_id=3483863 @CL100155237L2C007R069_425982 cluster_id=3483863 @CL100155237L2C007R072_209796 cluster_id=3483863 @CL100155237L2C007R089_118098 cluster_id=3483863 @CL100155237L2C007R091_560145 cluster_id=3483863 @CL100155237L2C008R006_172542 cluster_id=3483863 @CL100155237L2C008R014_676 cluster_id=3483863 @CL100155237L2C008R025_398622 cluster_id=3483863 @CL100155237L2C008R030_126457 cluster_id=3483863 @CL100155237L2C008R033_38748 cluster_id=3483863 @CL100155237L2C008R038_105695 cluster_id=3483863 @CL100155237L2C008R051_247169 cluster_id=3483863 @CL100155237L2C008R054_366689 cluster_id=3483863 @CL100155237L2C008R056_288224 cluster_id=3483863 @CL100155237L2C008R067_236897 cluster_id=3483863 @CL100155237L2C008R067_511697 cluster_id=3483863 @CL100155237L2C008R069_41729 cluster_id=3483863 @CL100155237L2C008R069_482679 cluster_id=3483863 @CL100155237L2C008R075_541459 cluster_id=3483863 @CL100155237L2C008R079_1983 cluster_id=3483863 @CL100155237L2C008R082_240825 cluster_id=3483863 @CL100155237L2C008R085_47509 cluster_id=3483863 same_umi=2 @CL100155237L2C008R088_200777 cluster_id=3483863 @CL100155237L2C009R002_323256 cluster_id=3483863 @CL100155237L2C009R005_305143 cluster_id=3483863 @CL100155237L2C009R012_503889 cluster_id=3483863 @CL100155237L2C009R020_522038 cluster_id=3483863 @CL100155237L2C009R025_9833 cluster_id=3483863 @CL100155237L2C009R031_438870 cluster_id=3483863 @CL100155237L2C009R041_230028 cluster_id=3483863 @CL100155237L2C009R050_331799 cluster_id=3483863 @CL100155237L2C009R053_83186 cluster_id=3483863 @CL100155237L2C009R058_563254 cluster_id=3483863 @CL100155237L2C009R073_143781 cluster_id=3483863 @CL100155237L2C009R084_394661 cluster_id=3483863 @CL100155237L2C009R085_95940 cluster_id=3483863 @CL100155237L2C009R090_363955 cluster_id=3483863 same_umi=5 @CL100155237L2C009R096_255714 cluster_id=3483863 @CL100155237L2C010R009_178256 cluster_id=3483863 @CL100155237L2C010R011_164620 cluster_id=3483863 same_umi=1 
@CL100155237L2C010R015_213535 cluster_id=3483863 @CL100155237L2C010R027_304638 cluster_id=3483863 same_umi=3 @CL100155237L2C010R036_141644 cluster_id=3483863 same_umi=3 @CL100155237L2C010R037_40543 cluster_id=3483863 @CL100155237L2C010R038_135021 cluster_id=3483863 @CL100155237L2C010R043_373344 cluster_id=3483863 @CL100155237L2C010R047_464351 cluster_id=3483863 @CL100155237L2C010R052_143580 cluster_id=3483863 @CL100155237L2C010R058_210308 cluster_id=3483863 same_umi=3 @CL100155237L2C010R059_349155 cluster_id=3483863 @CL100155237L2C010R061_465598 cluster_id=3483863 @CL100155237L2C010R071_175455 cluster_id=3483863 @CL100155237L2C010R083_264488 cluster_id=3483863 same_umi=4 @CL100155237L2C010R090_224979 cluster_id=3483863 @CL100155237L2C010R093_261535 cluster_id=3483863 same_umi=4 @CL100155237L2C011R010_480123 cluster_id=3483863 @CL100155237L2C011R011_35084 cluster_id=3483863 @CL100155237L2C011R017_397206 cluster_id=3483863 @CL100155237L2C011R021_392714 cluster_id=3483863 @CL100155237L2C011R029_119897 cluster_id=3483863 @CL100155237L2C011R034_328738 cluster_id=3483863 @CL100155237L2C011R034_350908 cluster_id=3483863 @CL100155237L2C011R036_111982 cluster_id=3483863 @CL100155237L2C011R039_166153 cluster_id=3483863 @CL100155237L2C011R039_532705 cluster_id=3483863 @CL100155237L2C011R042_320360 cluster_id=3483863 @CL100155237L2C011R046_530897 cluster_id=3483863 @CL100155237L2C011R065_502128 cluster_id=3483863 same_umi=1 @CL100155237L2C011R084_218767 cluster_id=3483863 @CL100155237L2C011R088_292568 cluster_id=3483863 @CL100155237L2C011R091_137075 cluster_id=3483863 same_umi=2 @CL100155237L2C011R093_32777 cluster_id=3483863 same_umi=4 @CL100155237L2C011R096_237850 cluster_id=3483863 @CL100155237L2C012R005_84561 cluster_id=3483863 @CL100155237L2C012R010_435815 cluster_id=3483863 same_umi=1 @CL100155237L2C012R011_398585 cluster_id=3483863 @CL100155237L2C012R012_374009 cluster_id=3483863 same_umi=2 @CL100155237L2C012R016_367641 cluster_id=3483863 @CL100155237L2C012R016_546312 
cluster_id=3483863 @CL100155237L2C012R018_284310 cluster_id=3483863 @CL100155237L2C012R030_325079 cluster_id=3483863 @CL100155237L2C012R035_263432 cluster_id=3483863 same_umi=3 @CL100155237L2C012R038_546574 cluster_id=3483863 @CL100155237L2C012R040_367581 cluster_id=3483863 @CL100155237L2C012R041_338474 cluster_id=3483863 @CL100155237L2C012R044_376651 cluster_id=3483863 @CL100155237L2C012R049_62450 cluster_id=3483863 @CL100155237L2C012R053_133345 cluster_id=3483863 same_umi=2 @CL100155237L2C012R068_399935 cluster_id=3483863 @CL100155237L2C012R070_168524 cluster_id=3483863 same_umi=4 @CL100155237L2C012R072_455775 cluster_id=3483863 @CL100155237L2C012R073_59095 cluster_id=3483863 same_umi=3 @CL100155237L2C012R086_395361 cluster_id=3483863 @CL100155237L2C012R092_191346 cluster_id=3483863 same_umi=14 @CL100155237L2C012R093_229412 cluster_id=3483863 @CL100155237L2C013R014_199630 cluster_id=3483863 same_umi=2 @CL100155237L2C013R028_452432 cluster_id=3483863 @CL100155237L2C013R041_389432 cluster_id=3483863 @CL100155237L2C013R044_69287 cluster_id=3483863 same_umi=2 @CL100155237L2C013R044_217572 cluster_id=3483863 @CL100155237L2C013R048_286165 cluster_id=3483863 @CL100155237L2C013R050_67373 cluster_id=3483863 @CL100155237L2C013R051_158421 cluster_id=3483863 @CL100155237L2C013R053_402351 cluster_id=3483863 @CL100155237L2C013R054_539622 cluster_id=3483863 @CL100155237L2C013R057_196696 cluster_id=3483863 @CL100155237L2C013R065_146300 cluster_id=3483863 same_umi=2 @CL100155237L2C013R076_293904 cluster_id=3483863 same_umi=3 @CL100155237L2C013R083_516786 cluster_id=3483863 @CL100155237L2C013R092_555467 cluster_id=3483863 @CL100155237L2C014R003_567631 cluster_id=3483863 @CL100155237L2C014R005_171406 cluster_id=3483863 @CL100155237L2C014R009_360861 cluster_id=3483863 @CL100155237L2C014R011_284740 cluster_id=3483863 @CL100155237L2C014R015_367294 cluster_id=3483863 @CL100155237L2C014R018_248146 cluster_id=3483863 @CL100155237L2C014R020_452598 cluster_id=3483863 same_umi=3 
@CL100155237L2C014R028_488474 cluster_id=3483863 same_umi=2 @CL100155237L2C014R030_197292 cluster_id=3483863 @CL100155237L2C014R037_160427 cluster_id=3483863 @CL100155237L2C014R038_200950 cluster_id=3483863 @CL100155237L2C014R048_451443 cluster_id=3483863 @CL100155237L2C014R051_551035 cluster_id=3483863 same_umi=13 @CL100155237L2C014R054_295466 cluster_id=3483863 @CL100155237L2C014R074_540023 cluster_id=3483863 @CL100155237L2C014R078_260233 cluster_id=3483863 @CL100155237L2C014R081_487892 cluster_id=3483863 same_umi=8 @CL100155237L2C014R086_49041 cluster_id=3483863 same_umi=6 @CL100155237L2C014R088_492029 cluster_id=3483863 @CL100155237L2C014R091_11888 cluster_id=3483863 same_umi=1 @CL100155237L2C015R001_91025 cluster_id=3483863 same_umi=8 @CL100155237L2C015R024_114582 cluster_id=3483863 @CL100155237L2C015R028_139862 cluster_id=3483863 @CL100155237L2C015R030_276975 cluster_id=3483863 same_umi=2 @CL100155237L2C015R034_551844 cluster_id=3483863 @CL100155237L2C015R042_317038 cluster_id=3483863 @CL100155237L2C015R046_98888 cluster_id=3483863 @CL100155237L2C015R046_529784 cluster_id=3483863 same_umi=4 @CL100155237L2C015R054_496225 cluster_id=3483863 @CL100155237L2C015R055_193615 cluster_id=3483863 @CL100155237L2C015R056_439576 cluster_id=3483863 @CL100155237L2C015R059_84502 cluster_id=3483863 same_umi=7 @CL100155237L2C015R059_230293 cluster_id=3483863 @CL100155237L2C015R063_81463 cluster_id=3483863 @CL100155237L2C015R064_112161 cluster_id=3483863 same_umi=4 @CL100155237L2C015R066_495517 cluster_id=3483863 same_umi=15 @CL100155237L2C015R075_91240 cluster_id=3483863 same_umi=5 @CL100155237L2C015R081_17262 cluster_id=3483863 same_umi=5 @CL100155237L2C015R086_122004 cluster_id=3483863 @CL100155237L2C015R091_524780 cluster_id=3483863 same_umi=5 @CL100155237L2C015R093_294080 cluster_id=3483863 @CL100155237L2C015R094_126769 cluster_id=3483863 @CL100155237L2C015R096_210550 cluster_id=3483863 same_umi=3 @CL100155237L2C016R001_213142 cluster_id=3483863 
@CL100155237L2C016R002_199914 cluster_id=3483863 same_umi=1 @CL100155237L2C016R010_398063 cluster_id=3483863 same_umi=5 @CL100155237L2C016R011_100050 cluster_id=3483863 same_umi=10 @CL100155237L2C016R011_145852 cluster_id=3483863 @CL100155237L2C016R015_459646 cluster_id=3483863 same_umi=3 @CL100155237L2C016R020_299119 cluster_id=3483863 @CL100155237L2C016R024_560681 cluster_id=3483863 @CL100155237L2C016R034_508047 cluster_id=3483863 same_umi=1 @CL100155237L2C016R040_59767 cluster_id=3483863 same_umi=6 @CL100155237L2C016R040_179074 cluster_id=3483863 @CL100155237L2C016R049_127350 cluster_id=3483863 same_umi=13 @CL100155237L2C016R050_218931 cluster_id=3483863 same_umi=8 @CL100155237L2C016R058_412449 cluster_id=3483863 @CL100155237L2C016R058_506204 cluster_id=3483863 same_umi=6 @CL100155237L2C016R066_47115 cluster_id=3483863 same_umi=3 @CL100155237L2C016R069_100848 cluster_id=3483863 @CL100155237L2C016R078_80756 cluster_id=3483863 same_umi=5 @CL100155237L2C016R079_525767 cluster_id=3483863 cluster_size=305 same_umi=29 @CL100155237L2C016R084_59703 cluster_id=3483863 same_umi=5 @CL100155237L2C016R088_224116 cluster_id=3483863 same_umi=7 @CL100155237L2C016R094_14825 cluster_id=3483863 @CL100155237L2C017R002_148694 cluster_id=3483863 @CL100155237L2C017R006_424363 cluster_id=3483863 same_umi=2 @CL100155237L2C017R018_452002 cluster_id=3483863 same_umi=3 @CL100155237L2C017R020_37275 cluster_id=3483863 @CL100155237L2C017R023_195100 cluster_id=3483863 same_umi=2 @CL100155237L2C017R031_154694 cluster_id=3483863 same_umi=4 @CL100155237L2C017R074_244331 cluster_id=3483863 @CL100155237L2C017R076_291951 cluster_id=3483863 same_umi=14 @CL100155237L2C017R079_354472 cluster_id=3483863 @CL100155237L2C017R086_353167 cluster_id=3483863 same_umi=1

Daniel-Liu-c0deb0t commented 3 years ago

Yeah so the main reason why you would put the UMI into the read headers is so the UMI does not get in the way during alignment. But the purpose of fastq mode is mainly to avoid doing alignment while collapsing reads. That's why UMICollapse doesn't support putting the UMI in the header in fastq mode.

For your question about same_umi: The UMI grouping process has two steps:

  1. collapse UMIs directly based on identity (exact match, so not allowing any mismatches)
  2. group together collapsed UMIs from step 1 that are similar (allowing mismatches)

Technically, a consensus read is chosen during step 1 for each set of reads with the exact same UMI, and then a consensus read is chosen out of all the consensus reads in each cluster/group of UMIs in step 2. The consensus reads from step 1 are labeled with same_umi, indicating the number of reads with the exact same UMI, while the consensus reads from step 2 are labeled with the cluster size. same_umi=1 without the cluster_size tag would indicate that a UMI is unique, but it is still part of a cluster (the UMI is similar to other reads in the cluster). The reason for only labelling consensus reads in each step is to minimize the amount of extra text added to the headers.
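The two steps described above can be illustrated with a toy sketch. It is not UMICollapse's implementation: the function names `hamming` and `two_step_grouping` are hypothetical, the sketch links any two UMIs within `k` mismatches into one group through a naive all-pairs comparison, and the real tool may use a different grouping rule and far more efficient data structures. It also ignores read sequences and consensus selection entirely.

```python
from collections import Counter
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length UMIs."""
    if len(a) != len(b):
        raise ValueError("UMIs must have the same length")
    return sum(x != y for x, y in zip(a, b))


def two_step_grouping(umis: List[str], k: int = 1) -> List[Dict[str, int]]:
    """Toy version of the two-step grouping.

    Step 1: collapse identical UMIs and count them (the same_umi idea).
    Step 2: link unique UMIs that differ by at most k mismatches and
            return each connected group as {umi: count}.
    """
    counts = Counter(umis)   # step 1: exact-match collapsing
    unique = list(counts)    # one entry per distinct UMI

    seen = set()
    groups: List[Dict[str, int]] = []
    for start in unique:     # step 2: naive O(n^2) neighbour search
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        group: Dict[str, int] = {}
        while stack:
            umi = stack.pop()
            group[umi] = counts[umi]
            for other in unique:
                if other not in seen and hamming(umi, other) <= k:
                    seen.add(other)
                    stack.append(other)
        groups.append(group)
    return groups


if __name__ == "__main__":
    reads = ["AAAA", "AAAA", "AAAT", "CCCC"]
    groups = two_step_grouping(reads, k=1)
    print("unique UMIs:", sum(len(g) for g in groups))
    print("groups:", len(groups))
    print(groups)
```

In this toy input there are four reads, three distinct UMIs after step 1, and two groups after step 2. In the first group, the sum of the per-UMI counts plays the role of `cluster_size`, while each individual count plays the role of `same_umi`.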

abracarambar commented 3 years ago

Ok, thanks, so during the second step, this read was incorporated into a cluster of reads but it did not share an exact UMI with these. What level of mismatch are you allowing, btw? Regarding the final output:

Number of unique reads 41616256 => step 1 you refer to?
Number of groups of reads 31574834 => step 2 you refer to?

Do these counts correspond to the two steps you describe?

Daniel-Liu-c0deb0t commented 3 years ago

The number of mismatches is specified with -k and it defaults to 1. Yes, you are correct about the counts.

abracarambar commented 3 years ago

Thanks so much Daniel Liu for taking the time to reply to all my questions. I really appreciate it.

Daniel-Liu-c0deb0t commented 3 years ago

No problem. I'm always happy to help. I'll close this issue for now. Let me know if you have any other questions/issues.