CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
491 stars 190 forks source link

too little reads left after umi_tools dedup! #435

Closed windtalker6 closed 4 years ago

windtalker6 commented 4 years ago

I used umi_tools to extract the umi information and mapped reads to the hg19.fa by bwa mem.

then I filtered out unmapped reads and secondary alignment reads and sort the file by coordinate. and then, I dedup it by:

umi_tools dedup -I in.bam --output-stats=deduplicated -S out.bam

however, I got a very low ratio of reads after the dedup process, generally no more than 40% left. which is shown as follows:

6_before_dedup.bam 17148324 6_after_dedup.bam 6665141 7_before_dedup.bam 38078276 7_after_dedup.bam 14323230 8_before_dedup.bam 36089420 8_after_dedup.bam 12068729 9_before_dedup.bam 20788314 9_after_dedup.bam 7352516 10_before_dedup.bam 38391988 10_after_dedup.bam 15553934 11_before_dedup.bam 35807601 11_after_dedup.bam 13999490 12_before_dedup.bam 67301666 12_after_dedup.bam 22083131 13_before_dedup.bam 45395970 13_after_dedup.bam 12007586 14_before_dedup.bam 38628043 14_after_dedup.bam 11971975 15_before_dedup.bam 47421802 15_after_dedup.bam 17929415

before_dedup.bam is the bam before dedup after_dedup.bam is the bam after dedup

the second column are reads number of each file.

then what's the problem?

I tried to dedup through picard:

================================================= java -Djava.io.tmpdir=./tmp -jar ./picard/picard.jar MarkDuplicates \ I=7_before_dedup.bam \ O=7_after_dedup.picard.bam \ M=7.marked_dup_metrics.txt \ VALIDATION_STRINGENCY=LENIENT \ REMOVE_DUPLICATES=true \ CREATE_INDEX=true

the 7_after_dedup.picard.bam has 34350858 reads, while 7_after_dedup.bam has only 14323230 reads.

why does umi_tools discarded so many reads? I

windtalker6 commented 4 years ago

I guesss I found the reason:

the umi_dedup seems to output only one read for a PE read pair:

===================================== $samtools view test_add_suffix.bam V300056107L1C006R0130756722_TGTCACTTGTCACT/1 163 chr1 10016 0 93M = 10176 258 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC GGGGGFGGGFGGFGGGGFGGGFGFGGGGGFGGGGGGFFGGGFGGGGGFGGHGGFGFGGGGGGGGFFGGGGGFGGGGFFGGGGFFGGHGGGFFG NM:i:0 MD:Z:93 MC:Z:61M5D32M AS:i:93 XS:i:93 RG:Z:20B2128957 V300056107L1C006R0130756722_TGTCACTTGTCACT/2 83 chr1 10176 19 61M5D32M = 10016 -258 ACCCTAACCCTAACCCTAACACTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAAACCCTAAACCCTAAACCCTAACCCTAACCCT FFFFGFFFFFGFFFFFFGFFFFFFFF@GFFFFFFGFFFFFFGFGFFFFFFFFFFFFFFGFFFFFFGFFFFFFGFFFFFFFFFFFFFGFFFFFG NM:i:7 MD:Z:1A18C40^ACCCT32 MC:Z:93M AS:i:75 XS:i:66 RG:Z:20B2128957 XA:Z:chr5,-11726,68M5D19M6S,7;

after dedup, only the read 2 "V300056107L1C006R0130756722_TGTCACTTGTCACT/2" remains.

but how can I get rid of this ?

TomSmithCGAT commented 4 years ago

Looks like you're missing the --paired argument. See umi_tools dedup --help

windtalker6 commented 4 years ago

Looks like you're missing the --paired argument. See umi_tools dedup --help

no, I added the "--paired" argument, but got no difference.

umi_tools dedup -I test_add_suffix.bam \ --output-stats=deduplicated \ -S test_add_suffix.dedup.bam \ --method unique \ --buffer-whole-contig \ --paired

=========================================== $samtools view test_add_suffix.dedup.bam V300056107L1C006R0130756722_TGTCACTTGTCACT/2 83 chr1 10176 19 61M5D32M = 10016 -258 ACCCTAACCCTAACCCTAACACTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAAACCCTAAACCCTAAACCCTAACCCTAACCCT FFFFGFFFFFGFFFFFFGFFFFFFFF@GFFFFFFGFFFFFFGFGFFFFFFFFFFFFFFGFFFFFFGFFFFFFGFFFFFFFFFFFFFGFFFFFG NM:i:7 MD:Z:1A18C40^ACCCT32 MC:Z:93M AS:i:75 XS:i:66 RG:Z:20B2128957 XA:Z:chr5,-11726,68M5D19M6S,7;

TomSmithCGAT commented 4 years ago

Just to clarify, are you saying the output is identical regardless of whether you include the --paired argument?

windtalker6 commented 4 years ago

Just to clarify, are you saying the output is identical regardless of whether you include the --paired argument?

yes, as you can see in my code:

umi_tools dedup -I test_add_suffix.bam \ --output-stats=deduplicated \ -S test_add_suffix.dedup.bam \ --method directional \ --buffer-whole-contig \ --paired

and the corresponding result:

$samtools view test_add_suffix.dedup.bam V300056107L1C006R0130756722_TGTCACTTGTCACT/2 83 chr1 10176 19 61M5D32M = 10016 -258 ACCCTAACCCTAACCCTAACACTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAAACCCTAAACCCTAAACCCTAACCCTAACCCT FFFFGFFFFFGFFFFFFGFFFFFFFF@GFFFFFFGFFFFFFGFGFFFFFFFFFFFFFFGFFFFFFGFFFFFFGFFFFFFFFFFFFFGFFFFFG NM:i:7 MD:Z:1A18C40^ACCCT32 MC:Z:93M AS:i:75 XS:i:66 RG:Z:20B2128957 XA:Z:chr5,-11726,68M5D19M6S,7;

TomSmithCGAT commented 4 years ago

Ah, I misunderstood. The test file is literally just two reads!

Hmmm.. the mapping quality of read1 is zero but that shouldn't matter.

Would you mind attaching the test file here (with the headers) so I can have a quick play

TomSmithCGAT commented 4 years ago

Also, could you run the above with --log=log.txt and post the log file here please

windtalker6 commented 4 years ago

Ah, I misunderstood. The test file is literally just two reads!

Hmmm.. the mapping quality of read1 is zero but that shouldn't matter.

Would you mind attaching the test file here (with the headers) so I can have a quick play

thanks for your quick reply, but I can't upload a bam file. So I have to transfom them to txt files test_add_suffix.txt test.txt

you can transform them to bam by:

samtools view -bS test_add_suffix.txt >test_add_suffix.bam

windtalker6 commented 4 years ago

I modified the mapping quality of read1

and got the same result, no read1 left!!!

test_add_suffix.txt log.txt

BTW, I installed the umi_tools from the mater branch as you sugguested to avoid the error of "read1 and read2 do not match".

TomSmithCGAT commented 4 years ago

Hmmm.. so it appears the paired file is not found. See lines in bold. I'll look into this

2020-09-15 09:54:03,831 INFO command: dedup -I test_add_suffix.bam --output-stats=deduplicated -S test_add_suffix.dedup.bam --method directional --buffer-whole-contig --paired 2020-09-15 09:54:03,833 INFO total_umis 1 2020-09-15 09:54:03,833 INFO #umis 1 2020-09-15 09:54:03,837 INFO Searching for mates for 1 unmatched alignments 2020-09-15 09:54:03,838 INFO 1 mates never found 2020-09-15 09:54:03,855 INFO Reads: Input Reads: 1, Read pairs: 1 2020-09-15 09:54:03,855 INFO Number of reads out: 1 2020-09-15 09:54:03,855 INFO Total number of positions deduplicated: 1 2020-09-15 09:54:03,856 INFO Mean number of unique UMIs per position: 1.00 2020-09-15 09:54:03,856 INFO Max. number of unique UMIs per position: 1

windtalker6 commented 4 years ago

unmatched apart from these 2 reads, all my data before dedup are like this. bam_before_dedup

TomSmithCGAT commented 4 years ago

So the issue is that the read names still have the trailing /1 or /2. For UMI-tools to identify the paired read properly, the read names need to be identical. We previously were working through extraction of UMIs with read number suffixes in #431. How did you extract the UMIs in the end?

windtalker6 commented 4 years ago

So the issue is that the read names still have the trailing /1 or /2. For UMI-tools to identify the paired read properly, the read names need to be identical. We previously were working through extraction of UMIs with read number suffixes in #431. How did you extract the UMIs in the end?

my command is: umi_tools dedup -I ./in.coordinate.sort.bam --output-stats=deduplicated -S ./out.marked_duplicates.bam

actually, my data after extract UMI does not include the "/1 or /2" surffix, like this: read_befor_dedup2

the "/1 or /2" surffix is added mannually, only for testing

IanSudbery commented 4 years ago

What aligner are you using here?

Every aligner I've ever used strips the read suffix (i.e. the /1, /2) off the read name in the bam file. The SAM format specification specifies that the names should be identical. Further, that read you quoted is named /2, but a flag of 83 marks it as a read 1.

Okay. I see that in the rest of your file, the /1 and /2 are not present. In that screen shot, both read one (the read with the flag 99) and read 2 (the read with the flag 147) are present in the output.

IanSudbery commented 4 years ago

my command is: umi_tools dedup -I ./in.coordinate.sort.bam --output-stats=deduplicated -S ./out.marked_duplicates.bam

This command does not have the --paired switch enabled.

windtalker6 commented 4 years ago

What aligner are you using here?

Every aligner I've ever used strips the read suffix (i.e. the /1, /2) off the read name in the bam file. The SAM format specification specifies that the names should be identical. Further, that read you quoted is named /2, but a flag of 83 marks it as a read 1.

Okay. I see that in the rest of your file, the /1 and /2 are not present. In that screen shot, both read one (the read with the flag 99) and read 2 (the read with the flag 147) are present in the output.

no ,that's not the output of dedup process, but the input file ,which contains read 1 and read 2

TomSmithCGAT commented 4 years ago

@windtalker6 - don't add the suffixes manually to the bam, re-run with --paired and then compare with picard

windtalker6 commented 4 years ago

@windtalker6 - don't add the suffixes manually to the bam, re-run with --paired and then compare with picard

thank you very much, please don't close this issue, for I will re-run with --paired . and make a comparison with picard result.

but I guess it will discard more reads than picard markdup.

IanSudbery commented 4 years ago

Don't worry, we are not going to close the issue.

Outputs both reads for me:

$ umi_tools dedup -I test.bam --paired --log test.log --buffer-whole-contig --method directional --output-stats=deduplicated | samtools view
V300056107L1C006R0130756722_TGTCACTTGTCACT      83      chr1    10176   19      61M5D32M        =       10016   -258   ACCCTAACCCTAACCCTAACACTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAAACCCTAAACCCTAAACCCTAACCCTAACCCT    FFFFGFFFFFGFFFFFFGFFFFFFFF@GFFFFFFGFFFFFFGFGFFFFFFFFFFFFFFGFFFFFFGFFFFFFGFFFFFFFFFFFFFGFFFFFG   NM:i:7  MD:Z:1A18C40^ACCCT32    MC:Z:93M       AS:i:75  XS:i:66 RG:Z:20B2128957 XA:Z:chr5,-11726,68M5D19M6S,7;
V300056107L1C006R0130756722_TGTCACTTGTCACT      163     chr1    10016   0       93M     =       10176   258     CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC   GGGGGFGGGFGGFGGGGFGGGFGFGGGGGFGGGGGGFFGGGFGGGGGFGGHGGFGFGGGGGGGGFFGGGGGFGGGGFFGGGGFFGGHGGGFFG   NM:i:0  MD:Z:93 MC:Z:61M5D32M   AS:i:93 XS:i:93 RG:Z:20B2128957
IanSudbery commented 4 years ago

No, it should discard fewer reads than Picard. By default picard doesn't use UMIs, so it thinks all reads at the same coordinates are the same read. UMI-tools will discard more than picard if picard is configured to use UMIs, because picard thinks UMIs must be identical to be deuplicates, where as UMI-tools thinks they only must be related.

So it depends how picard is run.

windtalker6 commented 4 years ago

well, I do find umi_tools retained more reads than picard.

coordinate.sort.bam 38078276 #before_dedup marked_duplicates.adjacency.bam 35220427 #adjacency marked_duplicates.cluster.bam 35219515 #cluster marked_duplicates.directional.bam 35219616 #directional marked_duplicates.percentile.bam 35483003 #percentile marked_duplicates.unique.bam 35483001 #unique marked_duplicates.picard.bam 34350858 #picard

there are 5 options for the umi_tools dedup --method argument, and the corresponding result was slightly different. the default "directional" made the almost the least reads number after dedup. which is the best one? any sugguestions ?

IanSudbery commented 4 years ago

Directional is the best in almost all situations. We provide the others because we wrote them in for bench-marking purposes. There are a few situations where the others might be optimal. In particular, directional is the most time consuming. It also doesn't perform well at very high UMI counts. But then all deduplication performs badly in that situation.

windtalker6 commented 4 years ago

Directional is the best in almost all situations. We provide the others because we wrote them in for bench-marking purposes. There are a few situations where the others might be optimal. In particular, directional is the most time consuming. It also doesn't perform well at very high UMI counts. But then all deduplication performs badly in that situation.

Thanks a lot!

TomSmithCGAT commented 4 years ago

@windtalker6 - Are you OK if we close this issue now?

windtalker6 commented 4 years ago

@windtalker6 - Are you OK if we close this issue now?

sorry, I forgot to remind you my problem had been solved,. yes, you can close this issue.