Closed windtalker6 closed 4 years ago
I guesss I found the reason:
the umi_dedup seems to output only one read for a PE read pair:
after dedup, only the read 2 "V300056107L1C006R0130756722_TGTCACTTGTCACT/2" remains.
but how can I get rid of this ?
Looks like you're missing the --paired
argument. See umi_tools dedup --help
Looks like you're missing the
--paired
argument. Seeumi_tools dedup --help
no, I added the "--paired" argument, but got no difference.
umi_tools dedup -I test_add_suffix.bam \ --output-stats=deduplicated \ -S test_add_suffix.dedup.bam \ --method unique \ --buffer-whole-contig \ --paired
=========================================== $samtools view test_add_suffix.dedup.bam V300056107L1C006R0130756722_TGTCACTTGTCACT/2 83 chr1 10176 19 61M5D32M = 10016 -258 ACCCTAACCCTAACCCTAACACTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAAACCCTAAACCCTAAACCCTAACCCTAACCCT FFFFGFFFFFGFFFFFFGFFFFFFFF@GFFFFFFGFFFFFFGFGFFFFFFFFFFFFFFGFFFFFFGFFFFFFGFFFFFFFFFFFFFGFFFFFG NM:i:7 MD:Z:1A18C40^ACCCT32 MC:Z:93M AS:i:75 XS:i:66 RG:Z:20B2128957 XA:Z:chr5,-11726,68M5D19M6S,7;
Just to clarify, are you saying the output is identical regardless of whether you include the --paired
argument?
Just to clarify, are you saying the output is identical regardless of whether you include the
--paired
argument?
yes, as you can see in my code:
umi_tools dedup -I test_add_suffix.bam \ --output-stats=deduplicated \ -S test_add_suffix.dedup.bam \ --method directional \ --buffer-whole-contig \ --paired
and the corresponding result:
$samtools view test_add_suffix.dedup.bam V300056107L1C006R0130756722_TGTCACTTGTCACT/2 83 chr1 10176 19 61M5D32M = 10016 -258 ACCCTAACCCTAACCCTAACACTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAAACCCTAAACCCTAAACCCTAACCCTAACCCT FFFFGFFFFFGFFFFFFGFFFFFFFF@GFFFFFFGFFFFFFGFGFFFFFFFFFFFFFFGFFFFFFGFFFFFFGFFFFFFFFFFFFFGFFFFFG NM:i:7 MD:Z:1A18C40^ACCCT32 MC:Z:93M AS:i:75 XS:i:66 RG:Z:20B2128957 XA:Z:chr5,-11726,68M5D19M6S,7;
Ah, I misunderstood. The test file is literally just two reads!
Hmmm.. the mapping quality of read1 is zero but that shouldn't matter.
Would you mind attaching the test file here (with the headers) so I can have a quick play
Also, could you run the above with --log=log.txt
and post the log file here please
Ah, I misunderstood. The test file is literally just two reads!
Hmmm.. the mapping quality of read1 is zero but that shouldn't matter.
Would you mind attaching the test file here (with the headers) so I can have a quick play
thanks for your quick reply, but I can't upload a bam file. So I have to transfom them to txt files test_add_suffix.txt test.txt
you can transform them to bam by:
samtools view -bS test_add_suffix.txt >test_add_suffix.bam
I modified the mapping quality of read1
and got the same result, no read1 left!!!
BTW, I installed the umi_tools from the mater branch as you sugguested to avoid the error of "read1 and read2 do not match".
Hmmm.. so it appears the paired file is not found. See lines in bold. I'll look into this
2020-09-15 09:54:03,831 INFO command: dedup -I test_add_suffix.bam --output-stats=deduplicated -S test_add_suffix.dedup.bam --method directional --buffer-whole-contig --paired 2020-09-15 09:54:03,833 INFO total_umis 1 2020-09-15 09:54:03,833 INFO #umis 1 2020-09-15 09:54:03,837 INFO Searching for mates for 1 unmatched alignments 2020-09-15 09:54:03,838 INFO 1 mates never found 2020-09-15 09:54:03,855 INFO Reads: Input Reads: 1, Read pairs: 1 2020-09-15 09:54:03,855 INFO Number of reads out: 1 2020-09-15 09:54:03,855 INFO Total number of positions deduplicated: 1 2020-09-15 09:54:03,856 INFO Mean number of unique UMIs per position: 1.00 2020-09-15 09:54:03,856 INFO Max. number of unique UMIs per position: 1
unmatched apart from these 2 reads, all my data before dedup are like this.
So the issue is that the read names still have the trailing /1
or /2
. For UMI-tools to identify the paired read properly, the read names need to be identical. We previously were working through extraction of UMIs with read number suffixes in #431. How did you extract the UMIs in the end?
So the issue is that the read names still have the trailing
/1
or/2
. For UMI-tools to identify the paired read properly, the read names need to be identical. We previously were working through extraction of UMIs with read number suffixes in #431. How did you extract the UMIs in the end?
my command is: umi_tools dedup -I ./in.coordinate.sort.bam --output-stats=deduplicated -S ./out.marked_duplicates.bam
actually, my data after extract UMI does not include the "/1 or /2" surffix, like this:
the "/1 or /2" surffix is added mannually, only for testing
What aligner are you using here?
Every aligner I've ever used strips the read suffix (i.e. the /1
, /2
) off the read name in the bam file. The SAM format specification specifies that the names should be identical. Further, that read you quoted is named /2
, but a flag of 83 marks it as a read 1.
Okay. I see that in the rest of your file, the /1
and /2
are not present. In that screen shot, both read one (the read with the flag 99) and read 2 (the read with the flag 147) are present in the output.
my command is: umi_tools dedup -I ./in.coordinate.sort.bam --output-stats=deduplicated -S ./out.marked_duplicates.bam
This command does not have the --paired
switch enabled.
What aligner are you using here?
Every aligner I've ever used strips the read suffix (i.e. the
/1
,/2
) off the read name in the bam file. The SAM format specification specifies that the names should be identical. Further, that read you quoted is named/2
, but a flag of 83 marks it as a read 1.Okay. I see that in the rest of your file, the
/1
and/2
are not present. In that screen shot, both read one (the read with the flag 99) and read 2 (the read with the flag 147) are present in the output.
no ,that's not the output of dedup process, but the input file ,which contains read 1 and read 2
@windtalker6 - don't add the suffixes manually to the bam, re-run with --paired
and then compare with picard
@windtalker6 - don't add the suffixes manually to the bam, re-run with
--paired
and then compare with picard
thank you very much, please don't close this issue, for I will re-run with --paired
. and make a comparison with picard result.
but I guess it will discard more reads than picard markdup.
Don't worry, we are not going to close the issue.
Outputs both reads for me:
$ umi_tools dedup -I test.bam --paired --log test.log --buffer-whole-contig --method directional --output-stats=deduplicated | samtools view
V300056107L1C006R0130756722_TGTCACTTGTCACT 83 chr1 10176 19 61M5D32M = 10016 -258 ACCCTAACCCTAACCCTAACACTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAAACCCTAAACCCTAAACCCTAACCCTAACCCT FFFFGFFFFFGFFFFFFGFFFFFFFF@GFFFFFFGFFFFFFGFGFFFFFFFFFFFFFFGFFFFFFGFFFFFFGFFFFFFFFFFFFFGFFFFFG NM:i:7 MD:Z:1A18C40^ACCCT32 MC:Z:93M AS:i:75 XS:i:66 RG:Z:20B2128957 XA:Z:chr5,-11726,68M5D19M6S,7;
V300056107L1C006R0130756722_TGTCACTTGTCACT 163 chr1 10016 0 93M = 10176 258 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC GGGGGFGGGFGGFGGGGFGGGFGFGGGGGFGGGGGGFFGGGFGGGGGFGGHGGFGFGGGGGGGGFFGGGGGFGGGGFFGGGGFFGGHGGGFFG NM:i:0 MD:Z:93 MC:Z:61M5D32M AS:i:93 XS:i:93 RG:Z:20B2128957
No, it should discard fewer reads than Picard. By default picard doesn't use UMIs, so it thinks all reads at the same coordinates are the same read. UMI-tools will discard more than picard if picard is configured to use UMIs, because picard thinks UMIs must be identical to be deuplicates, where as UMI-tools thinks they only must be related.
So it depends how picard is run.
well, I do find umi_tools retained more reads than picard.
coordinate.sort.bam 38078276 #before_dedup marked_duplicates.adjacency.bam 35220427 #adjacency marked_duplicates.cluster.bam 35219515 #cluster marked_duplicates.directional.bam 35219616 #directional marked_duplicates.percentile.bam 35483003 #percentile marked_duplicates.unique.bam 35483001 #unique marked_duplicates.picard.bam 34350858 #picard
there are 5 options for the umi_tools dedup --method argument, and the corresponding result was slightly different. the default "directional" made the almost the least reads number after dedup. which is the best one? any sugguestions ?
Directional is the best in almost all situations. We provide the others because we wrote them in for bench-marking purposes. There are a few situations where the others might be optimal. In particular, directional is the most time consuming. It also doesn't perform well at very high UMI counts. But then all deduplication performs badly in that situation.
Directional is the best in almost all situations. We provide the others because we wrote them in for bench-marking purposes. There are a few situations where the others might be optimal. In particular, directional is the most time consuming. It also doesn't perform well at very high UMI counts. But then all deduplication performs badly in that situation.
Thanks a lot!
@windtalker6 - Are you OK if we close this issue now?
@windtalker6 - Are you OK if we close this issue now?
sorry, I forgot to remind you my problem had been solved,. yes, you can close this issue.
I used umi_tools to extract the umi information and mapped reads to the hg19.fa by bwa mem.
then I filtered out unmapped reads and secondary alignment reads and sort the file by coordinate. and then, I dedup it by:
umi_tools dedup -I in.bam --output-stats=deduplicated -S out.bam
however, I got a very low ratio of reads after the dedup process, generally no more than 40% left. which is shown as follows:
6_before_dedup.bam 17148324 6_after_dedup.bam 6665141 7_before_dedup.bam 38078276 7_after_dedup.bam 14323230 8_before_dedup.bam 36089420 8_after_dedup.bam 12068729 9_before_dedup.bam 20788314 9_after_dedup.bam 7352516 10_before_dedup.bam 38391988 10_after_dedup.bam 15553934 11_before_dedup.bam 35807601 11_after_dedup.bam 13999490 12_before_dedup.bam 67301666 12_after_dedup.bam 22083131 13_before_dedup.bam 45395970 13_after_dedup.bam 12007586 14_before_dedup.bam 38628043 14_after_dedup.bam 11971975 15_before_dedup.bam 47421802 15_after_dedup.bam 17929415
before_dedup.bam is the bam before dedup after_dedup.bam is the bam after dedup
the second column are reads number of each file.
then what's the problem?
I tried to dedup through picard:
================================================= java -Djava.io.tmpdir=./tmp -jar ./picard/picard.jar MarkDuplicates \ I=7_before_dedup.bam \ O=7_after_dedup.picard.bam \ M=7.marked_dup_metrics.txt \ VALIDATION_STRINGENCY=LENIENT \ REMOVE_DUPLICATES=true \ CREATE_INDEX=true
the 7_after_dedup.picard.bam has 34350858 reads, while 7_after_dedup.bam has only 14323230 reads.
why does umi_tools discarded so many reads? I