dstreett / Super-Deduper

An application to remove PCR duplicates from high throughput sequencing runs.

Many duplicates (unexpected) #40

Open adomingues opened 8 years ago

adomingues commented 8 years ago

Hi David, I am trying super-deduper on some very, very large datasets (hundreds of millions of reads). The reads have a random barcode (unique molecular identifier) at the 5' end, and my goal is to collapse reads that have the same barcode+sequence to get rid of PCR duplicates pre-mapping.

I thought I could do it with:

super_deduper -U data/NET-seq_HeLa_S3_Rep1.fastq.gz --output-prefix results/processed_reads/NET-seq_HeLa_S3_Rep1.de_ --gzip-output --length 6

which was impressively fast and used very little memory - great job! However, the output has relatively few reads:

Final:| reads: 175226879 | duplicates: 174702121 | reads_written: 524758 | percent: 99.70 | discarded: 76297 | total_seconds: 757.36 | reads/sec: 231365.37

Is this an expected outcome, or am I not using the correct settings? That is assuming I can even use super-deduper for this particular task.

Cheers

msettles commented 8 years ago

A. Domingues,

Can you provide a little more information about your dataset? It looks like single-end reads (the -U). Is it RNA-seq? Exactly what are your molecular barcodes, how long are they, and where are they located?

Matt

From: "A. Domingues" notifications@github.com Reply-To: dstreett/Super-Deduper reply@reply.github.com Date: Tuesday, July 26, 2016 at 8:34 AM To: dstreett/Super-Deduper Super-Deduper@noreply.github.com Subject: [dstreett/Super-Deduper] Many duplicates (unexpected) (#40)

Hi David, I am trying super-deduper in some very very large datasets (like hundreds of millions reads). The reads have a random barcodes (unique molecule identifiers) in 5' and my goal is to collapse those reads that have the same barcode+sequence to get rid of PCR duplicates pre-mapping.

I thought I could do it with:

super_deduper -U data/NET-seq_HeLa_S3_Rep1.fastq.gz --output-prefix results/processed_reads/NET-seq_HeLa_S3Rep1.de --gzip-output --length 6 which was impressively fast and used very little memory - great job! However the output has relatively few reads:

Final:| reads: 175226879 | duplicates: 174702121 | reads_written: 524758 | percent: 99.70 | discarded: 76297 | total_seconds: 757.36 | reads/sec: 231365.37

Is this an expected outcome or am I not used the correct settings. Assuming I can even use super-deduper for this particular task.

Cheers

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

adomingues commented 8 years ago

Hi Matt,

Sorry, I should have been more specific. The data is from this GEO entry, and is single-end, 75bp. It is RNA-seq, albeit for a specific sub-nuclear fraction (NET-Seq). The barcodes are the six 5'-end nucleotides - "mixed random hexameric sequences" according to the methods in GEO.

I hope this helps.

António

msettles commented 8 years ago

Thanks,

So Super Deduper wasn't built explicitly to deal with this kind of library (it is meant for shotgun data with 'normal' libraries, not those that include 5' barcodes), but I think it should work nonetheless and is worth a try. I'm not totally sure what I'd expect, so I'll be interested to see what you get.

A couple of points

Deduplication of single-end reads is difficult, in that you only have one end of the fragment to measure, so reads with the same start point but different end points will appear to be duplicates when they actually are not. This is why I never recommend SE and always use PE. In this dataset you have the inclusion of a random hexamer on each fragment to help in the PCR duplicate call, so it will be interesting to see how Super Deduper does here. Please report back; I'm going to ask you to run two settings and compare them.

Super Deduper uses a key sequence within the read, and duplicate keys are assumed to be PCR duplicates. Your 6bp molecular barcode alone, however, is not enough to determine a duplicate (there are only 4096 possible combinations and you have many, many more reads). What you'll want to do is use the 5' molecular barcode plus some portion of the read to call a duplicate. The command you ran actually bypasses the barcode completely, as the key's start location falls after it.
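
To make the key idea concrete (this is not Super Deduper's implementation, just a rough sketch with standard unix tools, assuming plain 4-line FASTQ records and simply keeping the first read seen for each key; the output name is a placeholder, and a real run differs in details such as which copy of a duplicate is kept):

# one read per distinct key, where the key is bases start..start+len-1 of the sequence
# (here the 6bp barcode plus 15bp of read, i.e. start=1 len=21)
zcat data/NET-seq_HeLa_S3_Rep1.fastq.gz \
  | paste - - - - \
  | awk -F'\t' -v start=1 -v len=21 '!seen[substr($2, start, len)]++' \
  | tr '\t' '\n' \
  | gzip > dedup_sketch.fastq.gz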

So can you please run

super_deduper -U data/NET-seq_HeLa_S3_Rep1.fastq.gz --output-prefix results/processed_reads/NET-seq_HeLa_S3_Rep1.de_ --gzip-output --start 7 --length 15

and

super_deduper -U data/NET-seq_HeLa_S3_Rep1.fastq.gz --output-prefix results/processed_reads/NET-seq_HeLa_S3_Rep1.de_ --gzip-output --start 1 --length 21

The result of the second one is the one you're interested in. Could you run these two and paste the output statistics?

Matt

From: "A. Domingues" notifications@github.com Reply-To: dstreett/Super-Deduper reply@reply.github.com Date: Tuesday, July 26, 2016 at 1:43 PM To: dstreett/Super-Deduper Super-Deduper@noreply.github.com Cc: Matt Settles mattsettles@gmail.com, Comment comment@noreply.github.com Subject: Re: [dstreett/Super-Deduper] Many duplicates (unexpected) (#40)

Hi Matt,

sorry I should have been more specific. The data is from this GEO, and is single, 75bp. It is RNA-seq, albeit for a specific sub-nuclear fraction (NET-Seq). The barcodes are six 5’ end nucleotides - "mixed random hexameric sequences" according to the methods in GEO.

I hope this helps.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

adomingues commented 8 years ago

Thank you for the quick reply, Matt.

Super Deduper uses a key sequence within the read, and duplicate keys are assumed to be PCR duplicates. Your 6bp molecular barcode alone, however, is not enough to determine a duplicate (there are only 4096 possible combinations and you have many, many more reads).

Of course! I missed that even though I read the super-deduper paper :/

In this dataset you have the inclusion of a random hexamer on each fragment to help in the PCR duplicate call, so it will be interesting to see how Super Deduper does here.

In our lab we routinely use 2x4bp barcodes, one at the 5' end and another at the 3' end, for small RNA-seq experiments in which we expect low sequence diversity and want to distinguish that from true PCR duplicates. For these libraries, the deduplication is done simply (and naively) by collapsing all reads that have the exact same sequence. This usually works well with just unix tools, but this dataset is so large that I had to try something more efficient. That said, if super-deduper works well for this dataset, I will probably include it in my small RNA-seq pipeline.
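
For context, that naive collapse looks roughly like this (a sketch only, assuming plain 4-line FASTQ; the file names are placeholders, and note that the sort reorders the reads):

# keep one read per distinct full sequence (field 2 after paste is the sequence line)
zcat reads.fastq.gz \
  | paste - - - - \
  | sort -t $'\t' -k2,2 -u \
  | tr '\t' '\n' \
  | gzip > reads.collapsed.fastq.gz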

And to be clear, this is not my data or experimental design. This has been previously published by another lab and I am just trying to re-analyse it.


Now for the test results:

LSBATCH: User input /home/adomingu/imb-kettinggr/common_bin/Super-Deduper/super_deduper -U data/NET-seq_HeLa_S3_Rep1.fastq.gz --output-prefix results/processed_reads/NET-seq_HeLa_S3_Rep1.de_7-15 --gzip-output --start 7 --length 15

Resource usage summary:

CPU time :               10855.24 sec.
Max Memory :             15186.68 MB
Average Memory :         14514.08 MB
Total Requested Memory : 25600.00 MB
Delta Memory :           10413.32 MB
(Delta: the difference between total requested memory and actual max usage.)
Max Swap :               15229 MB

Max Processes :          4
Max Threads :            5

The output (if any) follows:

Final:| reads: 175224149 | duplicates: 147473141 | reads_written: 27751008 | percent: 84.16 | discarded: 79027 | total_seconds: 10787.34 | reads/sec: 16243.50

LSBATCH: User input /home/adomingu/imb-kettinggr/common_bin/Super-Deduper/super_deduper -U data/NET-seq_HeLa_S3_Rep1.fastq.gz --output-prefix results/processed_reads/NET-seq_HeLa_S3_Rep1.de_1-21 --gzip-output --start 1 --length 21

Resource usage summary:

CPU time :               14958.32 sec.
Max Memory :             21254.66 MB
Average Memory :         20434.38 MB
Total Requested Memory : 25600.00 MB
Delta Memory :           4345.34 MB
(Delta: the difference between total requested memory and actual max usage.)
Max Swap :               21297 MB

Max Processes :          4
Max Threads :            5

The output (if any) follows:

Final:| reads: 175221640 | duplicates: 135016167 | reads_written: 40205473 | percent: 77.05 | discarded: 81536 | total_seconds: 14864.68 | reads/sec: 11787.78

I have also posted the memory usage in case you are curious about it. It seems like extending the size of the region used to determine duplication, and including the barcode, did increase the number of reads retained to a more acceptable number, but the effect is not as striking as I anticipated. I am running another test with --start 1 --length 26. The idea is to find a portion of the read large enough to detect unique sequences.

António

msettles commented 8 years ago

So you basically got an additional 13M reads by including the 6bp molecular barcode; I'd say that is quite a bit. When you run --start 1 --length 26, I'd expect a relatively minor change in new reads (as a guess I'd say << 1M); the 15bp key should be sufficient to capture all the complexity in a sample, but with SE reads it's hard to tell, and the addition of PE makes a huge difference in the information available to detect PCR dups. I have found it not uncommon to have a 70% duplicate rate, especially on RNA-seq samples, particularly when there is a lot of rRNA left in the sample, so maybe this is just severely over-sequenced.

Please let me know what you find. Thanks, Matt

From: "A. Domingues" notifications@github.com Reply-To: dstreett/Super-Deduper reply@reply.github.com Date: Thursday, July 28, 2016 at 7:02 AM To: dstreett/Super-Deduper Super-Deduper@noreply.github.com Cc: Matt Settles mattsettles@gmail.com, Comment comment@noreply.github.com Subject: Re: [dstreett/Super-Deduper] Many duplicates (unexpected) (#40)

Thank you for the quick reply Matt,

super deduper uses a key sequence within the read and duplicate keys are expected to be PCR duplicates, your 6bp molecular barcode alone however is not enough to determine a duplicate (there are only 4096 possible combinations and you have many, many more reads).

Of course! I missed that even though I read the super-deduper paper :/

In this dataset you have the inclusion of a random hexamer on each fragment to help in the PCR dup call, it will be interesting to see how

In our lab we routinely use 2x4bp barcodes, one at the 5' end and another at the 3', for small RNAseq experiments in which we expect small sequence diversity and want to distinguish that from true PCR duplicates. For these libraries, the deduplication is then done simply (and naively) by collapsing all reads that have the exact same sequence. This usually works well, just using unix tools, but this dataset is so large that I had to try something more efficient. That said, if super-deduper works well for this data-set, I will probably include it in my small RNAseq pipeline.

And to be clear, this is not my data or experimental design. This has been previously published by another lab and I am just trying to re-analyse it.

Now for the test results:

LSBATCH: User input /home/adomingu/imb-kettinggr/common_bin/Super-Deduper/super_deduper -U data/NET-seq_HeLa_S3_Rep1.fastq.gz --output-prefix results/processed_reads/NET-seq_HeLa_S3_Rep1.de_7-15 --gzip-output --start 7 --length 15

Resource usage summary: CPU time :               10855.24 sec. Max Memory :             15186.68 MB Average Memory :         14514.08 MB Total Requested Memory : 25600.00 MB Delta Memory :           10413.32 MB (Delta: the difference between total requested memory and actual max usage.) Max Swap :               15229 MB

Max Processes :          4 Max Threads :            5 The output (if any) follows:

Final:| reads: 175224149 | duplicates: 147473141 | reads_written: 27751008 | percent: 84.16 | discarded: 79027 | total_seconds: 10787.34 | reads/sec: 16243.50

LSBATCH: User input /home/adomingu/imb-kettinggr/common_bin/Super-Deduper/super_deduper -U data/NET-seq_HeLa_S3_Rep1.fastq.gz --output-prefix results/processed_reads/NET-seq_HeLa_S3_Rep1.de_1-21 --gzip-output --start 1 --length 21

Resource usage summary: CPU time :               14958.32 sec. Max Memory :             21254.66 MB Average Memory :         20434.38 MB Total Requested Memory : 25600.00 MB Delta Memory :           4345.34 MB (Delta: the difference between total requested memory and actual max usage.) Max Swap :               21297 MB

Max Processes :          4 Max Threads :            5 The output (if any) follows:

Final:| reads: 175221640 | duplicates: 135016167 | reads_written: 40205473 | percent: 77.05 | discarded: 81536 | total_seconds: 14864.68 | reads/sec: 11787.78

I have also posted the memory usage in case you are curious about it. It seems like extending the size of the region used to determine duplication, and including the barcode, did increase the number of reads retained to a more acceptable number, but the effect is not has striking as I anticipated. I am running another test with --start 1 --length 26. The idea is to find a large enough portion of the read to detect unique sequences.

António

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

adomingues commented 8 years ago

When you run --start 1 --length 26, I'd expect a relatively minor change in new reads (as a guess I'd say << 1M)

The results just arrived :)

Final:| reads: 175216974 | duplicates: 121300940 | reads_written: 53916034 | percent: 69.23 | discarded: 86202 | total_seconds: 19046.97 | reads/sec: 9199.20

So an increase of ~13M reads for a 5bp increase in sequence. Not too shabby, but less than I expected.

the 15bp key should be sufficient to capture all the complexity in a sample, but with SE reads it's hard to tell, and the addition of PE makes a huge difference in the information available to detect PCR dups.

I guess we just have different sample types. As I mentioned, our usual type of data is from small RNA-seq, where PE is not really useful, and we use the double barcodes to get rid of the PCR duplicates. Working with piRNAs we just can't rely on the RNA sequence alone, without barcodes.

The type of data I used super-deduper for is even less conventional, and even though according to the paper there should be little duplication, I really can't be sure.

Anyway, thank you for looking into this. I will be sure to praise super-deduper.

Cheers, António