dstreett / Super-Deduper

An application to remove PCR duplicates from high throughput sequencing runs.

Deduplicating following adapter and quality trimming with Cutadapt #43

Closed by KmKingsland 7 years ago

KmKingsland commented 7 years ago

Hello, my name is Kevin. I am doing QC on a set of RNA sequencing data: 150 bp PE reads from one lane of an Illumina HiSeq 4000. I have 16 samples in total, giving 32 starting files (R1 and R2 for each of the 16). I used Cutadapt in paired-end mode to remove adapter sequences and perform quality trimming. This generated a total of 64 files, 4 per sample (val R1, val R2, unpaired R1, and unpaired R2 for each sample name).

I tried using SuperDeduper to remove duplicate reads from these files using the scripts below. I wanted to generate a single output_nodup_PE1.fastq output file and a single output_nodup_PE2.fastq output file, if possible. However, after running the script I had a single PE1 output file, about 55 MB in size, and a single PE2 output file that was empty (0 MB).
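The call was roughly of this shape (file names here are placeholders and the option names are my reading of the Super-Deduper help text, so they should be checked against `super_deduper --help`; the exact command is in the attached script below):

```bash
# Approximate shape of the run (placeholder file names; see the attached
# script for the exact command). The goal was a combined
# output_nodup_PE1.fastq / output_nodup_PE2.fastq across samples.
super_deduper \
    -1 sample01_val_R1.fastq \
    -2 sample01_val_R2.fastq \
    -p output_nodup
```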

Did I make a mistake in my script that prevented the PE2 output file from being populated? Should I perform separate runs with my paired and unpaired files?

I want to remove duplicates so that I have the simplest possible files to start from for my de novo transcriptome assembly.

Job submission file:

RNAsupdedupJob.txt

SuperDeDuper Script:

RNAsupdedup30Oct17.txt

samhunter commented 7 years ago

Hi Kevin, We aren't supporting SuperDeduper any more because it has been replaced with a new version that comes with HTStream (https://github.com/ibest/HTStream).

That being said, the original idea behind SuperDeduper was that it would be run first on the PE data (before trimming, etc.), and the resulting de-duplicated reads would then be processed through the rest of the cleaning pipeline. If you do any quality trimming, it is likely that you will trim 5' bases off at least some of your reads, which breaks the assumption SuperDeduper operates under (that duplicate reads start at the same 5' position) and can produce strange results.
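As a rough sketch, the intended order looks like this (option names are illustrative and should be checked against each tool's --help; file names and the adapter sequence are placeholders):

```bash
# 1) Deduplicate the raw, untrimmed PE reads first, while the 5' ends are intact.
#    (Option names are illustrative; check super_deduper --help.)
super_deduper -1 sample01_raw_R1.fastq -2 sample01_raw_R2.fastq -p sample01_nodup

# 2) Then adapter/quality trim the de-duplicated reads with Cutadapt.
#    (AGATCGGAAGAGC is a placeholder adapter sequence.)
cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC -q 20 \
    -o sample01_trimmed_R1.fastq -p sample01_trimmed_R2.fastq \
    sample01_nodup_PE1.fastq sample01_nodup_PE2.fastq
```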

If you haven't done it already, I would recommend running a small subset of reads through whatever cleaning tools you decide to use in an interactive session, to check that things are set up correctly and behaving as you expect. Once you are confident in the results, run the full pipeline on all files.
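For example, a quick way to pull a small test subset out of uncompressed FASTQ files is just:

```bash
# First 25,000 read pairs (100,000 lines) from each file, assuming
# plain, uncompressed FASTQ with 4 lines per record.
head -n 100000 sample01_R1.fastq > test_R1.fastq
head -n 100000 sample01_R2.fastq > test_R2.fastq
```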

Sam

KmKingsland commented 7 years ago

Thanks so much, Sam. I will try changing my order of operations and run SuperDeduper first, using a subset of my data.

I will also look into HTStream.

Thanks again, Kevin