[ERROR] out of bounds. Start: 5 End: 14 Length: 1

sklages commented 9 years ago

Hi,

I want to feed an adaptor-trimmed MP library to super_deduper and ran into the problem mentioned in the title. I know from the docs that super_deduper dies when encountering such reads. Is there a way to make super_deduper simply discard these reads (pairs)? This would be a more convenient behavior.

I have been trimming the data with Illumina's nxtrim without specifying a "minlength". Thus I think such reads (pairs) must be skipped ..

If not, consider this a feature request ;-)

As I don't have any results yet, I wonder if super_deduper appends something like "/1" and "/2" to the headers of interleaved output? Just another feature (at least an optional one) request for the software.

best, Sven

dstreett commented 9 years ago

Hey @sklages,

Thanks for the note. I agree with that suggestion; I will add the skipping of short pairs (with a warning at the end). I will also double check and ensure /1 and /2 are added to the end.

Just for my own curiosity, where did you hear about this application? Thank you for your interest.

Regards, David

sklages commented 9 years ago

Hi David, I was just searching for something on biostars and someone mentioned super_deduper in a thread[1]. So lucky coincidence ;-)

A final statistical report of what has been done/clipped/removed should be generated when super_deduper has finished doing its jobs ..

You may want to consider announcing/introducing super_deduper on seqanswers.[2]

best, Sven

[1]=https://www.biostars.org/p/160084/#160089
[2]=http://seqanswers.com/

dstreett commented 9 years ago

Hey @sklages ,

Very cool! Thanks for showing me that.

I hope Super-Deduper works well for you. You might want to try running Super-Deduper before adapter trimming. Our pipeline here is Contaminant Screening -> Deduplication -> Trimming Based on Q-Score -> Adapter Trimming/Overlap.

Yes, Super-Deduper will output stats on what happened once it has finished running. Thank you for the seqanswers suggestion. We are hoping to get an app note and a paper out here in the next couple of weeks on Super-Deduper, so probably after the app note.

I added the request - there will be an error message, but the application will keep on running. I also added #0/1 and #0/2 to the end of the ids (if they were absent). Please, let me know if you have any other issues or questions!
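For readers following along, the skip-and-warn behavior described above could look roughly like the sketch below. This is not Super-Deduper's actual code; the function name `drop_short_pairs` and the `min_length` parameter are hypothetical, and the threshold a real run needs depends on how the tool picks its key region.

```python
import sys
from typing import Iterable, List, Tuple


def drop_short_pairs(
    pairs: Iterable[Tuple[str, str]], min_length: int
) -> Tuple[List[Tuple[str, str]], int]:
    """Keep only pairs where both reads have at least min_length bases.

    Hypothetical helper for illustration; min_length is an assumed parameter,
    not a documented Super-Deduper option.
    """
    kept: List[Tuple[str, str]] = []
    skipped = 0
    for seq1, seq2 in pairs:
        # A pair is dropped as a whole if either mate is too short,
        # so the two output files stay in sync.
        if len(seq1) < min_length or len(seq2) < min_length:
            skipped += 1
            continue
        kept.append((seq1, seq2))
    if skipped:
        # Report once at the end instead of aborting on the first short read.
        print(
            f"[WARNING] skipped {skipped} read pair(s) shorter than "
            f"{min_length} bases",
            file=sys.stderr,
        )
    return kept, skipped
```

Dropping the whole pair rather than a single mate keeps paired files aligned record by record.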

Regards, David

msettles commented 9 years ago

The implementation for #0/1, #0/2 is incorrect and should only be added with a flag. David, whenever producing fastq output, always follow https://en.wikipedia.org/wiki/FASTQ_format exactly. The #0/1, #0/2 suffix is the pre-Casava 1.8 format; 1:Y:18:ATCACG is the current format. While I agree Super_Deduper can easily output pre-Casava 1.8 (with a flag), the current output (@HWI-700593F:551:H723CBCXX:1:1205:20213:63859 1:N:0:CAGATCTA#0/1) is a mix of old and new and NOT a valid fastq header format. Add a flag for pre-Casava 1.8 output and then follow the wiki page formatting.

sklages commented 9 years ago

@msettles, well, you're right that the postfixing of the header should be optional, implemented with a flag. That's what I initially wrote. Concerning the above-mentioned header format, you are mixing up two different things: while it is a valid fastq header, it's not a valid Illumina-formatted header. Some software recognizes their (weird) format (they introduced a whitespace in the header, making everything after this whitespace an optional description, including the read number!), while other software discards this optional description, leaving the read ID non-unique in interleaved paired-end data.

But before dedupping, I have possibly already error-corrected my data. In this case the illumina header has already been replaced by some other header/info .. usually when I use non-illumina software.

Finally, software for deduping fastq files need not care about header formats when fed separate files for PE data. Just leave the header as it is or optionally add info (via flags, e.g. read number, occurrences etc.), as most downstream software (e.g. shotgun assemblers) requires some special header/input format for full functionality. You might even think about creating custom headers, e.g. just like tally does [1].

Just my 2p, Sven

[1]=ftp://ftp.ebi.ac.uk/pub/contrib/enrightlab/kraken/reaper/src/reaper-15-065/doc/tally.html

sklages commented 9 years ago

Just to be more precise concerning "/1" and "/2": these postfixes should be added to the ID (the part of the header string up to the first whitespace). If there is no whitespace in the header then it is OK to append the postfixes at the end of the line:

@DD80000:220:C5TR3ACXX:8:1101:1359:1981 1:N:0:AGTTCC
@DD80000:220:C5TR3ACXX:8:1101:1359:1981 2:N:0:AGTTCC
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The '^' part is the ID, the rest simply a description and may be considered optional.

Result could look like:

@DD80000:220:C5TR3ACXX:8:1101:1359:1981/1
@DD80000:220:C5TR3ACXX:8:1101:1359:1981/2

or, alternatively, knowing "original illumina header" format:

@DD80000:220:C5TR3ACXX:8:1101:1359:1981:N:0:AGTTCC/1
@DD80000:220:C5TR3ACXX:8:1101:1359:1981:N:0:AGTTCC/2

(20151201: edited, headers before /1 and /2 must be unique)
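A minimal Python sketch of the first variant above (append the postfix to the ID and drop the description) might look like this. The function name `add_read_suffix` is made up for illustration and is not part of super_deduper:

```python
def add_read_suffix(header: str, read_number: int) -> str:
    """Return the header's ID with /1 or /2 appended; the description is dropped.

    Hypothetical helper for illustration, not super_deduper's implementation.
    """
    if read_number not in (1, 2):
        raise ValueError("read_number must be 1 or 2")
    # The ID is everything up to the first whitespace; the rest is description.
    parts = header.split(None, 1)
    if not parts:
        raise ValueError("empty header")
    read_id = parts[0]
    suffix = f"/{read_number}"
    # Leave the ID alone if it already carries the requested postfix.
    if read_id.endswith(suffix):
        return read_id
    return read_id + suffix


if __name__ == "__main__":
    h1 = "@DD80000:220:C5TR3ACXX:8:1101:1359:1981 1:N:0:AGTTCC"
    h2 = "@DD80000:220:C5TR3ACXX:8:1101:1359:1981 2:N:0:AGTTCC"
    print(add_read_suffix(h1, 1))
    print(add_read_suffix(h2, 2))
```

If there is no whitespace in the header, the split yields the whole line as the ID, so the postfix ends up at the end of the line as described.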

Just some ideas, Sven

dstreett / Super-Deduper

[ERROR] out of bounds. Start: 5 End: 14 Length: 1 #13