alexdobin / STAR

RNA-seq aligner
MIT License
1.85k stars 506 forks source link

paired end read names of reverse read taken from forward read #148

Closed thomasvangurp closed 8 years ago

thomasvangurp commented 8 years ago

Dear Alex,

I'm using STAR for something it was not designed for, mapping reduced representation bisulfite sequencing reads to a pseudo reference genome. This goes really well. One trick I apply is to make a ridiculously long read name containing the original sequence, whereas the true read is computationally converted C>T in the forward and G>A in the reverse read.

So,

@D00204R:84:C89ETANXX:8:2109:6159:3804_ACTGTGAC
ATGTGAT
+
IIIIIII

Now for the mate I would do something similar

@D00204R:84:C89ETANXX:8:2109:6159:3804_TGAGTACT
TAAATACT
+
IIIIIIII

The resulting SAM output will be: D00204R:84:C89ETANXX:8:2109:6159:3804_ACTGTGAC ATGTGAT D00204R:84:C89ETANXX:8:2109:6159:3804_ACTGTGAC TAAATACT

I wrote a custom script to modify the SAM file on the fly, to recreate the original sequence. This works perfectly for the forward and single end reads, but not for the reverse, as it seems that the name record in the output file is identical for r/1 and r/2 reads, which normally would never be an issue of course. However, in my scenario this makes it a bit hard. I'm opting for using " --outSAMreadID Number" combined with a sambamba sort on read name. Using the original fastq i can then create a good sam file with the original sequences. This is less than ideal as it's slower than having the sequence ready in the name directly .

If an easy fix could be made this would be very much appreciated!

Thanks, Thomas

mmterpstra commented 8 years ago

This might be somewhat inefficient but put the original sequence of both pairs in the header: then the read names will be indentical. For more effiency you might also consider a cigar like encoding (for the example:1m1t4m1t). Op 11 mei 2016 9:12 p.m. schreef "Thomas van Gurp" <notifications@github.com

:

Dear Alex,

I'm using STAR for something it was not designed for, mapping reduced representation bisulfite sequencing reads to a pseudo reference genome. This goes really well. One trick I apply is to make a ridiculously long read name containing the original sequence, whereas the true read is computationally converted C>T in the forward and G>A in the reverse read.

So,

@D00204R:84:C89ETANXX:8:2109:6159:3804_ACTGTGAC ATGTGAT + IIIIIII

Now for the mate I would do something similar

@D00204R:84:C89ETANXX:8:2109:6159:3804_TGAGTACT TAAATACT + IIIIIIII

The resulting SAM output will be: D00204R:84:C89ETANXX:8:2109:6159:3804_ACTGTGAC ATGTGAT D00204R:84:C89ETANXX:8:2109:6159:3804_ACTGTGAC TAAATACT

I wrote a custom script to modify the SAM file on the fly, to recreate the original sequence. This works perfectly for the forward and single end reads, but not for the reverse, as it seems that the name record in the output file is identical for r/1 and r/2 reads, which normally would never be an issue of course. However, in my scenario this makes it a bit hard. I'm opting for using " --outSAMreadID Number" combined with a sambamba sort on read name. Using the original fastq i can then create a good sam file with the original sequences. This is less than ideal as it's slower than having the sequence ready in the name directly .

If an easy fix could be made this would be very much appreciated!

Thanks, Thomas

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/alexdobin/STAR/issues/148

mmterpstra commented 8 years ago

for example:

@D00204R:84:C89ETANXX:8:2109:6159:3804_mt4mt_mama4m
ATGTGAT
+
IIIIIII
@D00204R:84:C89ETANXX:8:2109:6159:3804_mt4mt_mama4m
TAAATACT
+
IIIIIIII
alexdobin commented 8 years ago

Hi Thomas,

SAM format requires that the names of both mates are exactly the same - otherwise, it would be impossible to establish pairing between the alignments. I agree with @mmterpstra that the best approach would be to encode both sequences in the names of the reads in the fastq files (actually, you can do it only for read 1 file, since this is the one used in the SAM output.

Cheers Alex