JoseBlanca / seq_crumbs

Little sequence file utilities meant to work within Unix pipelines

Question about split_matepairs: extensions and orphan reads. #11

Closed lindenb closed 9 years ago

lindenb commented 9 years ago

Hi, I have been asked to use split_matepairs to split sequences produced by an Ion Torrent. When I look at the reads, here are the names I see:

Why are there two extensions for the names, _pl.part1 or \1? Why are there some orphan reads?

(...)
@LWVN9:05537:02499_pl.part1
@LWVN9:05537:02499_pl.part2
@LWVN9:04428:02076_pl.part1
@LWVN9:04428:02076_pl.part2
@LWVN9:07344:10221\1
@LWVN9:07344:10221\2
@LWVN9:02055:09950\1
@LWVN9:02055:09950\2
@LWVN9:08140:00684 <=================== orphan
@LWVN9:06292:00982\1
@LWVN9:06292:00982\2
@LWVN9:04209:12170 

Thanks for your help.

P.

JoseBlanca commented 9 years ago

Hi. Those two different extensions are confusing, you're right. We have changed them to _pl\1 and \1, and the change has been committed. The difference between them is quite subtle and you can ignore it for the most part. Imagine you have both parts of the mate pair in a structure like:

read MATE1linkerMATE2

In this case the program would generate two sequences:

read\1 MATE1
read\2 MATE2

Now imagine that, because of a sequencing error or a problem during the ligation, the linker is not complete:

read MATE1linkeMATE2

In this case the program would generate two sequences:

read\1 MATE1
read\2 ATE2

As you can see, the splitter suspects that there's a problem at the ligation site between the linker and the second half, and it tries to be conservative by removing the few nucleotides that would complete the linker length, even if those nucleotides do not match the linker sequence. In this case we add _pl to the name to indicate that the linker match was only partial (partial linker).

There are orphans because in some reads the linker is found at the very beginning or end, or is not found at all. In those cases you don't get two fragments, just one. If you want the orphans in a different file, you can run pair_matcher after split_matepairs has finished.

Finally, just to cover all cases, you may also find some names with _mlc. Those appear when the linker is found more than once.
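To make the naming rules concrete, here is a minimal Python sketch of the behaviour described above. It is not the seq_crumbs implementation: the function name split_on_linker, the min_partial parameter, and the exact-substring search are all assumptions made for illustration (the real tool is not limited to exact matches, and _mlc handling is left out).

```python
def split_on_linker(name, seq, linker, min_partial=4):
    """Toy mate-pair splitter (hypothetical, for illustration only).

    Returns a list of (read_name, subsequence) tuples.
    """
    # Try the full linker first, then progressively shorter prefixes
    # of it, to mimic a linker that is truncated at its end.
    for match_len in range(len(linker), min_partial - 1, -1):
        start = seq.find(linker[:match_len])
        if start == -1:
            continue
        # Be conservative: always skip a full linker length from the
        # match start, even when only part of the linker matched.
        end = start + len(linker)
        left, right = seq[:start], seq[end:]
        # "_pl" marks a partial linker match.
        tag = "" if match_len == len(linker) else "_pl"
        if left and right:
            return [(f"{name}{tag}\\1", left), (f"{name}{tag}\\2", right)]
        # Linker at the very beginning or end: only one fragment (orphan).
        return [(name, left or right)]
    # Linker not found at all: the read is kept whole as an orphan.
    return [(name, seq)]


full = split_on_linker("read", "MATE1linkerMATE2", "linker")
partial = split_on_linker("read", "MATE1linkeMATE2", "linker")
orphan = split_on_linker("read", "linkerMATE2", "linker")
```

With the example reads from this thread, full yields the \1 and \2 fragments MATE1 and MATE2, partial yields _pl names with the second fragment trimmed to ATE2, and orphan yields a single read with no suffix.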

lindenb commented 9 years ago

Thanks !