PacificBiosciences / apps-scripts

Miscellaneous scripts for applications of PacBio systems
Other
25 stars 11 forks source link

Small overlap between two haplotigs. #3

Open a-velt opened 7 years ago

a-velt commented 7 years ago

Dear Sarah,

First, thank you for your great tools, especially the removeNestedHaplotigs script.

I created a similar tool by using blasr in order to find the overlapping haplotigs. Sometimes, one haplotig is fully included in one other haplotig, and sometimes the overlap is small.

So, by looking on some examples, I see that when two haplotigs overlap, the true information is on the two. I explain : in the overlap, when haplotig A carries the SNP (different from the primary contig), haplotig B equals to the primary contig and vice-versa. So, by merging this overlap and by keeping only the SNPs different of the primary contig, I think I can keep one haplotig instead of two.

Here is a PNG representing what I explain :

https://www.dropbox.com/s/x0ksi6bg3mbdy4a/chevauchement_001043F_002_001043F_003.PNG?dl=0

At the top is haplotig A, then haplotig B and then illumina reads, all aligned on the primary contigs.

According to you, is it possible to create a script in order to merge the overlap, by keeping the SNPs different from the primary contigs, in order to obtain an unique haplotig ? I don't want to remove them otherwise I think I will lose a lot of informations. And if I keep the two, I will have some difficulties at the end when I will want to align some data on my genome assembly (because my reads will multi-map in this area). Have you ever worked on this problem?

Thank you very much for your help. Best, Amandine

a-velt commented 7 years ago

I allow myself to re-write because I just noticed a problem with the removeNestedHaplotigs script. The script considers my haplotig 000000F_001 to be nested. Now, it is just overlapping on some bases at each of its ends. But on everything else (between the beginning and the end), it is not overlapping and brings a lot of informations. (I see that by aligning haplotigs on primary contigs with blasr).

Example: https://www.dropbox.com/s/csxvw74up5b0kia/example_nested.PNG?dl=0

How do we deal with this? We cannot remove the haplotig nor keep the overlapping part isn't it? And I don't understand why this haplotig is nested because in your python script you do : if a[1] < b[2] and a[1] > b[1] and a[2] > b[1] and a[2] < b[2] and by comparing the haplotig to the two contigs that it overlapped separately, it does not fall within these criteria, so it should not be nested.

Also, sometimes, I have the impression that some haplotigs are removes while they overlap no other haplotig. Is it normal ? Is it due by the fact you use nucmer to perform the alignment and I visualize the alignement performed with blasr ? So nucmer and blasr don't give the same results ?

a-velt commented 6 years ago

Hi Sarah,

Please, allow me to reiterate my question about haplotig overlaps, because I have 29% of my haplotigs which are included in an overlap (small overlap) and I don't know how to handle this. I don't know if I can let them in my assembly and store this overlap in an associated file, that biologist can consult after the alignment ? Or I have to create a tool to parse the alignment and keep the multi-mapped reads only in these regions ? Which solution did you choose for that ? Maybe it's not a problem in your case ?

Thank you very much. Have a nice day, Amandine