Open BenjaminSchwessinger opened 8 years ago
In some unusual circumstances, it is possible that the cumsum of the haplotigs is bigger than the primary contig. This may happen when some repeats recruit extra parts in the connected graph components. I do need to see what the nature of the repeat is and possibly derive new heuristics to handle such cases. I don't quite understand what you mean by "all corresponding haplotigs contig do not contain any sequence in respective 1-hasm/ folder and the final all_p_ctg.fa file"? Can you elaborate?
Apologies, that was badly worded at the end.
The p_ctg.xxx.fa file in 1-hasm/xxx for these two cases, where sum(h_ctg length) > p_ctg length, contains only a header and no actual contig sequence.
Any idea why?
Would you advise simply using the respective Falcon p_ctg sequence for now?
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/PacificBiosciences/FALCON_unzip/issues/20#issuecomment-226686680, or mute the thread https://github.com/notifications/unsubscribe/AGLMhqE90IszG9tapx1BspTpkmCWCKWxks5qMjb1gaJpZM4I4DMN.
Benjamin Schwessinger PhD. Discovery Early Career Research Award Fellow Rathjen Lab Division of Plant Science Research School of Biology College of Medicine, Biology, and Environment Linnaeus Building (134), Linnaeus Way The Australian National University Canberra ACT 0200 Australia
M: +61 405 919 737 benjamin.schwessinger@anu.edu.au lab webpage http://tinyurl.com/BenSchwessinger
twitter: @schwessinger http://twitter.com/schwessinger blog: http://blushgreengrassatafridayafternoon.wordpress.com/ google scholar: Benjamin Schwessinger http://scholar.google.com.au/citations?user=lEhYW3QAAAAJ&hl=en
I will need to see the content of the directory. Is it possible to compress it and put it on some server so I can take a look?
Sure, what do you want?
The whole unzip folder or only the 1-hasm folder?
just the contig dir inside 1-hasm to start with...
Thanks, Jason, for looking at this.
You can find the two contig dirs in question here:
https://www.dropbox.com/sh/z6km7p5cuk4vloh/AADI0ULWN0lATxVMFlDpS6P0a?dl=0
In a little while the whole unzip folder will be there too, in case you need it.
Can you also put the 2-falcon-asm in the dropbox? Thanks.
This is what happens for 000017F. FALCON-Unzip tries to maintain the same begin and end nodes as in the initial assembly. Somehow, perhaps because of a bug or incomplete data somewhere, it fails to find the initial path. (The code is written on the assumption that there is always a path, but there might be some unexpected boundary cases.) Since the code fails to find the primary path, it classifies the remaining paths as haplotigs.
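See the attached plot where I align h_ctg_all.000017F.fa to 000017F_ref.fa: scr2016-06-17_12-04-27_pm https://cloud.githubusercontent.com/assets/1071734/16161770/33766276-3484-11e6-95f8-6af36e876b62.jpg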
You do see two major haplotypes. I will need the graph information in 2-falcon-asm to see what might be causing the code to fail to find a primary path (contig).
Thanks Jason. Aligning h/a and p to each other was also my first idea. Yet Friday beers last night were quicker. Having a quick look right now as well, it seems that 000017F-1 could be one h-contig and 000017F-5+3+2 the other. There also seem to be quite a few repeats.
In the other case of 000232F it appears that the p_ctg is simply contained within the h_ctg (see attached).
I dropped 2-asm-falcon as v8_1_falcon.tar.gz into the dropbox.
Thanks for looking at it!
Off to my son's bday!
OK, this is what happens to the contig 000017F: the associated contig graph is a circle (see attached assembly graph: scr2016-06-17_05-41-01_pm https://cloud.githubusercontent.com/assets/1071734/16168026/ae05fd5c-34b2-11e6-98d1-c9e04f2c781f.jpg). When the code looks for the primary path in the haplotig layout, it returns an empty path because the begin and end nodes are the same. We can catch such a condition in the future. It would also be interesting to see whether the DNA molecule/chromosome is indeed circular or whether the circularization is caused by repeats.
The contig 000017F has the same begin and the end node 000241361:E
$ cat 2-asm-falcon/ctg_paths | grep 000017F | cut -d " " -f 1-4
000017F ctg_linear 000241361:E~NA~000178652:E 000241361:E
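The same check can be scripted across every contig. The following is only a minimal sketch, not the FALCON-Unzip code: the function name `find_circular_contigs` is made up for illustration, and it assumes each line of ctg_paths has the four leading fields shown above, with the begin node being the text before the first "~" in the third field.

```python
def find_circular_contigs(lines):
    """Return ids of contigs whose path begins and ends on the same node.

    Assumes each line looks like the example from this thread:
    <ctg_id> <ctg_type> <first_edge> <end_node> ...
    """
    circular = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        ctg_id, _ctg_type, first_edge, end_node = fields[:4]
        # Assumption: the first edge is written as "begin~via~next",
        # so the begin node is the part before the first "~".
        begin_node = first_edge.split("~")[0]
        if begin_node == end_node:
            circular.append(ctg_id)
    return circular


# The line reported above for contig 000017F.
example = ["000017F ctg_linear 000241361:E~NA~000178652:E 000241361:E"]
print(find_circular_contigs(example))
```

To run it on a real assembly, pass an open file handle for 2-asm-falcon/ctg_paths instead of the example list.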
We will close this issue once we implement some circle detection and lay out such contigs correctly.
Thanks for that, Jason. Indeed, I will keep 000017F_ref.fa for now as part of the primary contigs for downstream analysis.
I will have a look into the circularization bit and let you know how it goes, in case you are interested.
Thanks!
Benjamin
Thanks for sharing early access to this exciting new tool. After some tweaking, I have been running the pipeline successfully up to the quiver.py stage on a cluster that runs FALCON_04, Falcon Unzip and SMRT analysis smrtanalysis_2.3.0.140936. Both Falcon and Falcon_unzip run like a beauty. Quick summary below. v8_1_assembly_summary.txt
I found that two primary contigs, whose length is shorter than the cumsum of all corresponding haplotigs, do not contain any sequence in the respective 1-hasm/ folder or in the final all_p_ctg.fa file. E.g. 000017F: p_ref.fa length = 1553122 vs. cumsum h_ctg.fa = 2911288; for the second contig: 27799 vs. 47732.
Comments or suggestions on how to handle this?
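For anyone who wants to screen their own output for the same symptom, here is a rough sketch of the comparison described above. It is not part of FALCON-Unzip: the helper names `fasta_lengths` and `check_contig` are hypothetical, and the toy records and haplotig names stand in for the real p_ctg and h_ctg FASTA files of one contig.

```python
import io


def fasta_lengths(handle):
    """Map each FASTA record name to its sequence length (0 if header only)."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def check_contig(p_lengths, h_lengths):
    """Compare primary contig length with the summed haplotig length."""
    p_total = sum(p_lengths.values())
    h_total = sum(h_lengths.values())
    return {
        "p_total": p_total,
        "h_total": h_total,
        # records that have a header but no sequence
        "empty_records": sorted(n for n, l in p_lengths.items() if l == 0),
        "haplotigs_longer": h_total > p_total,
    }


# Toy data: a header-only primary contig and two short, made-up haplotigs.
p_ctg = io.StringIO(">000017F\n")
h_ctg = io.StringIO(">000017F_001\nACGT\n>000017F_002\nACG\n")
report = check_contig(fasta_lengths(p_ctg), fasta_lengths(h_ctg))
print(report)
```

With real data, open the p_ctg and h_ctg FASTA files for a contig in 1-hasm and pass the file handles in place of the StringIO objects.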