PacificBiosciences / FALCON_unzip

Making diploid assembly becomes common practice for genomic study
BSD 3-Clause Clear License

Losing p contig sequence #20

Open BenjaminSchwessinger opened 8 years ago

BenjaminSchwessinger commented 8 years ago

Thanks for sharing early access to this exciting new tool. After some tweaking, I have been running the pipeline successfully up to the quiver.py stage on a cluster that runs FALCON_04, Falcon Unzip, and SMRT Analysis (smrtanalysis_2.3.0.140936). Both Falcon and Falcon_unzip run like a beauty. Quick summary below: v8_1_assembly_summary.txt

I found that two primary contigs, whose length is shorter than the cumsum of all corresponding haplotigs, do not contain any sequence in the respective 1-hasm/ folder and the final all_p_ctg.fa file. E.g. for 000017F, p_ref.fa length = 1553122 vs. cumsum h_ctg.fa = 2911288; for the second contig, 27799 vs. 47732.

Comments or suggestions on how to handle this?

pb-jchin commented 8 years ago

In some unusual circumstances, the cumsum of the haplotigs can be bigger than the primary contig. This may happen when some repeats recruit extra parts in the connected graph components. I do need to see the nature of the repeat and possibly derive new heuristics to handle such cases. I don't quite understand what you mean by "all corresponding haplotigs contig do not contain any sequence in respective 1-hasm/ folder and the final all_p_ctg.fa file". Can you elaborate?

BenjaminSchwessinger commented 8 years ago

Apologies, that was badly worded at the end.

For these two cases, where sum(h_ctg length) > p_ctg length, the p_ctg.xxx.fa file in 1-hasm/xxx contains only a header and no actual contig sequence.

Any idea why?

Would you advise simply using the respective Falcon p_ctg sequence for now?
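A minimal sketch of how the symptom described above could be checked: it reports FASTA records that have a header but no sequence, and compares the total primary-contig length with the cumulative haplotig length. The function names are hypothetical, and the file layout (one primary-contig FASTA and one haplotig FASTA per contig directory) is an assumption based on the file names mentioned in this thread; this is not part of FALCON-Unzip.

```python
def fasta_lengths(path):
    """Return {record_id: sequence_length} for a FASTA file.

    A record that consists of a header line only gets length 0.
    """
    lengths = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Keep only the first whitespace-delimited token as the ID.
                fields = line[1:].split()
                current = fields[0] if fields else ""
                lengths[current] = 0
            elif current is not None:
                lengths[current] += len(line)
    return lengths


def check_contig(p_ctg_fasta, h_ctg_fasta):
    """Summarise one primary-contig FASTA against its haplotig FASTA."""
    p_lengths = fasta_lengths(p_ctg_fasta)
    h_lengths = fasta_lengths(h_ctg_fasta)
    return {
        # IDs whose record is a header with no sequence lines.
        "empty_primary": sorted(k for k, v in p_lengths.items() if v == 0),
        "primary_total": sum(p_lengths.values()),
        "haplotig_total": sum(h_lengths.values()),
    }
```

Calling `check_contig` on the primary-contig and haplotig FASTA files of one contig directory would list any header-only primary records under `empty_primary`.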

Benjamin Schwessinger PhD. Discovery Early Career Research Award Fellow Rathjen Lab Division of Plant Science Research School of Biology College of Medicine, Biology, and Environment Linnaeus Building (134), Linnaeus Way The Australian National University Canberra ACT 0200 Australia

M: +61 405 919 737 benjamin.schwessinger@anu.edu.au lab webpage http://tinyurl.com/BenSchwessinger

twitter: @schwessinger http://twitter.com/schwessinger blog: http://blushgreengrassatafridayafternoon.wordpress.com/ google scholar: Benjamin Schwessinger http://scholar.google.com.au/citations?user=lEhYW3QAAAAJ&hl=en

CRICOS Provider # 00120C

pb-jchin commented 8 years ago

I will need to see the content of the directory. Is it possible to compress it and put it on some server so I can take a look?

BenjaminSchwessinger commented 8 years ago

Sure, what do you want: the whole unzip folder or only the 1-hasm folder?

pb-jchin commented 8 years ago

Just the contig dir inside 1-hasm to start with.

BenjaminSchwessinger commented 8 years ago

Thanks Jason for looking at this.

You can find the two contig dirs in question here:

https://www.dropbox.com/sh/z6km7p5cuk4vloh/AADI0ULWN0lATxVMFlDpS6P0a?dl=0

In a little while the whole unzip folder will be there as well, in case you need it.

pb-jchin commented 8 years ago

Can you also put the 2-asm-falcon directory in the dropbox? Thanks.

pb-jchin commented 8 years ago

This is what happens for 000017F. FALCON-Unzip tries to maintain the same begin and end as in the initial assembly. Somehow, perhaps because of a bug or incomplete data somewhere, it fails to find the initial path. (The code is written on the assumption that there is always a path, but there might be some unexpected boundary cases.) Since the code fails to find the primary path, it classifies the remaining paths as haplotigs.

See the attached plot, where I align h_ctg_all.000017F.fa to 000017F_ref.fa. scr2016-06-17_12-04-27_pm https://cloud.githubusercontent.com/assets/1071734/16161770/33766276-3484-11e6-95f8-6af36e876b62.jpg

You do see two major haplotypes. I will need the graph information in 2-asm-falcon to see what might cause the code to fail to find a primary path (contig).

BenjaminSchwessinger commented 8 years ago

Thanks Jason. Aligning h/a and p to each other was also my first idea. Yet Friday beers last night were quicker. Having a quick look right now as well, it seems that 000017F-1 could be one h-contig and 000017F-5+3+2 the other. There also seem to be quite a few repeats.

In the other case of 000232F it appears that the p_ctg is simply contained within the h_ctg (see attached).

I dropped 2-asm-falcon as v8_1_falcon.tar.gz into the dropbox.

Thanks for looking at it!

///

Off to my son's bday!

pb-jchin commented 8 years ago

OK, this is what happens to the contig 000017F: the associated contig graph is a circle (see attached assembly graph). When the code looks for the primary path in the haplotig layout, it returns an empty path because the begin and end nodes are the same. We can catch such a condition in the future. It would also be interesting to see whether the DNA molecule/chromosome is indeed circular or whether the circularization is caused by repeats.

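scr2016-06-17_05-41-01_pm https://cloud.githubusercontent.com/assets/1071734/16168026/ae05fd5c-34b2-11e6-98d1-c9e04f2c781f.jpg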
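To illustrate the failure mode described above, here is a toy sketch (not FALCON-Unzip's actual code) of a path search that treats "begin equals end" as "already there" and therefore returns an empty path on a circular graph, together with one possible way to catch that case. The graph representation, the function names, and the node labels are all assumptions made for illustration.

```python
from collections import deque


def find_path(graph, begin, end):
    """Breadth-first search returning the node list from begin to end.

    Like many path routines, it treats begin == end as a trivial case
    and returns an empty path.
    """
    if begin == end:
        return []
    parents = {begin: None}
    queue = deque([begin])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt in parents:
                continue
            parents[nxt] = node
            if nxt == end:
                # Walk the parent links back to begin, then reverse.
                path = [nxt]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return []


def find_path_allowing_circle(graph, begin, end):
    """Same search, but walk around the circle when begin == end."""
    if begin != end:
        return find_path(graph, begin, end)
    for nxt in graph.get(begin, ()):
        if nxt == begin:
            return [begin, begin]  # self-loop
        rest = find_path(graph, nxt, begin)
        if rest:
            return [begin] + rest  # first circle found, not necessarily the shortest
    return []
```

On a toy circle `{"A": ["B"], "B": ["C"], "C": ["A"]}`, `find_path(graph, "A", "A")` gives an empty list, while `find_path_allowing_circle(graph, "A", "A")` walks the full circle back to `"A"`.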

The contig 000017F has the same begin and end node, 000241361:E:

$ cat 2-asm-falcon/ctg_paths | grep 000017F | cut -d " " -f 1-4
000017F ctg_linear 000241361:E~NA~000178652:E 000241361:E

We will close this issue once we implement circle detection and lay out such contigs correctly.
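Until such a check exists in the pipeline, the same test as the grep/cut command above could be run over every contig. The sketch below assumes, based only on the single line shown in this thread, that the third field of a ctg_paths line is the first edge (with the begin node before the first `~`) and the fourth field is the end node; the function names are hypothetical.

```python
def parse_ctg_path_line(line):
    """Split one ctg_paths line into (contig_id, type, begin_node, end_node).

    Assumed layout: field 3 is the first edge, "begin~...~next";
    field 4 is the end node.
    """
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields, got {len(fields)}")
    ctg_id, ctg_type, first_edge, end_node = fields[:4]
    begin_node = first_edge.split("~")[0]
    return ctg_id, ctg_type, begin_node, end_node


def circular_contigs(lines):
    """Yield IDs of contigs whose begin node equals their end node."""
    for line in lines:
        if not line.strip():
            continue
        ctg_id, _ctg_type, begin_node, end_node = parse_ctg_path_line(line)
        if begin_node == end_node:
            yield ctg_id
```

Passing the open `2-asm-falcon/ctg_paths` file to `circular_contigs` would list candidates such as 000017F, which is labelled `ctg_linear` even though its begin and end nodes coincide.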

BenjaminSchwessinger commented 8 years ago

Thanks for that Jason. Indeed I will keep the 000017F_ref.fa for now as part of the primary contigs for downstream analysis.

I will look into the circularization bit and let you know how it goes, in case you are interested.

Thanks!

Benjamin
