PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

Primary and Associated Contigs: are they truly haplotypes? #458

Closed danshu closed 7 years ago

danshu commented 7 years ago

Hi,

As mentioned on this page (https://github.com/PacificBiosciences/FALCON/wiki/Tips), for a haploid genome, associated contigs are likely generated by (1) residual sequencing errors and (2) segmental duplications; for a diploid genome, most associated contigs will be local alternative alleles. However, when a diploid genome contains many long repeats (longer than the PacBio read length), could the associated contigs also be non-homologous sequences flanked by long repeats? Is this true for FALCON, or is FALCON able to avoid this kind of spurious associated contig?

Best, Quan

pb-jchin commented 7 years ago

There are many possibilities involving repeats. Do you have an example? If the bubbles are induced by repeats, then the alignment identity between the haplotigs and the primary contigs will be low. Computer code does not understand all possible biology, especially before a given scenario is known. We can speculate about what might happen, but we should always check the results with independent methods.
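As a rough first pass at the identity check suggested above, one could compare the k-mer content of an associated contig with the matching primary-contig sequence. This is only a sketch: k-mer Jaccard similarity is a crude proxy and not an alignment identity, the function names and the choice of `k` are assumptions made here, and reverse-complement matches are ignored. A proper check would use a sequence aligner.

```python
def kmers(seq, k=16):
    """Return the set of all k-mers in seq (upper-cased, forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(a, b, k=16):
    """Jaccard similarity of the k-mer sets of a and b.

    Values near 1.0 suggest closely related sequences; values near 0.0
    suggest the two sequences share little content. This is a proxy only
    and is NOT equivalent to alignment identity.
    """
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)
```

Pairs that score very low here are the ones worth inspecting with a real alignment.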

danshu commented 7 years ago

Thanks. I will check the results. If I find that some haplotigs show little identity with the primary contigs, I would like to construct the full "associated contig" by replacing the two nodes in the path of the corresponding primary contig. How can I do this with FALCON? Sorry, I'm not familiar with these assembly graphs. Or is it more convenient to simply replace the "a_ctg_base" sequence in the primary contig with the corresponding "a_ctg" sequence?

Another question: why do "a_ctg_base.fa" and "a_ctg.fa" contain different numbers of sequences? Does the associated contig ">000000F-001-01" correspond to ">000000F-001-00" in a_ctg_base.fa?

Thanks, Quan

pb-jchin commented 7 years ago

a_ctg_base.fa contains the sequences in the primary contig that correspond to the alternative contigs.

Does the associated contig ">000000F-001-01" correspond to ">000000F-001-00" in a_ctg_base.fa?

yes.
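Since a_ctg_base.fa holds the primary-contig sequence that corresponds to each alternative contig, the substitution asked about earlier could be sketched as a plain string replacement. This assumes the a_ctg_base sequence occurs as an exact substring of the primary contig (possibly on the reverse strand), which is not confirmed in this thread; the function names are hypothetical and not part of FALCON.

```python
def revcomp(seq):
    """Reverse complement of a DNA string (A/C/G/T, either case)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def swap_in_alternative(primary, base, alt):
    """Replace the first exact occurrence of `base` in `primary` with `alt`.

    If `base` is not found on the forward strand, its reverse complement is
    tried and `alt` is reverse-complemented to match. Returns None when no
    exact occurrence exists (assumption: exact substring matching suffices).
    """
    pos = primary.find(base)
    if pos == -1:
        base, alt = revcomp(base), revcomp(alt)
        pos = primary.find(base)
        if pos == -1:
            return None
    return primary[:pos] + alt + primary[pos + len(base):]
```

A `None` result means the exact-substring assumption failed for that pair, in which case the graph data would be needed instead.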

pb-jchin commented 7 years ago

Also, these are not "haplotigs" yet. They are just alternative branches of the bubbles in an assembly graph. They typically contain some haplotype-specific structural variations.

danshu commented 7 years ago

Thanks for your explanation! Sorry that I didn't describe my questions clearly.

  1. Do associated contigs (a_ctg.fa) correspond only to the bubble part, or do they also include the sequences flanking the bubbles?
  2. If associated contigs (a_ctg.fa) don't contain the sequences flanking the bubbles, how can I recover these sequences?
  3. Why do "a_ctg_base.fa" and "a_ctg.fa" contain different numbers of sequences? For example, I may find ">000000F-001-00" in a_ctg_base.fa but no corresponding associated contig ">000000F-001-01" in a_ctg.fa.

Best, Quan

pb-jchin commented 7 years ago
  1. Do associated contigs (a_ctg.fa) correspond only to the bubble part, or do they also include the sequences flanking the bubbles?
  2. If associated contigs (a_ctg.fa) don't contain the sequences flanking the bubbles, how can I recover these sequences?

No, they do not contain flanking sequences. Each associated contig is a faithful representation of the underlying string graph path, which does not include the flanking regions. You will need to fetch the graph data to pad flanking sequences if necessary.

  3. Why do "a_ctg_base.fa" and "a_ctg.fa" contain different numbers of sequences? For example, I may find ">000000F-001-00" in a_ctg_base.fa but no corresponding associated contig ">000000F-001-01" in a_ctg.fa.

>000000F-001-00 in a_ctg_base.fa is the counterpart of >000000F-001-01 in the primary contigs. You can ignore >000000F-001-00 unless you want to quickly find internal variants between >000000F-001-01 and the primary contig.
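To see which entries in a_ctg_base.fa have no partner in a_ctg.fa, the IDs from the two files can be paired by name. The sketch below relies on the correspondence confirmed in this thread (">000000F-001-01" pairs with ">000000F-001-00") and assumes, without confirmation, that every associated contig maps to the "-00" entry sharing its prefix. The helper names are hypothetical.

```python
def fasta_ids(path):
    """Collect the first whitespace-delimited token of each FASTA header."""
    ids = set()
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def base_id_for(alt_id):
    """Map an associated-contig ID to its assumed base ID.

    Example: '000000F-001-01' -> '000000F-001-00'.
    """
    prefix, _, _ = alt_id.rpartition("-")
    return prefix + "-00"


def unmatched_base_ids(a_ctg_ids, base_ids):
    """Base IDs that no associated contig maps to, in sorted order."""
    covered = {base_id_for(i) for i in a_ctg_ids}
    return sorted(set(base_ids) - covered)
```

For example, `unmatched_base_ids(fasta_ids("a_ctg.fa"), fasta_ids("a_ctg_base.fa"))` would list the base sequences without an associated contig.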