PacificBiosciences / FALCON_unzip

Making diploid assembly becomes common practice for genomic study
BSD 3-Clause Clear License

consensus-calling with arrow for contigs absent from all_p_ctg.fa but present in p_ctg.fa #75

Open sjin09 opened 7 years ago

sjin09 commented 7 years ago

I was able to run FALCON successfully (https://github.com/PacificBiosciences/FALCON/issues/514) on a human genome and am now running FALCON_UNZIP. FALCON_UNZIP also completed successfully, but some contigs are absent because their graph is circular and an empty path is returned (#20).

Here are the assembly statistics for p_ctg.fa:

number_of_contigs: 3,904
contig_N50: 24,379,051 bp
minimum_contig_length: 17 bp
maximum_contig_length: 109,706,220 bp
assembly length: 2,892,837,735 bp

The assembly statistics for all_p_ctg.fa:

number_of_contigs: 2,253
contig_N50: 24,379,667 bp
minimum_contig_length: 3,540 bp
maximum_contig_length: 109,710,721 bp
assembly length: 2,857,052,564 bp
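For anyone who wants to reproduce statistics of this kind, here is a minimal sketch. It assumes plain (uncompressed) FASTA input and takes the first whitespace-delimited token of each header as the contig ID. The function names (`read_fasta_lengths`, `n50`, `assembly_stats`) are illustrative and are not part of FALCON or FALCON_unzip.

```python
def read_fasta_lengths(path):
    """Return {contig_id: sequence_length} for a plain-text FASTA file."""
    lengths = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Assumption: the contig ID is the first token of the header.
                tokens = line[1:].split()
                name = tokens[0] if tokens else ""
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def assembly_stats(path):
    """Summarize one FASTA file with the same fields as the lists above."""
    lengths = list(read_fasta_lengths(path).values())
    return {
        "number_of_contigs": len(lengths),
        "contig_N50": n50(lengths),
        "minimum_contig_length": min(lengths, default=0),
        "maximum_contig_length": max(lengths, default=0),
        "assembly_length": sum(lengths),
    }
```

Calling `assembly_stats("p_ctg.fa")` and `assembly_stats("all_p_ctg.fa")` would return one dictionary per file, which makes the two assemblies easy to compare side by side.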

I would like to include some of the circular contigs in consensus calling with arrow, and would appreciate any recommendations for this case.

In addition, I would like to ask about contigs that are completely absent from all_p_ctg.fa but present in p_ctg.fa. Is it correct to assume that they have all been incorporated into all_h_ctg.fa? If not, what is the filtering mechanism?
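The contigs in question can be listed by comparing header IDs between the two files. This is a rough sketch only: it assumes the contig ID is the first whitespace-delimited token of each FASTA header and that IDs are directly comparable between p_ctg.fa and all_p_ctg.fa, which should be checked against the actual output. The helper names `fasta_ids` and `missing_contigs` are hypothetical.

```python
def fasta_ids(path):
    """Collect the set of record IDs found in a plain-text FASTA file."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                # Assumption: the ID is the first token after '>'.
                tokens = line[1:].split()
                if tokens:
                    ids.add(tokens[0])
    return ids


def missing_contigs(before_path, after_path):
    """Return sorted IDs present in before_path but absent from after_path."""
    return sorted(fasta_ids(before_path) - fasta_ids(after_path))
```

For example, `missing_contigs("p_ctg.fa", "all_p_ctg.fa")` would return the IDs of the primary contigs that did not make it into the unzipped primary set.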

I have also found that many of the absent or empty contigs contain centromeric sequences. I would probably filter those out by matching them against sequences from RepBase, and keep only the unique sequences for consensus calling.

Best, Jin

sjin09 commented 7 years ago

I have also observed a number of contigs whose sequences changed significantly. I have uploaded a dotplot illustrating one example. The horizontal sequence is derived from FALCON, while the vertical sequence is derived from FALCON_UNZIP.

[Image: dotplot for contig 000479f]

In such cases, do you have recommendations for diagnosing the changes, determining why the sequence changed, and deciding whether the change is erroneous?

I assume that some of the changes come from haplotype differences, but I also observe a number of haplotigs and their respective pairs without any significant matches.

Best, Jin