PacificBiosciences / FALCON_unzip

Making diploid assembly becomes common practice for genomic study
BSD 3-Clause Clear License

Assembling telomeres #39

Open KSchnee opened 8 years ago

KSchnee commented 8 years ago

I tried to assemble a diploid Candida interspecies hybrid with FALCON-Unzip, under the assumption that alleles differ by ~5%. I have tried several assemblers, and so far the FALCON-Unzip assembly appears to be the most promising. What is bugging me is that only one of the two telomeres at the ends of each contig gets assembled. In this organism, telomeres appear to be 1-2 kbp long repetitive sequences with a very high G and T content (>70%). I have tried various parameter settings, but I am just not able to get both telomeres. Here is the config file for the best assembly so far:

job_type = local

input_fofn = input.fofn

input_type = raw

length_cutoff = 5000

length_cutoff_pr = 5000

jobqueue = production
sge_option_da =
sge_option_la =
sge_option_pda =
sge_option_pla =
sge_option_fc =
sge_option_cns =

pa_concurrent_jobs = 4
cns_concurrent_jobs = 4
ovlp_concurrent_jobs = 4

pa_HPCdaligner_option = -v -B4 -M20 -b -e.70 -l1000 -s1000
ovlp_HPCdaligner_option = -v -B4 -M20 -b -h60 -e.96 -l500 -s1000

pa_DBsplit_option = -x500 -s100
ovlp_DBsplit_option = -x500 -s100

falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 5 --max_n_read 200 --n_core 10

overlap_filtering_setting = --min_cov 4 --max_cov 1200 --max_diff 1200 --bestn 15 --n_core 10

In total, ~6 Gbp were sequenced, and I assume that the diploid genome size is ~32 Mbp. Thus, the average coverage is ~187.5X (6,000 Mbp / 32 Mbp). What am I doing wrong? How should I set the parameters in order to get both telomeres assembled?
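
For reference, the coverage figure above is simply total sequenced bases divided by the assumed genome size. A minimal sketch of that arithmetic follows; the function name `average_coverage` is made up for illustration, and both input values are the rough estimates quoted above, not measured numbers.

```python
def average_coverage(total_bases: float, genome_size: float) -> float:
    """Return mean sequencing depth as total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Approximate values from the post (assumptions, not measurements).
total_bases = 6e9            # ~6 Gbp of raw reads
diploid_genome_size = 32e6   # ~32 Mbp assumed diploid genome size

coverage = average_coverage(total_bases, diploid_genome_size)
print(f"{coverage:.1f}X")  # prints "187.5X"
```

Note that this is the depth over the combined diploid genome; it says nothing about how evenly reads cover the contig ends.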