PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

n50 prereads & GC content #583

Open hermeseduardo opened 6 years ago

hermeseduardo commented 6 years ago

Hi there, is it normal to lose a considerable amount of N50 length after the correction step? In my case it went from an N50 of 13000 to an N50 of 7000, with about 40X of coverage. Below are my pre_assembly_stats.json and fc_run.cfg. I was suspecting that the GC content (30%) may affect DALIGNER; any clue regarding this?

thanks

```json
{
  "genome_length": 550000000,
  "length_cutoff": 1000,
  "preassembled_bases": 13073338353,
  "preassembled_coverage": 23.77,
  "preassembled_esize": 8290.033,
  "preassembled_mean": 4874.457,
  "preassembled_n50": 7271,
  "preassembled_p95": 12989,
  "preassembled_reads": 2682009,
  "preassembled_seed_fragmentation": 1.443,
  "preassembled_seed_truncation": 3720.872,
  "preassembled_yield": 0.583,
  "raw_bases": 22478005278,
  "raw_coverage": 40.869,
  "raw_esize": 14585.894,
  "raw_mean": 9948.45,
  "raw_n50": 13241,
  "raw_p95": 22760,
  "raw_reads": 2259448,
  "seed_bases": 22442989421,
  "seed_coverage": 40.805,
  "seed_esize": 14607.423,
  "seed_mean": 10139.719,
  "seed_n50": 13254,
  "seed_p95": 22877,
  "seed_reads": 2213374
}
```
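As a sanity check, the headline ratios in the stats above can be derived directly from the raw counts; a minimal sketch (field names taken from the JSON dump, not from FALCON's own reporting code):

```python
# Re-derive the summary ratios from pre_assembly_stats.json values quoted above.
stats = {
    "seed_bases": 22442989421,
    "preassembled_bases": 13073338353,
    "raw_n50": 13241,
    "preassembled_n50": 7271,
    "genome_length": 550000000,
}

# Yield: fraction of seed bases that survive correction.
yield_frac = stats["preassembled_bases"] / stats["seed_bases"]

# Coverage of the genome by corrected preads.
pread_cov = stats["preassembled_bases"] / stats["genome_length"]

# Fraction of raw N50 retained after correction.
n50_ratio = stats["preassembled_n50"] / stats["raw_n50"]

print(f"yield={yield_frac:.3f} coverage={pread_cov:.2f}x n50_ratio={n50_ratio:.2f}")
# yield=0.583 coverage=23.77x n50_ratio=0.55
```

So only about 58% of seed bases make it through correction, leaving roughly 24x of preads, and the N50 drops to about 55% of the raw value.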

```ini
[General]
input_fofn = input.fofn
input_type = raw
length_cutoff = 1000
genome_size = 550000000
length_cutoff_pr = 10000

sge_option_da = --ntasks 1 --nodes 1 --cpus-per-task 8 --mem 30gb --time 5:30:00
sge_option_la = --ntasks 1 --nodes 1 --cpus-per-task 4 --mem 32gb --time 4:56:00
sge_option_cns = --ntasks 1 --nodes 1 --cpus-per-task 5 --mem 32gb --time 3:00:00
sge_option_pda = --ntasks 1 --nodes 1 --cpus-per-task 8 --mem 30gb --time 3:30:00
sge_option_pla = --ntasks 1 --nodes 1 --cpus-per-task 4 --mem 35gb --time 3:58:00
sge_option_fc = --ntasks 1 --nodes 1 --cpus-per-task 8 --mem 20gb --time 59:00

da_concurrent_jobs = 396
la_concurrent_jobs = 396
cns_concurrent_jobs = 396
pda_concurrent_jobs = 396
pla_concurrent_jobs = 396

pa_HPCdaligner_option = -v -B70 -t16 -e.70 -l1000 -s1000
ovlp_HPCdaligner_option = -v -B70 -t32 -h60 -e.96 -l500 -s1000

pa_DBsplit_option = -x500 -s120
ovlp_DBsplit_option = -x500 -s120

falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 2 --max_n_read 200

overlap_filtering_setting = --max_diff 100 --max_cov 200 --min_cov 1 --bestn 1

skip_checks = true
```

gconcepcion commented 6 years ago

Hi,

Yes. Though it depends on the nature (read length/quality) and quantity of your input data, seeing a decrease in N50 from raw reads to corrected preads is typical, especially in a coverage-limited situation. Long reads are often broken during the correction process in low-coverage regions, resulting in an overall decrease in N50.
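To illustrate why breaking long reads pulls the N50 down, here is a small sketch with a toy read set (the lengths are invented for illustration; the `n50` helper is the standard definition, not FALCON code):

```python
# N50: the length L such that reads of length >= L contain at least
# half of all bases in the set.
def n50(lengths):
    total = sum(lengths)
    acc = 0
    for length in sorted(lengths, reverse=True):
        acc += length
        if acc >= total / 2:
            return length

raw = [20000, 15000, 12000, 8000, 5000]  # toy raw seed reads

# In a low-coverage region, correction may split a long read where it
# lacks supporting overlaps, e.g. the 20 kb read emerges as two pieces:
corrected = [11000, 8000, 15000, 12000, 8000, 5000]

print(n50(raw), n50(corrected))
# 15000 11000
```

Nearly all the bases survive, yet the N50 drops, because the longest reads are exactly the ones most likely to span a poorly covered region and get split.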

hermeseduardo commented 6 years ago

OK, thanks. Do you know if there is anything that can be done to help? E.g., reduce -e.70 to -e.60, or might that be bad for the final assembly? I am also currently trying the -b option for daligner, which apparently helps when there is compositional bias:

pa_HPCdaligner_option = -vb ..........
ovlp_HPCdaligner_option = -vb ........