c-zhou / yahs

Yet another Hi-C scaffolding tool
MIT License
131 stars 19 forks source link

Question about the mapping process for yahs #98

Open ericgonzalezs opened 2 weeks ago

ericgonzalezs commented 2 weeks ago

I have alignments from juicer, and I would like to use them with Yahs.

I woudl like to know your opinion regarding

1.- Using the juicer mapping output for Yahs 2.- Is it better to use the sam file, order it, convert it to bam and mark duplicates or 3.- Is it better to use the merged_nodups.txt file to generate the bed file for Yahs?

Many thanks,

Eric

ericgonzalezs commented 1 week ago

In the end I ran it by converting merged_nodups.txt to a bed file that looks like this:

h1tg000001l 19256 19406 LH00251_107:7:1201:47093:21339/1 5 h1tg000001l 14074081 14074231 LH00251_107:7:1201:47093:21339/2 3 h1tg000001l 20225 20375 LH00251_107:7:1316:38313:21325/1 22 h1tg000001l 766119 766269 LH00251_107:7:1316:38313:21325/2 8 h1tg000001l 21193 21323 LH00251_107:7:2216:16634:25289/1 34 h1tg000001l 21341 21472 LH00251_107:7:2216:16634:25289/2 32 h1tg000001l 20933 21083 LH00251_107:7:2413:3670:8424/1 60 h1tg000001l 35735 35885 LH00251_107:7:2413:3670:8424/2 19 h1tg000001l 21192 21323 LH00251_107:6:2219:38944:14153/1 21 h1tg000001l 52909 53059 LH00251_107:6:2219:38944:14153/2 22

The program is running, but I am just a little worried about this in the output file:

[I::dump_links_from_bed_file] 403 million records processed, 201500000 read pairs [I::dump_links_from_bed_file] 404 million records processed, 202000000 read pairs [I::dump_links_from_bed_file] 405 million records processed, 202500000 read pairs [I::dump_links_from_bed_file] 406 million records processed, 203000000 read pairs [I::dump_links_from_bed_file] position compression n = 615269238, m = 535425155, max_m = 1073741816 [I::dump_links_from_bed_file] dumped 203371445 read pairs from 406742890 records: 94802654 intra links + 108568791 inter links [I::calc_avg_cov] sequence coverage stats: [I::calc_avg_cov] sequence bases: 3643300973 [I::calc_avg_cov] read bases: 52503593824 [I::calc_avg_cov] q drop: 0.100 [I::calc_avg_cov] average read coverage: 11.471 [I::run_yahs] RAM total: 251.353GB [I::run_yahs] RAM limit: 180.998GB [I::contig_error_break] dist threshold for contig error break: 1000000 [I::contig_error_break] performed 3 round assembly error correction. Made 27 breaks [I::print_asm_stats] assembly stats: [I::print_asm_stats] N50: 7793971 (n = 139) [I::print_asm_stats] N90: 2184371 (n = 464) [I::print_asm_stats] N100: 1000 (n = 1913) [I::run_yahs] scaffolding round 1 resolution = 10000 [I::run_scaffolding] starting norm estimation... [I::run_scaffolding] starting link estimation... [I::inter_link_norms] using noise level 0.000 [I::inter_link_norms] average link count: 3284.521 12027925.000 0.000 [I::run_scaffolding] starting scaffolding graph contruction... [I::run_yahs] scaffolding round 1 done [I::print_asm_stats] assembly stats: [I::print_asm_stats] N50: 38391129 (n = 30) [I::print_asm_stats] N90: 11170798 (n = 99) [I::run_yahs] scaffolding round 2 resolution = 20000 [I::run_scaffolding] starting norm estimation... [I::run_scaffolding] starting link estimation... [I::inter_link_norms] using noise level 0.000 [I::inter_link_norms] average link count: 1394.640 2879060.000 0.000 [I::run_scaffolding] starting scaffolding graph contruction... [I::run_yahs] scaffolding round 2 done [I::print_asm_stats] assembly stats: [I::print_asm_stats] N50: 109261304 (n = 13) [I::print_asm_stats] N90: 29910359 (n = 35) [I::run_yahs] scaffolding round 3 resolution = 50000 [I::run_scaffolding] starting norm estimation... [I::run_scaffolding] starting link estimation... [I::inter_link_norms] using noise level 0.000 [I::inter_link_norms] average link count: 4587.375 1079574.000 0.004 [I::run_scaffolding] starting scaffolding graph contruction... [I::run_yahs] scaffolding round 3 done [I::print_asm_stats] assembly stats: [I::print_asm_stats] N50: 191142718 (n = 9) [I::print_asm_stats] N90: 85369809 (n = 19) [I::run_yahs] scaffolding round 4 resolution = 100000 [I::run_scaffolding] starting norm estimation... [I::run_scaffolding] starting link estimation... [I::inter_link_norms] using noise level 0.007 [I::inter_link_norms] average link count: 9593.958 777589.000 0.012 [I::run_scaffolding] starting scaffolding graph contruction... [I::run_yahs] scaffolding round 4 done [I::print_asm_stats] assembly stats: [I::print_asm_stats] N50: 208511985 (n = 9) [I::print_asm_stats] N90: 182497175 (n = 16) [I::run_yahs] scaffolding round 5 resolution = 200000 [I::run_scaffolding] starting norm estimation... [I::run_scaffolding] starting link estimation... [I::inter_link_norms] using noise level 0.038 [I::inter_link_norms] average link count: 13559.152 867662.000 0.016 [I::run_scaffolding] starting scaffolding graph contruction... [I::run_yahs] scaffolding round 5 done [I::print_asm_stats] assembly stats: [I::print_asm_stats] N50: 208511985 (n = 9) [I::print_asm_stats] N90: 182497175 (n = 16) [I::run_yahs] scaffolding round 6 resolution = 500000 [I::run_scaffolding] starting norm estimation... [I::run_scaffolding] starting link estimation... [I::inter_link_norms] using noise level 0.192 [I::inter_link_norms] average link count: 12112.385 785121.000 0.015 [I::run_scaffolding] starting scaffolding graph contruction... [I::run_yahs] scaffolding round 6 done [I::print_asm_stats] assembly stats: [I::print_asm_stats] N50: 208511985 (n = 9) [I::print_asm_stats] N90: 186549744 (n = 16) [I::run_yahs] scaffolding round 7 resolution = 1000000 [I::run_scaffolding] starting norm estimation... [I::run_scaffolding] starting link estimation... [I::inter_link_norms] using noise level 0.731 [I::inter_link_norms] using new radius 88 (100) as noise ratio exceeds threshold 0.100 [I::inter_link_norms] average link count: 9193.037 523549.000 0.018 [I::run_scaffolding] starting scaffolding graph contruction... [I::run_yahs] scaffolding round 7 done [I::print_asm_stats] assembly stats: [I::print_asm_stats] N50: 208511985 (n = 9) [I::print_asm_stats] N90: 186549744 (n = 16) [I::run_yahs] scaffolding round 8 resolution = 2000000 [I::run_scaffolding] starting norm estimation... [I::run_scaffolding] starting link estimation... [I::inter_link_norms] using noise level 4.732 [I::inter_link_norms] using new radius 34 (74) as noise ratio exceeds threshold 0.100 [I::inter_link_norms] average link count: 828.656 76857.000 0.011 [I::run_scaffolding] starting scaffolding graph contruction... [I::run_yahs] scaffolding round 8 done [I::print_asm_stats] assembly stats: [I::print_asm_stats] N50: 208511985 (n = 9) [I::print_asm_stats] N90: 186549744 (n = 16) [I::run_yahs] scaffolding round 9 resolution = 5000000 [I::run_scaffolding] starting norm estimation... [I::run_scaffolding] starting link estimation... [I::inter_link_norms] using noise level 29.365 [I::inter_link_norms] using new radius 14 (29) as noise ratio exceeds threshold 0.100 [I::inter_link_norms] average link count: 157.923 12376.000 0.013 [I::run_scaffolding] starting scaffolding graph contruction... [I::run_yahs] scaffolding round 9 done [I::print_asm_stats] assembly stats: [I::print_asm_stats] N50: 208511985 (n = 9) [I::print_asm_stats] N90: 186549744 (n = 16) [I::run_yahs] scaffolding round 10 resolution = 10000000 [I::run_scaffolding] starting norm estimation... [I::run_scaffolding] starting link estimation... [I::inter_link_norms] using noise level 116.716 [I::inter_link_norms] using new radius 7 (14) as noise ratio exceeds threshold 0.100 [I::inter_link_norms] average link count: 58.049 2856.000 0.020 [I::run_scaffolding] starting scaffolding graph contruction... [I::run_yahs] scaffolding round 10 done [I::print_asm_stats] assembly stats: [I::print_asm_stats] N50: 208511985 (n = 9) [I::print_asm_stats] N90: 186549744 (n = 16) [I::run_yahs] scaffolding round 11 resolution = 20000000 [I::run_scaffolding] starting norm estimation... [E::calc_norms] no enough bands (9) for norm calculation, try a higher resolution [W::run_scaffolding] No enough bands for norm calculation... End of scaffolding round. [I::run_yahs] scaffolding round 11 done [I::print_asm_stats] assembly stats: [I::print_asm_stats] N50: 208511985 (n = 9) [I::print_asm_stats] N90: 186549744 (n = 16) [I::run_yahs] assembly N50 (208511985) too small. [I::run_yahs] do an extra round scaffolding with resolution = 20000000 [I::run_yahs] scaffolding round 12 resolution = 20000000 [I::run_scaffolding] starting norm estimation... [E::calc_norms] no enough bands (9) for norm calculation, try a higher resolution [W::run_scaffolding] No enough bands for norm calculation... End of scaffolding round. [I::run_yahs] scaffolding round 12 done [I::print_asm_stats] assembly stats: [I::print_asm_stats] N50: 208511985 (n = 9) [I::print_asm_stats] N90: 186549744 (n = 16) [I::run_yahs] assembly N50 (208511985) too small. End of scaffolding. [I::run_yahs] consider running with increased memory limit if there was a memory issue. [I::main] writing FASTA file for scaffolds [I::write_fasta_file_from_agp] Number sequences: 1132 [I::write_fasta_file_from_agp] Number bases: 3643379073 [I::print_asm_stats] assembly stats: [I::print_asm_stats] N50: 208511985 (n = 9) [I::print_asm_stats] N90: 186549744 (n = 16) [I::print_asm_stats] N100: 1000 (n = 1132) [I::main] Version: 1.2.2 [I::main] CMD: /home/egonza02/scratch/SOFTWARE/YAHS/yahs/yahs Harg2202_HIC_oldhifiasm.hic.hap1.fasta merged_nodups_for_yahs.bed [I::main] Real time: 585.078 sec; CPU: 577.369 sec; Peak RSS: 7.221 GB

I am just worried about two things: 1.- Half of the reads, that were previously filtrated by MQ, were discarded. Is this normal? 2.- N50 (208511985) too small. End of scaffolding.

In the end, when I check the output scaffolded fasta file, it looks good, with the number of expected chromosomes and each have the expected size. But I am not sure If I have to consider this as a failed run because of the two things I mentioned above.

Many thanks,

Eric

c-zhou commented 1 week ago

Hi Eric,

Both two things are normal. The fraction of reads dropped by MQ filtering varies depending on the repetitiveness of your genome. Dropping around half the reads is reasonable. If you are interested, you can change the -q option to 1 and see if it makes a noticeable difference. There is no need to worry about the "End of scaffolding" warning. By default, YaHS attempts to use window sizes up to hundreds of megabases to handle chromosomes that are gigabases in length. Since your genome is smaller, the scaffolding process ends earlier.

Best, Chenxi