c-zhou / yahs

Yet another Hi-C scaffolding tool
MIT License

Reduced assembly size after scaffolding #97

Closed: eaundhe closed this issue 3 weeks ago

eaundhe commented 3 weeks ago

After scaffolding with YaHS, the total assembly size (1915795212 bp) is much smaller than that of the original hifiasm assembly (3196025640 bp), and the BUSCO score is much worse (70.2% vs. 92.5%).

The Hi-C reads are from the same individual and were mapped using chromap with SAM output, which was converted to BAM and sorted with samtools using the "-n" option. I do not get any error messages when running YaHS.

I would be very grateful for any help!

Here is the log from the end of the chromap run onwards:

Number of reads: 2804902694.
Number of mapped reads: 1226710730.
Number of uniquely mapped reads: 563345138.
Number of reads have multi-mappings: 663365592.
Number of candidates: 133480720453.
Number of mappings: 1226710730.
Number of uni-mappings: 563345138.
Number of multi-mappings: 663365592.
Sorted, deduped and outputed mappings in 13530.48s.
No. uni-mappings: 634439832, No. multi-mappings: 214312814, total: 848752646.
Number of output mappings (passed filters): 269096852
Total time: 24856.79s.
finished writing to SAM file, loading samtools to convert SAM to BAM
[bam_sort_core] merging from 15 files and 1 in-memory blocks...
[I::main] dump hic links (BAM) to binary file yahs.out.bin
[I::dump_links_from_bam_file] 1 million records processed, 494168 read pairs
(...)
[I::dump_links_from_bam_file] 269 million records processed, 132490317 read pairs
[I::dump_links_from_bam_file] position compression n = 439765649, m = 315055700, max_m = 536870908
[I::dump_links_from_bam_file] dumped 132537611 read pairs from 269096852 records: 65428697 intra links + 67108914 inter links
[I::calc_avg_cov] sequence coverage stats:
[I::calc_avg_cov] sequence bases: 1915072612
[I::calc_avg_cov] read bases: 32711691886
[I::calc_avg_cov] q drop: 0.100
[I::calc_avg_cov] average read coverage: 13.607
[I::run_yahs] RAM total: 3023.017GB
[I::run_yahs] RAM limit: 380.000GB
[I::contig_error_break] dist threshold for contig error break: 1000000
[I::contig_error_break] performed 4 round assembly error correction. Made 1074 breaks
[I::print_asm_stats] assembly stats:
[I::print_asm_stats] N50: 251734 (n = 2235)
[I::print_asm_stats] N90: 81136 (n = 7430)
[I::print_asm_stats] N100: 1000 (n = 11536)
[I::run_yahs] scaffolding round 1 resolution = 10000
[I::run_scaffolding] starting norm estimation...
[I::run_scaffolding] starting link estimation...
[I::inter_link_norms] using noise level 0.000
[I::inter_link_norms] average link count: 109201.602 336993082.000 0.000
[I::run_scaffolding] starting scaffolding graph contruction...
[I::run_yahs] scaffolding round 1 done
[I::print_asm_stats] assembly stats:
[I::print_asm_stats] N50: 442158 (n = 1263)
[I::print_asm_stats] N90: 126000 (n = 4379)
[I::run_yahs] scaffolding round 2 resolution = 20000
[I::run_scaffolding] starting norm estimation...
[I::run_scaffolding] starting link estimation...
[I::inter_link_norms] using noise level 0.000
[I::inter_link_norms] average link count: 82929.118 229807181.000 0.000
[I::run_scaffolding] starting scaffolding graph contruction...
[I::run_yahs] scaffolding round 2 done
[I::print_asm_stats] assembly stats:
[I::print_asm_stats] N50: 658942 (n = 860)
[I::print_asm_stats] N90: 169490 (n = 3014)
[I::run_yahs] scaffolding round 3 resolution = 50000
[I::run_scaffolding] starting norm estimation...
[I::run_scaffolding] starting link estimation...
[I::inter_link_norms] using noise level 0.002
[I::inter_link_norms] average link count: 127785.054 116788044.000 0.001
[I::run_scaffolding] starting scaffolding graph contruction...
[I::run_yahs] scaffolding round 3 done
[I::print_asm_stats] assembly stats:
[I::print_asm_stats] N50: 1049209 (n = 536)
[I::print_asm_stats] N90: 211129 (n = 2079)
[I::run_yahs] scaffolding round 4 resolution = 100000
[I::run_scaffolding] starting norm estimation...
[I::run_scaffolding] starting link estimation...
[I::inter_link_norms] using noise level 0.021
[I::inter_link_norms] average link count: 91774.279 35565816.000 0.003
[I::run_scaffolding] starting scaffolding graph contruction...
[I::run_yahs] scaffolding round 4 done
[I::print_asm_stats] assembly stats:
[I::print_asm_stats] N50: 1565375 (n = 337)
[I::print_asm_stats] N90: 211702 (n = 1579)
[I::run_yahs] scaffolding round 5 resolution = 200000
[I::run_yahs] assembly N50 (1565375) too small. End of scaffolding.
[I::main] writing FASTA file for scaffolds
[I::write_fasta_file_from_agp] Number sequences: 4310
[I::write_fasta_file_from_agp] Number bases: 1915795212
[I::print_asm_stats] assembly stats:
[I::print_asm_stats] N50: 1565375 (n = 337)
[I::print_asm_stats] N90: 211702 (n = 1579)
[I::print_asm_stats] N100: 1000 (n = 4310)
[I::main] Version: 1.2
[I::main] CMD: yahs ref.fa ref.fa.chromap.aln.bam

c-zhou commented 3 weeks ago

Hello @eaundhe,

That is weird, as YaHS should not drop any sequence. Can you check whether the index file for ref.fa (i.e., ref.fa.fai) is up to date? You can run something like awk '{s+=$2}END{print NR,s}' ref.fa.fai to print the number of sequences and their total size.
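For readers who prefer a script, here is a minimal Python sketch of the same index check, plus a cross-check against the FASTA itself. The function names fai_summary and fasta_summary are hypothetical and not part of YaHS or samtools. The sketch assumes the standard tab-separated .fai layout, with the sequence length in column 2.

```python
def fai_summary(lines):
    """Return (number of sequences, total bases) from the lines of a .fai index.

    Assumes the usual tab-separated .fai layout: name, length, offset, ...
    """
    count = 0
    total = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate a trailing blank line
        fields = line.split("\t")
        total += int(fields[1])  # column 2 holds the sequence length
        count += 1
    return count, total


def fasta_summary(lines):
    """Return (number of sequences, total bases) counted from FASTA lines."""
    count = 0
    total = 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            count += 1  # each header line starts a new sequence
        else:
            total += len(line)  # sequence line: count its residues
    return count, total
```

Passing open file handles for ref.fa.fai and ref.fa to these two functions should give identical pairs. A mismatch would suggest the index is stale or belongs to a different file.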

Best, Chenxi

c-zhou commented 3 weeks ago

Also, the log reports the total number of bases in the input as 1915072612 (see the line [I::calc_avg_cov] sequence bases: 1915072612). You should probably also check that ref.fa is the correct file.

One more thing about scaffolding: the log says "performed 4 round assembly error correction. Made 1074 breaks", which is a bit concerning, as that seems like far too many breaks. We discussed this in https://github.com/c-zhou/yahs/issues/53#issuecomment-2429022644.
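The two log checks in this comment can be sketched in Python as follows. The regular expressions are based only on the log lines quoted in this thread (YaHS version 1.2), so other versions may format them differently. The names yahs_log_stats and size_shortfall are hypothetical helpers, not part of YaHS.

```python
import re


def yahs_log_stats(log_text):
    """Extract the input size and the number of contig breaks from a YaHS log.

    The patterns follow the log lines quoted in this thread; keys are
    omitted when the corresponding line is not found.
    """
    stats = {}
    match = re.search(r"\[I::calc_avg_cov\] sequence bases: (\d+)", log_text)
    if match:
        stats["sequence_bases"] = int(match.group(1))
    match = re.search(r"Made (\d+) breaks", log_text)
    if match:
        stats["breaks"] = int(match.group(1))
    return stats


def size_shortfall(expected_bases, seen_bases):
    """Return how many bases of the expected assembly the tool did not see."""
    return expected_bases - seen_bases
```

With the numbers reported in this issue, size_shortfall(3196025640, 1915072612) gives 1280953028 bp that never reached YaHS. That points to the input file rather than to the scaffolding step.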

Chenxi

eaundhe commented 3 weeks ago

You're totally right. That's pretty embarrassing but thanks for picking that up! Also, thanks for commenting on the number of breaks, which was my other, less pressing question. I will have a go with the suggestions in the other thread and let you know if I still have issues.
