Open andreaschavez opened 1 year ago
Hello @andreaschavez,
It seems like YaHS/SALSA2 made too many contig breaks. The first thing you could try is to run YaHS with the option --no-contig-ec, which will suppress contig breaks. But with this option, you will likely see a lot of oddness in your HiC maps, either for contigs (you can check the big ones) or for scaffolds after scaffolding.
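A minimal sketch of such a rerun, written as a small Python helper that only assembles the command line. The file names `contigs.fa` and `hic_to_contigs.bam`, the output prefix `yahs_noec`, and the helper name `build_yahs_command` are hypothetical placeholders, not taken from this thread; it also assumes the contig FASTA has already been indexed with `samtools faidx`.

```python
import shlex


def build_yahs_command(contigs_fa, hic_bam, out_prefix, no_contig_ec=True):
    """Assemble a YaHS command line as an argument list (nothing is executed)."""
    cmd = ["yahs", "-o", out_prefix]
    if no_contig_ec:
        # Skip the contig error-correction step, i.e. suppress contig breaks.
        cmd.append("--no-contig-ec")
    # Positional arguments: contig FASTA first, then the Hi-C alignments.
    cmd += [contigs_fa, hic_bam]
    return cmd


# Hypothetical file names; replace them with your own paths.
cmd = build_yahs_command("contigs.fa", "hic_to_contigs.bam", "yahs_noec")
print(shlex.join(cmd))
# To actually launch it: import subprocess; subprocess.run(cmd, check=True)
```

Comparing the scaffold counts and N50 of this run against the default run should show how much of the fragmentation comes from the contig-breaking step.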
I am not sure about the problem. Most likely, your HiC data quality is poor. Have you checked the HiC mapping results, such as the mapping rate, mapping quality etc.? Also, is it possible the HiC data was from a different sample or species?
Best, Chenxi
Hi Chenxi: I will give the --no-contig-ec option a try.
According to the stats file generated with the Arima pipeline, I believe our Hi-C data is pretty good, with 95% of the intra data being >20kb "long-cis interactions." The Hi-C data were from the same individual sample as the HiFi data. I'll report back. Thanks. Andreas
| Arima Stats | Reads | % reads | Description |
| -- | -- | -- | -- |
| All | 312,571,918 | | |
| All inter "trans interactions" | 32,156,505 | 10% | inter/all |
| All intra "short and long cis interactions" | 280,415,413 | 90% | intra/all |
| All intra 1kb | 10,214,550 | | |
| All intra 10kb | 1,875,255 | | |
| All intra 15kb | 948,391 | | |
| All intra 20kb "short-cis interactions" | 591,696 | 5% | all <20kb/intra total |
| All intra >20kb "long-cis interactions" | 266,785,521 | 95% | all >20kb/intra total |
| **All intra "short and long cis interactions"** | **280,415,413** | **85%** | **all >20kb/all** |
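As a sanity check, the percentages can be recomputed from the read counts above. This is a minimal sketch that assumes the 1kb, 10kb, 15kb and 20kb rows are non-overlapping bins and that the 5% figure refers to their sum (all intra pairs under 20kb); the variable names are illustrative only.

```python
# Read counts copied from the Arima stats above.
all_reads = 312_571_918
inter = 32_156_505
intra = 280_415_413

# Assumption: the four short-range rows are non-overlapping distance bins.
short_bins = {"1kb": 10_214_550, "10kb": 1_875_255, "15kb": 948_391, "20kb": 591_696}
short_cis = sum(short_bins.values())  # all intra pairs under 20kb
long_cis = intra - short_cis          # all intra pairs over 20kb


def pct(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


print(f"inter/all:       {pct(inter, all_reads):.1f}%")
print(f"intra/all:       {pct(intra, all_reads):.1f}%")
print(f"<20kb/intra:     {pct(short_cis, intra):.1f}%")
print(f">20kb/intra:     {pct(long_cis, intra):.1f}%")
print(f">20kb/all reads: {pct(long_cis, all_reads):.1f}%")
```

Under that assumption the long-cis count comes out to the 266,785,521 reported in the table, and the 85% in the final row corresponds to long-cis pairs over all reads.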
Hi Zhou: After running YaHS, I have gotten worse assembly metrics than with the hifiasm-only assembly. The number of genome scaffolds has increased from 1,200 to 9,500, the scaffold N50 has decreased from 32 Mb to 9 Mb, and the maximum scaffold length has declined from 200 Mb to 100 Mb. My assembly metrics also became worse when using SALSA2 and when running hifiasm with Hi-C integration.
I am using Hi-C data from Dovetail Genomics that was generated using their Chicago library approach, and I used the Arima pipeline to generate BAM files. I have wondered if there is an issue with my Hi-C data, but my stats file from the Arima pipeline suggests the Hi-C data is good. We have 30X coverage with HiFi data.
We have a mammal species with a challenging genome to assemble because of its relatively large genome size (6 Gb), high repeat content (~50%), and high levels of heterozygosity. We are studying a diploid species with 24 chromosomes.
Attachment: slurm-23990094.out.txt
Do you have thoughts on why our HiFi assembly is getting worse when we scaffold with Hi-C data?
Thank you in advance. cheers, Andreas