c-zhou / yahs

Yet another Hi-C scaffolding tool
MIT License

Assembly metrics have gotten worse than hifiasm-only assembly #53

Open andreaschavez opened 1 year ago

andreaschavez commented 1 year ago

Hi Zhou: After running yahs, I have gotten worse assembly metrics than with the hifiasm-only assembly. The number of genome scaffolds has increased from 1,200 to 9,500, the scaffold N50s have decreased from 32MB to 9MB, and the max scaffold length has declined from 200 MB to 100 MB. My assembly metrics also became worse when using SALSA2 and when running HiFiasm with Hi-C integration.

I am using Hi-C data from Dovetail Genomics that was developed using their Chicago Library approach, and I used the Arima pipeline to generate BAM files. I have wondered if there is an issue with my Hi-C data, but my stats file from the Arima pipeline suggests the Hi-C data is good. We have 30X coverage with HiFi data.

We have a mammal species with a challenging genome to assemble because of its relatively large genome size (6GB), high repeat content (~50%), and high levels of heterozygosity. We are studying a diploid species with 24 chromosomes.

Attachment: slurm-23990094.out.txt (https://github.com/c-zhou/yahs/files/11029482/slurm-23990094.out.txt)

Do you have thoughts on why our HiFi assembly is getting worse when we scaffold with Hi-C data?

Thank you in advance. cheers, Andreas

c-zhou commented 1 year ago

Hello @andreaschavez,

It seems like YaHS/SALSA2 made too many contig breaks. The first thing you could try is to run YaHS with the option --no-contig-ec which will suppress contig breaks. But with this option, you will likely see a lot of oddness in your HiC maps - either for contigs (you can check those big ones) or for scaffolds after scaffolding.

I am not sure about the problem. Most likely, your HiC data quality is poor. Have you checked the HiC mapping results, such as the mapping rate, mapping quality etc.? Also, is it possible the HiC data was from a different sample or species?

Best, Chenxi

andreaschavez commented 1 year ago

Hi Chenxi: I will give the --no-contig-ec option a try.

According to the stats file generated with the Arima pipeline, I believe our Hi-C data is pretty good, with 95% of the intra data being >20kb "long-cis interactions." The Hi-C data were from the same individual sample as the HiFi data. I'll report back. Thanks. Andreas

| Arima Stats | Reads | % reads | description |
| -- | -- | -- | -- |
| All | 312,571,918 | | |
| All inter "trans interactions" | 32,156,505 | 10% | inter/all |
| All intra "short and long cis interactions" | 280,415,413 | 90% | intra/all |
| | | | |
| All intra 1kb | 10,214,550 | | |
| All intra 10kb | 1,875,255 | | |
| All intra 15kb | 948,391 | | |
| All intra 20kb "short-cis interactions" | 591,696 | 5% | all <20kb/intra total |
| All intra >20kb "long-cis interactions" | 266,785,521 | 95% | all >20kb/intra total |
| **All intra "short and long cis interactions"** | **280,415,413** | **85%** | **all >20kb/all** |
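The percentages in the stats table can be re-derived from the read counts. The sketch below assumes the four rows labelled 1kb, 10kb, 15kb and 20kb are disjoint distance bins, which the counts support because they sum exactly to the difference between the intra total and the >20kb row. The variable names are illustrative and not part of the Arima pipeline.

```python
# Read counts copied from the Arima stats table in this comment.
all_pairs = 312_571_918
inter = 32_156_505
intra = 280_415_413
# Assumption: these four rows are disjoint distance bins below 20 kb.
intra_bins_under_20kb = [10_214_550, 1_875_255, 948_391, 591_696]
intra_over_20kb = 266_785_521

short_cis = sum(intra_bins_under_20kb)


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Consistency checks: the categories partition their totals.
assert inter + intra == all_pairs
assert short_cis + intra_over_20kb == intra

shares = {
    "inter / all": pct(inter, all_pairs),
    "intra / all": pct(intra, all_pairs),
    "<20kb / intra": pct(short_cis, intra),
    ">20kb / intra": pct(intra_over_20kb, intra),
    ">20kb / all": pct(intra_over_20kb, all_pairs),
}

for label, value in shares.items():
    print(f"{label:>14}: {value}%")
```

Read this way, the 5% figure describes all intra pairs under 20 kb combined rather than the 20kb row alone, and the 85% in the last row is the share of long-cis pairs among all pairs.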

afiyachida commented 7 months ago

Hi Andreas,

I am facing a similar problem and was wondering whether you were able to solve it. I have added --no-contig-ec to my existing command but still see the same problem: the stats are better for the contig-level assembly than for the scaffolds.

Thanks, Afiya

richarddurbin commented 7 months ago

Does the Hi-C come from the same individual as the PacBio?

afiyachida commented 7 months ago

Hello,

Yes, the HiC is from the same individual. Either the metrics are better at the contig level or, in certain cases, they stay the same as for the contig-level assembly and do not improve after scaffolding.

Thanks, Afiya

gunjanpandey commented 1 month ago

I have the same problem. YaHS is breaking full-length chromosomes into smaller pieces and --no-scaffold-ec seems to have no effect.

Please provide a fix.

c-zhou commented 1 month ago

Hi @gunjanpandey,

There are several possibilities. It could simply be because your HiC data is not good enough, or because there are a lot of misassemblies. Instead of '--no-scaffold-ec', you need '--no-contig-ec', which asks YaHS not to make any contig error corrections before scaffolding. While this will prevent YaHS from producing a worse assembly, it is very likely you will not see much improvement in assembly contiguity after scaffolding.

A more likely reason is that your genome is very repetitive. YaHS uses a mapping quality threshold of 10 for alignment filtering by default. With this filtering, many regions of your genome will have no HiC links. You can make a HiC plot for a quick check. If that is the case, you could try running YaHS with -q 0, which forces YaHS to use all HiC links irrespective of mapping quality - this is kind of risky though.

Best, Chenxi
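The two suggestions in this reply (`--no-contig-ec` and `-q 0`) can be put together as a single command line. The sketch below only builds the argument list and does not run anything. `build_yahs_command` is a hypothetical helper, the file names are placeholders, and the positional order (contigs FASTA first, then the alignment file) is an assumption to check against `yahs --help` for your version.

```python
import shlex


def build_yahs_command(contigs_fa, hic_bam, no_contig_ec=False, min_mapq=None):
    """Assemble a YaHS argument list (hypothetical helper; nothing is executed)."""
    cmd = ["yahs"]
    if no_contig_ec:
        # Skip contig error correction, i.e. do not break input contigs.
        cmd.append("--no-contig-ec")
    if min_mapq is not None:
        # Override the default mapping quality threshold (10 per this thread).
        cmd += ["-q", str(min_mapq)]
    # Assumed positional order: contigs first, then the Hi-C alignments.
    cmd += [contigs_fa, hic_bam]
    return cmd


if __name__ == "__main__":
    cmd = build_yahs_command("contigs.fa", "hic.bam", no_contig_ec=True, min_mapq=0)
    print(shlex.join(cmd))
```

Note that the flag discussed here is `--no-contig-ec`, with two leading dashes, and that `--no-scaffold-ec` is a different option that does not stop contigs from being broken.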

c-zhou commented 1 month ago

Also, you need to make sure the low mapping quality reads were not filtered out of your input alignment file if you want to run YaHS with -q 0. You can use the BAM file input, which usually includes all alignments.

Chenxi
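One way to check both points (how many alignments a threshold of 10 would discard, and whether low-MAPQ reads are still present in the input) is to tally the MAPQ column of the alignments. The sketch below is not YaHS's own filtering code. It works on SAM-formatted text lines, for example the output of `samtools view`, and assumes the threshold is applied as "MAPQ greater than or equal to the cutoff".

```python
def mapq_summary(sam_lines, min_mapq=10):
    """Return (mapped, passing) counts for SAM-formatted text lines.

    `mapped` counts records without the unmapped flag (0x4) and `passing`
    counts those whose MAPQ (column 5) is at least `min_mapq`.
    """
    mapped = passing = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 0x4:
            continue  # skip unmapped reads
        mapped += 1
        if int(fields[4]) >= min_mapq:
            passing += 1
    return mapped, passing


if __name__ == "__main__":
    # Synthetic records for illustration only (not real data).
    example = [
        "@SQ\tSN:ctg1\tLN:1000",
        "r1\t65\tctg1\t100\t60\t50M\tctg2\t500\t0\t*\t*",
        "r2\t65\tctg1\t200\t0\t50M\tctg2\t600\t0\t*\t*",
        "r3\t129\tctg2\t300\t10\t50M\tctg1\t700\t0\t*\t*",
        "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    ]
    mapped, passing = mapq_summary(example)
    print(f"{passing} of {mapped} mapped records have MAPQ >= 10")
```

If no mapped record falls below the cutoff, the low-quality alignments were probably removed upstream, and running with -q 0 would not add any links.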

Asrix commented 2 weeks ago

Hi,

I also have a similar problem of getting way more scaffolds (2270) than I had original contigs (202). I did check my indexed genome and it has 202 contigs. My yahs log says that it made 630 breaks. I will try running it with --no-contig-ec and see if that changes it.

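| file | format | type | num_seqs | sum_len | min_len | avg_len | max_len | Q1 | Q2 | Q3 | sum_gap | N50 | N50_num | Q20(%) | Q30(%) | AvgQual | GC(%) |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| SNPE.fa | FASTA | DNA | 202 | 1,316,215,448 | 17,334 | 6,515,918.1 | 145,371,723 | 66,469 | 248,476.5 | 5,230,195 | 0 | 31,483,559 | 11 | 0 | 0 | 0 | 43.53 |
| SNPE_HiC_scaffolds_final.fa | FASTA | DNA | 2,270 | 1,316,215,448 | 1,000 | 579,830.6 | 23,668,000 | 51,730 | 89,232.5 | 237,000 | 0 | 3,504,000 | 100 | 0 | 0 | 0 | 43.53 |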

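By using --no-contig-ec, I do get an improved assembly:

| file | format | type | num_seqs | sum_len | min_len | avg_len | max_len | Q1 | Q2 | Q3 | sum_gap | N50 | N50_num | Q20(%) | Q30(%) | AvgQual | GC(%) |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| SNPE_L3_nomito_purged2.np2_HiC_2_scaffolds_final.fa | FASTA | DNA | 196 | 1,316,216,048 | 17,334 | 6,715,388 | 183,056,701 | 64,108 | 223,720 | 3,235,499 | 0 | 42,376,713 | 9 | 0 | 0 | 0 | 43.53 |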
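For readers comparing the N50 and N50_num columns above, here is a minimal sketch of how those two numbers are commonly computed from a list of sequence lengths. It uses toy lengths, not the assemblies in this comment, and the exact tie-breaking used by the tool that produced the table is an assumption.

```python
def n50_stats(lengths):
    """Return (N50, N50_num) for a non-empty list of sequence lengths.

    N50 is the length of the sequence at which the running total, taken
    from longest to shortest, first reaches half of the summed length.
    N50_num is how many sequences were needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("lengths must be non-empty")


if __name__ == "__main__":
    before = [50, 30, 20]      # toy contig lengths
    after = [25, 25, 30, 20]   # same total, longest contig broken in two
    print("before breaks:", n50_stats(before))
    print("after breaks: ", n50_stats(after))
```

Breaking sequences leaves the total length unchanged but lowers N50 and raises the sequence count, which is the pattern in the first table, where sum_len is identical for both files.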

In my case, my HiC data is from a different individual as that was the only option. What is the best way to deal with that issue?

Thanks!