Incomplete Hi-C assembly?

Overcraft90 commented 2 years ago

Hi,

I successfully assembled an A. thaliana genome for which I obtained two partially phased haplotypes of size 148.1 Mb (hap1) and 146.4 Mb (hap2), respectively. These files are in .fasta format.

However, when I integrate my HiFi (CCS) data with Hi-C information, the raw output for hap1, .asm.hic.hap1.p_ctg.gfa, is only 66.1 Mb, which gives a .fasta file of only 65.6 Mb. How is that possible? For the other haplotype, hap2, the tool seems to work just fine, outputting a .fasta of 161.1 Mb (bigger than when using HiFi data alone).
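
File size on disk is only a rough proxy for assembly size, because a GFA file also holds non-sequence records and tags. As a minimal sketch (not part of hifiasm; the function names and the tiny example records are made up for illustration), the Python below reads the segment (S) lines of a GFA, turns them into FASTA text, and counts the bases. It assumes each S line stores its sequence in the third tab-separated column.

```python
from typing import Iterable, Iterator, Tuple


def gfa_segments(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) for every segment (S) line of a GFA file."""
    for line in lines:
        if line.startswith("S\t"):
            fields = line.rstrip("\n").split("\t")
            yield fields[1], fields[2]


def gfa_to_fasta(lines: Iterable[str]) -> Tuple[str, int]:
    """Return the FASTA text and the total number of bases it contains."""
    records = []
    total = 0
    for name, seq in gfa_segments(lines):
        records.append(f">{name}\n{seq}\n")
        total += len(seq)
    return "".join(records), total


# Tiny made-up GFA, only to show the shape of the input.
example = [
    "S\th1tg000001l\tACGTACGT\tLN:i:8\n",
    "A\th1tg000001l\t0\t+\tread1\t0\t8\tid:i:0\n",
    "S\th1tg000002l\tGGCC\tLN:i:4\n",
]
fasta, total_bases = gfa_to_fasta(example)
print(total_bases)
```

For a real file, the same function could be fed an open file handle, and the base counts for hap1 and hap2 compared directly.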

This is important because I think it is the root of all the problems I'm having while running quality checks on the assembly, such as BUSCO and Merqury.

Let me know what I can do about that. In the meantime, I'm trying to assemble the genome again, hoping for a better result.

Thanks in advance, Matteo

chhylp123 commented 2 years ago

Could you please share the log files in both cases? If the sizes of the non-Hi-C assemblies are balanced, the Hi-C assemblies should also be OK. One possibility to check: are you using Hi-C reads from the same sample? See the FAQ here: https://hifiasm.readthedocs.io/en/latest/faq.html#how-can-i-tweak-parameters-to-improve-hi-c-integrated-assembly.

Overcraft90 commented 2 years ago

Unfortunately, I didn't save the screen output; however, it is very likely for the reason you just mentioned. Can I access the log file from somewhere else?

P.S. I'm not working in a Conda environment

Thanks

chhylp123 commented 2 years ago

You can just rerun hifiasm with the same bin files, which is quite fast.

Overcraft90 commented 2 years ago

I'm not quite sure how to do it with bin files...; however, I can confirm that your hypothesis about the source of my Hi-C reads was correct.

After double-checking online, I realised I had downloaded the data for only one of the parental samples. Thanks a lot, I will keep you updated if any other problem arises. Anyway, it would be interesting if you could tell me how to run hifiasm with bin files, just in case I need to post log files in the future to get more complex issues fixed.

chhylp123 commented 2 years ago

If you already have bin files (see: https://hifiasm.readthedocs.io/en/latest/faq.html?highlight=bin#what-s-the-usage-of-different-bin-files-in-hifiasm), just rerun hifiasm with the same option -o.
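
hifiasm prints its log to stderr, so a rerun only needs that stream redirected to a file to get something that can be attached to an issue. The sketch below shows one way to do that from Python. `run_with_log` is a hypothetical helper, not part of hifiasm, and the demo uses the Python interpreter itself as a stand-in command so that it runs without hifiasm installed.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List


def run_with_log(cmd: List[str], log_path: Path) -> int:
    """Run cmd, write its stderr to log_path, and return the exit code."""
    with open(log_path, "w") as log:
        return subprocess.run(cmd, stderr=log, check=False).returncode


# Stand-in command so the sketch runs anywhere.
# For a real rerun, cmd would be the original hifiasm invocation,
# with the same -o prefix so that the existing bin files are reused.
demo_cmd = [sys.executable, "-c", "import sys; sys.stderr.write('demo log line')"]

with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "rerun.log"
    exit_code = run_with_log(demo_cmd, log_file)
    log_text = log_file.read_text()

print(exit_code, log_text)
```

On the command line, the equivalent is simply appending `2> some_name.log` to the hifiasm command.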

Overcraft90 commented 2 years ago

Thanks a lot for your help. It just happened that running with Hi-C (both parental samples) resulted in an error:

*** stack smashing detected ***: terminated
Aborted (core dumped)

I looked this up online, and the problem seems to be related to the GPU running out of memory. I'm running hifiasm on a DELL Precision 7750 equipped with an NVIDIA Quadro RTX 4000 (8 GB dedicated) and 64 GB of RAM. The machine also has an Intel Core i9-10885H @ 2.40 GHz × 16.

I really don't want to bother you with this type of issue, but I've noticed you answered similar questions, so I was wondering: do you think this machine can run this type of analysis, considering the studied species (A. thaliana) and the very high coverage (~157×)? Thanks again!

Overcraft90 commented 2 years ago

Hi chhylp123,

Did you have a chance to look at my last comment? In the meantime, I leave you with the plots for HiFi (first row) and Hi-C (second row), respectively.

[Image: BUSCO plots]

As you can see, hap1_hi-c is far worse than hap1 with HiFi only. Let me know, thanks.

chhylp123 commented 2 years ago

Sorry, I missed your previous question. You probably need more RAM to assemble such high-coverage HiFi reads, or you could use slightly lower coverage. I guess the assembly with 100x reads will be very similar to that with 150x reads.
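
To make the downsampling suggestion concrete: going from the ~157x reported earlier in this thread to about 100x means keeping roughly 64% of the reads. The sketch below computes that fraction. `downsample_fraction` is a hypothetical helper, and it assumes reads are kept uniformly at random, so that coverage scales with the fraction of reads retained.

```python
def downsample_fraction(current_cov: float, target_cov: float) -> float:
    """Fraction of reads to keep to go from current_cov to target_cov."""
    if current_cov <= 0 or target_cov <= 0:
        raise ValueError("coverage values must be positive")
    # Never ask for more reads than are available.
    return min(1.0, target_cov / current_cov)


fraction = downsample_fraction(157, 100)
print(f"keep about {fraction:.1%} of the reads")
```

The resulting fraction could then be passed to a read subsampling tool such as `seqtk sample`; the choice of tool is an assumption here, since the thread does not name one.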

For the second question, are you still using the wrong Hi-C reads? If so, hap1_hi-c will definitely be worse than hap1 with HiFi only.

Overcraft90 commented 2 years ago

No worries at all, no need to apologise. I will try to downsample in order to reduce the coverage.

However, I just spoke with the authors of the paper where they used the Hi-C information I downloaded to assemble haplotype-resolved genome sequences for both A. thaliana haplotypes. They said that I'm using the correct Hi-C reads; with that said I will try to assemble one more time and run a BUSCO, it might be I'm doing something wrong...

This time I will make sure to have a .log file and I will quote my command line, so you can have a better view of the situation.

Thanks, Matteo

Overcraft90 commented 2 years ago

Hi again,

I got the result of the second run. This time I can attach the log file; first, though, here is the command line I used:

hifiasm -o arabidopsis_thaliana_hic/arabidopsis_thaliana_hic -t16 --h1 arabidopsis_thaliana/Hi-C/CRR302669_f1.fastq.gz --h2 arabidopsis_thaliana/Hi-C/CRR302669_r2.fastq.gz arabidopsis_thaliana/HiFi/CRR302668.fastq.gz

Here is the screen output from running that command: output.log

With that said, it seems that something strange happened. When I first generated hic.hap1.p_ctg.gfa and hic.hap2.p_ctg.gfa, they were both the right size (196.5 and 165.7 Mb, respectively); for some reason, after I booted back into Linux, hap1 was only 66.1 Mb, which is what produces the poor BUSCO gene completeness that I showed you before.

Do you have any idea why this happens? Could it be related to my machine being a dual boot Win/Lin... I don't think so but I'm not an expert. Maybe you can spot the issue in the log file. Let me know, thanks!

Matteo

chhylp123 commented 2 years ago

I don't think the machine will affect the results. Hifiasm will output exactly the same results if you give it the same command line. You should probably check whether you are using different bin files, or whether there is some other problem. And could you please let me know the coverage of your dataset relative to the haploid genome size? Is it 280x? Based on the log file, it looks like hifiasm identified the wrong threshold for the homozygous peak.

Overcraft90 commented 2 years ago

Hi again after a long while,

I'm almost sure the problem was related to files being stored incorrectly online. They might have swapped a short-read dataset with a Hi-C one (however, I'm not sure about that). The good news is that, since my aim was to learn the procedure, I moved on to testing the approach on four human individuals.

They all returned good BUSCO scores for both haplotypes and I'm now running Merqury to assess overall assembly completeness.

However, coming back to your question, I think the coverage of my A. thaliana dataset is ~157× overall. Now that you mention it, I also noticed the peak at 280, which seemed odd, but I was not sure how to interpret it.

chhylp123 commented 2 years ago

You could simply set --hom-cov to 157.
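
As an illustration of that suggestion, the sketch below rebuilds the Hi-C command posted earlier in this thread with `--hom-cov 157` added. `hifiasm_command` is a hypothetical helper that only assembles and prints the argument list and does not run hifiasm. The flags themselves (`-o`, `-t`, `--h1`, `--h2`, `--hom-cov`) are the ones already mentioned in the thread.

```python
import shlex
from typing import List, Optional


def hifiasm_command(
    prefix: str,
    hifi_reads: str,
    hic_r1: str,
    hic_r2: str,
    threads: int = 16,
    hom_cov: Optional[int] = None,
) -> List[str]:
    """Assemble a hifiasm Hi-C command line as a list of arguments."""
    cmd = ["hifiasm", "-o", prefix, f"-t{threads}"]
    if hom_cov is not None:
        # Override the automatically detected homozygous coverage peak.
        cmd += ["--hom-cov", str(hom_cov)]
    cmd += ["--h1", hic_r1, "--h2", hic_r2, hifi_reads]
    return cmd


cmd = hifiasm_command(
    prefix="arabidopsis_thaliana_hic/arabidopsis_thaliana_hic",
    hifi_reads="arabidopsis_thaliana/HiFi/CRR302668.fastq.gz",
    hic_r1="arabidopsis_thaliana/Hi-C/CRR302669_f1.fastq.gz",
    hic_r2="arabidopsis_thaliana/Hi-C/CRR302669_r2.fastq.gz",
    hom_cov=157,
)
print(shlex.join(cmd))
```

Printing the command first makes it easy to check the value before committing to a long run.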

Overcraft90 commented 2 years ago

Thanks a lot, I see what you mean. My first approach was to run the tool without any additional parameters to get a feel for the raw output before eventually doing some fine-tuning. I will try again and see how it performs with that setting.

Thanks for your help, in general! I think I solved my problem and maybe I/we can close this Issue.

dandanWang2019 commented 4 months ago

Hi,

I have followed all the discussion under this question. My assembly shows a somewhat similar situation: without Hi-C data, the haplotypes are balanced (BUSCO of 90.1% for both). However, integrating the Hi-C data produced imbalanced assemblies (hap1: C:87.8%[S:86.4%,D:1.4%],F:0.7%,M:11.5%; hap2: C:97.6%[S:96.5%,D:1.1%],F:0.6%,M:1.8%). hap1 has a larger fraction of missing BUSCOs than hap2. In addition, hap1 (~430 Mb) is ~50 Mb shorter than hap2 (~484 Mb).
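
To put numbers on that comparison, the sketch below parses the two BUSCO summary strings quoted above and computes the gap in missing BUSCOs and in assembly size. `parse_busco` is a hypothetical helper written for this short notation only, and the sizes are the rounded values from this comment.

```python
import re
from typing import Dict

# Matches fields such as "C:87.8%" or "M:11.5%" in a BUSCO short summary.
BUSCO_FIELD = re.compile(r"([CSDFM]):([0-9.]+)%")


def parse_busco(summary: str) -> Dict[str, float]:
    """Map each BUSCO category letter to its percentage."""
    return {key: float(value) for key, value in BUSCO_FIELD.findall(summary)}


hap1 = parse_busco("C:87.8%[S:86.4%,D:1.4%],F:0.7%,M:11.5%")
hap2 = parse_busco("C:97.6%[S:96.5%,D:1.1%],F:0.6%,M:1.8%")

# Difference in missing BUSCOs, in percentage points.
missing_gap = round(hap1["M"] - hap2["M"], 1)

# Size difference from the rounded sizes (Mb), absolute and relative to hap2.
size_gap_mb = 484 - 430
size_gap_pct = round(100 * size_gap_mb / 484, 1)

print(missing_gap, size_gap_mb, size_gap_pct)
```

With the rounded sizes, the gap comes out at about 54 Mb, or roughly 11% of hap2.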

I wonder whether this is a big difference between the two haplotypes, and what could be causing it?

chhylp123 / hifiasm

Incomplete Hi-C assembly? #249