Open Overcraft90 opened 2 years ago
Could you please share the log files for both cases? If the non-Hi-C assemblies are balanced in size, the Hi-C assemblies should be fine too. One possibility: are you using Hi-C reads from the same sample? See the FAQ here: https://hifiasm.readthedocs.io/en/latest/faq.html#how-can-i-tweak-parameters-to-improve-hi-c-integrated-assembly.
Unfortunately, I didn't save the screen output; however, the reason you just mentioned is very likely the cause. Can I access the log file from somewhere else?
P.S. I'm not working in a Conda environment.
Thanks
You can just rerun hifiasm with the same bin files, which is quite fast.
I'm not quite sure how to do that with bin files; however, I can confirm that your hypothesis that I might have used the same source of Hi-C information was correct.
After double-checking online, I realised I had downloaded the data for only one of the parental samples. Thanks a lot, I will keep you updated if any other problem arises. Anyway, it would be helpful if you could tell me how to run hifiasm with bin files, in case I need to post log files in the future to get more complex issues fixed.
If you already have bin files (see: https://hifiasm.readthedocs.io/en/latest/faq.html?highlight=bin#what-s-the-usage-of-different-bin-files-in-hifiasm), just rerun hifiasm with the same -o option.
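A minimal sketch of the rerun described above, assuming hifiasm writes its screen output to standard error and that the bin files from the first run still sit next to the -o prefix. The function name hifiasm_rerun, the RUN switch, the .rerun.log suffix and the paths are hypothetical placeholders, not part of hifiasm. By default the function only prints the command it would run.

```shell
#!/bin/sh
# Sketch only: rerun hifiasm with the SAME -o prefix so the existing bin files
# are reused, and save the screen output (assumed to be on stderr) to a log.
# hifiasm_rerun, RUN and the .rerun.log suffix are hypothetical names.

hifiasm_rerun() {
    prefix=$1; shift                  # first argument: the original -o prefix
    log="${prefix}.rerun.log"
    # Warn (on stderr) if no bin files are found next to the prefix.
    ls "${prefix}".*.bin >/dev/null 2>&1 ||
        echo "note: no bin files at ${prefix}.*.bin; hifiasm would start from scratch" >&2
    if [ "${RUN:-0}" = "1" ]; then
        # Real run: remaining arguments are the options of the original run.
        hifiasm -o "$prefix" "$@" 2> "$log"
    else
        # Dry run: print the command instead of executing it.
        echo "hifiasm -o $prefix $* 2> $log"
    fi
}

# Placeholder prefix and read file; pass the same options as the first run.
hifiasm_rerun out/asm -t16 reads.fastq.gz
```

Setting RUN=1 executes the command for real; the resulting log file is what can then be attached to an issue.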
Thanks a lot for your help. It just happened that running with Hi-C (both parental samples) resulted in an error:
*** stack smashing detected ***: terminated Aborted (core dumped)
I looked it up online and this problem seems related to the GPU running out of memory. I should say that I'm running hifiasm on a DELL Precision 7750 equipped with an NVIDIA Quadro RTX 4000 (8 GB dedicated) and 64 GB of RAM. It also has an Intel Core i9-10885H @ 2.40 GHz × 16.
I really don't want to bother you with this type of issue, but I've noticed you answered similar questions, so I was wondering whether you think this machine can run this type of analysis, considering the studied species (A. thaliana) and the very high coverage (~157×). Thanks again!
Hi chhylp123,
Did you have a chance to look at my last comment? In the meantime, I'll leave you with the plots for HiFi (first row) and Hi-C (second row), respectively.
As you can see, hap1_hi-c is far worse than hap1 with HiFi only. Let me know, thanks.
Sorry, I missed your previous question. You probably need more RAM to assemble HiFi reads at such high coverage, or you could use somewhat lower coverage. I guess the assembly with 100x reads will be very similar to that with 150x reads.
For the second question, are you still using the wrong Hi-C reads? If yes, the hap1_hi-c is definitely worse than hap1 with HiFi only.
No worries at all, no need to apologise. I will try to downsample in order to reduce the coverage.
However, I just spoke with the authors of the paper where they used the Hi-C information I downloaded to assemble haplotype-resolved genome sequences for both A. thaliana haplotypes. They said that I'm using the correct Hi-C reads; with that said I will try to assemble one more time and run a BUSCO, it might be I'm doing something wrong...
This time I will make sure to have a .log file and I will quote my command line, so you can have a better view of the situation.
Thanks, Matteo
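As a rough sketch of the downsampling step discussed above: going from ~157× to ~100× means keeping about 100/157 of the reads. The example below computes that fraction and prints a seqtk sample command for random subsampling. The function name downsample_fraction, the seed value and the output file name are hypothetical, and the 157× figure is the estimate given in this thread, not a measured value. Random subsampling is only one option; it is not something the hifiasm author prescribed here.

```shell
#!/bin/sh
# Sketch only: compute the fraction of reads to keep when lowering coverage.
# downsample_fraction is a hypothetical helper, not part of any tool.

downsample_fraction() {
    # $1 = target coverage, $2 = current coverage; result is capped at 1.00
    LC_ALL=C awk -v t="$1" -v c="$2" \
        'BEGIN { f = t / c; if (f > 1) f = 1; printf "%.2f\n", f }'
}

frac=$(downsample_fraction 100 157)

# Print (do not run) a seqtk command that keeps that fraction of the reads.
# -s sets the random seed; the output name is a placeholder.
echo "seqtk sample -s11 arabidopsis_thaliana/HiFi/CRR302668.fastq.gz $frac | gzip > CRR302668.100x.fastq.gz"
```

The same seed gives the same subset on every run, which keeps a later rerun comparable.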
Hi again,
I got the result of the second run. This time I can attach the log file; however, I will first show the command line I used:
hifiasm -o arabidopsis_thaliana_hic/arabidopsis_thaliana_hic -t16 --h1 arabidopsis_thaliana/Hi-C/CRR302669_f1.fastq.gz --h2 arabidopsis_thaliana/Hi-C/CRR302669_r2.fastq.gz arabidopsis_thaliana/HiFi/CRR302668.fastq.gz
Here is the screen output from running that command: output.log
With that said, something strange seems to have happened. When I first generated hic.hap1.p_ctg.gfa and hic.hap2.p_ctg.gfa, they were both the right size (196.5 and 165.7 MB, respectively); for some reason, after I booted back into Linux, hap1 was only 66.1 MB, which is what produces the poor BUSCO gene completeness assessment I showed you before.
Do you have any idea why this happens? Could it be related to my machine being a dual boot Win/Lin... I don't think so but I'm not an expert. Maybe you can spot the issue in the log file. Let me know, thanks!
Matteo
I don't think the machine will affect the results. Hifiasm will output exactly the same results if you give it the same command line. You should probably check whether you are using different bin files, or whether there is some other problem. Could you also let me know the coverage of your dataset relative to the haploid genome size? Is it 280x? Based on the log file, it looks like hifiasm identified the wrong threshold for the homozygous peak.
Hi again after a long while,
I'm almost sure the problem was caused by the files being stored incorrectly online. They might have swapped a short-read dataset with a Hi-C one (although I'm not sure about that). The good news is that, since my goal was to learn the procedure, I moved on to testing the approach on four human individuals.
They all returned good BUSCO scores for both haplotypes and I'm now running Merqury to assess overall assembly completeness.
However, coming back to your question, I think the coverage of my A. thaliana dataset is ~157× overall. Now that you mention it, I had also noticed the peak at 280, which seemed odd, but I wasn't sure how to interpret it.
You could simply set --hom-cov to 157.
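For illustration, this is how the suggestion might look when added to the command line quoted earlier in the thread. The helper with_hom_cov is hypothetical and only prints the command; --hom-cov is the hifiasm option named above, and 157 is the coverage estimate from this thread.

```shell
#!/bin/sh
# Sketch only: print a hifiasm command with an explicit homozygous coverage.
# with_hom_cov is a hypothetical helper; it does not run hifiasm.

with_hom_cov() {
    cov=$1; shift
    # Accept only a plain integer for the coverage value.
    case "$cov" in
        ''|*[!0-9]*) echo "hom-cov must be an integer" >&2; return 1 ;;
    esac
    echo "hifiasm --hom-cov $cov $*"
}

# Same inputs as the command quoted earlier in the thread.
with_hom_cov 157 \
    -o arabidopsis_thaliana_hic/arabidopsis_thaliana_hic -t16 \
    --h1 arabidopsis_thaliana/Hi-C/CRR302669_f1.fastq.gz \
    --h2 arabidopsis_thaliana/Hi-C/CRR302669_r2.fastq.gz \
    arabidopsis_thaliana/HiFi/CRR302668.fastq.gz
```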
Thanks a lot, I see what you mean. My first approach was to run the tool without any additional parameters, to get a feel for the raw output before doing any fine-tuning. I will try again and see how it performs with that setting.
Thanks for your help, in general! I think I solved my problem and maybe I/we can close this Issue.
Hi,
I have followed the whole discussion under this question. I have a somewhat similar situation: without Hi-C data, my assembly gave balanced haplotypes (both with a BUSCO of 90.1%). However, integrating the Hi-C data produced imbalanced assemblies (hap1: C:87.8%[S:86.4%,D:1.4%],F:0.7%,M:11.5%; hap2: C:97.6%[S:96.5%,D:1.1%],F:0.6%,M:1.8%). The hap1 has a larger missing BUSCO value than hap2. In addition, hap1 (~430 Mb) is ~50 Mb shorter than hap2 (~484 Mb).
I wonder whether this is a big difference between the two haplotypes, and what could explain it?
Hi,
I successfully assembled an A. thaliana genome for which I obtained two partially phased haplotypes of size 148.1 Mb (hap1) and 146.4 Mb (hap2), respectively. These files are in .fasta format.
However, when I integrate my HiFi (CCS) data with Hi-C information, the raw output for hap1, .asm.hic.hap1.p_ctg.gfa, is only 66.1 Mb, which gives a .fasta file of only 65.6 Mb. How is that possible? For the other haplotype, hap2, the tool seems to work just fine, outputting a .fasta of 161.1 Mb (bigger than when using HiFi data alone).
This is important because I think it is the root of all the problems I'm having while performing quality checks on the assembly, such as BUSCO and Merqury.
Let me know what I can do about that. In the meantime, I'm trying to assemble the genome again, hoping for a better result.
Thanks in advance, Matteo
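When comparing haplotype sizes as in the post above, it can help to count assembled bases directly rather than rely on file sizes. The sketch below assumes the GFA stores each contig sequence in the third tab-separated field of its S lines; the function names gfa_total_bp and gfa_to_fasta are hypothetical, and the file names in the usage lines are placeholders.

```shell
#!/bin/sh
# Sketch only: measure and convert a GFA assembly with awk.
# Assumes each S line is "S<TAB>name<TAB>sequence[...]" with a real sequence
# (not "*") in field 3. Function names are hypothetical.

gfa_total_bp() {
    # Sum the sequence lengths of all S (segment) lines.
    awk -F'\t' '$1 == "S" { total += length($3) } END { print total + 0 }' "$1"
}

gfa_to_fasta() {
    # Emit one FASTA record per S line: header from field 2, sequence from field 3.
    awk -F'\t' '$1 == "S" { print ">" $2; print $3 }' "$1"
}

# Usage with placeholder file names:
#   gfa_total_bp asm.hic.hap1.p_ctg.gfa
#   gfa_to_fasta asm.hic.hap1.p_ctg.gfa > asm.hic.hap1.p_ctg.fa
```

Running gfa_total_bp on both haplotype files gives base counts that can be compared like for like.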