fmalmeida / MpGAP

Multi-platform genome assembly pipeline for Illumina, Nanopore and PacBio reads
https://mpgap.readthedocs.io/en/latest/
GNU General Public License v3.0
56 stars 10 forks source link

Add hifiasm assembler #70

Closed scintilla9 closed 7 months ago

scintilla9 commented 7 months ago

Hi I've made few changes to incorporate Hifiasm for long read assembler. The environment was tested by building a local singularity sif from Dockerfile and running with -profile singularity. I can successfully generated a hifiasm-assembled genome and polished it with gcpp.

Please check if it worked, and I am not sure which branch could be used to merge, so I selected the latest developing branch.

fmalmeida commented 7 months ago

Hi @scintilla9 , Many thanks for the PR. I will review it during this week. I will probably create a new branch which I can merge your code before merging to dev, so that, if I have to modify something due conflicts, I can easily do so.

Thanks.

fmalmeida commented 7 months ago

One a very first (fast) assessment, the code looks awesome, fitting perfectly with the pipe. So, I will focus in reviewing it and merging it during this week, so I can roll-out a new release in the next 14 days. There is already quite some important fixes I added in dev which I would like to roll out asap.

fmalmeida commented 7 months ago

Hi @scintilla9 , Which testing data did you use? With the 15X subsets I have, it does not work. It generates an empty assembly file most likely due non-sufficient data. I would like to generate at least two full runs before merging, so I can confidently make the small testing profiles skip this assembler. 😄

scintilla9 commented 7 months ago

Hi @fmalmeida , I tested on my own pacbio data. After converting bam into CCS reads with ngs-preprocess, it has about 66M bases, and yes, a very high coverage dataset for the target species (~1200X).

fmalmeida commented 7 months ago

Hi @scintilla9 so I will try to find something public to test it. Thx.

Another thing is, maybe would be worthy to do one of the following?

I have some considerations on the sense that, if the tool is proper only for high quality reads and would fail on the more general (frequent) runs, then, would make sense to not always run it for everything.

What do you think? Any ideas?

scintilla9 commented 7 months ago

Hi @fmalmeida According to the instruction of Hifiasm, typically having at least 13X coverage with HiFi reads should be sufficient. Not sure why assembly failed when using a 15X subset.

I think the first option will be more straightforward, however, will the inconsistency of having most of the assemblers were run by default while one skippied by default confuse users?

fmalmeida commented 7 months ago

My subset is 15X but is not hifi, they are normal reads. And yes, I think it may confuse users so that is why I am wondering on what. If (by the name) it is only for hifi, maybe the second scenario would be the most appropriate to skip it when not running with high quality long reads.

fmalmeida commented 7 months ago

Code has been working fine. Docs were updated. Testing profile for hifi reads subset has been added. So, I will merge this to dev, and start running some testings with full datasets to guarantee it is proper for a new release.

Finally, now that hifiasm is included, I will open follow-up tickets to investigate how the pipeline can be modified to allow one to also pass on parental data or hi-c data as specified in the tool's webpage.