chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

Metagenomics assembly with Hifiasm? #48

Open JeanMainguy opened 3 years ago

JeanMainguy commented 3 years ago

Hi,

I am trying to assemble metagenomic HiFi reads. Is hifiasm suited for metagenomic assembly? If so, do you have any recommended settings for that purpose?

Best, Jean

chhylp123 commented 3 years ago

We are testing hifiasm with metagenomic datasets and making specific modifications. For now, though, it might not work as well as metagenomics-specific tools. I guess the key point is how to sample metagenomic datasets, which the current hifiasm cannot do automatically.

JeanMainguy commented 3 years ago

Ok thank you very much!

xfengnefx commented 3 years ago

@JeanMainguy I've made the testing fork public. You may try it with: hifiasm_meta -t32 -oasm --force-preovec --exp-graph-cleaning reads.fq.gz 2>STDERR.log. This includes the read selection mentioned above and some other graph-cleaning routines (contig generation hasn't been updated yet).

The default thresholds are pretty arbitrary, since only very few datasets are available. Set --lowq-10 higher if it's dropping too many reads, and lower if overlap-error correction takes too long. Please refer to the readme for other switches. May I ask which datasets you are looking at?

JeanMainguy commented 3 years ago

That's awesome, I'll try it as soon as possible. Thank you very much. I am currently playing with the ATCC MSA-1003 mock datasets from the preprint "Highly accurate long-read HiFi sequencing data for five complex genomes" (https://www.biorxiv.org/content/10.1101/2020.05.04.077180v1), but I will soon have freshly sequenced datasets from another mock community and from real environmental samples.

xfengnefx commented 3 years ago

@JeanMainguy We also use the ATCC MSA-1003 for dev. I think you can try pushing the read selection hard, since there's so much redundancy. Getting rid of about half of the reads was acceptable for recovering the backbones of the strains (--low-q 150, as I remember). For real datasets I'm not sure; currently we only have access to two, and they appear to be very different, e.g. in horizontal gene transfers. It will be interesting to see how the heuristics work on more samples.

And by the way, please feel free to open issues on the fork if you have any questions. I'll try to make the latest commit ready to go, but as a dev fork it's not yet documented or stable.
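
For anyone wanting to try several read-selection thresholds, the command and the tuning advice above could be wrapped roughly as follows. This is only a sketch: the function name run_hifiasm_meta and the DRY_RUN variable are made up for illustration, and it assumes that --lowq-10 takes an integer value and that the "--low-q 150" recalled above refers to that same switch. Check the fork's readme before relying on either assumption.

```shell
#!/bin/sh
# Hypothetical helper (not part of hifiasm_meta): builds the command quoted in
# this thread with an adjustable read-selection threshold.
# Assumption: --lowq-10 accepts an integer argument.
run_hifiasm_meta() {
    reads=$1   # input reads, e.g. reads.fq.gz
    lowq=$2    # value passed to --lowq-10

    # Reuse the positional parameters as the argument vector.
    set -- hifiasm_meta -t32 -oasm --force-preovec --exp-graph-cleaning \
        --lowq-10 "$lowq" "$reads"

    if [ "${DRY_RUN:-1}" = 1 ]; then
        # Dry run (the default here): only print the command.
        echo "$*"
    else
        # Real run: hifiasm_meta logs to stderr, as in the original command.
        "$@" 2>STDERR.log
    fi
}

# Prints the command without running it.
run_hifiasm_meta reads.fq.gz 150
```

Calling it with DRY_RUN=0 set in the environment would actually launch the assembly. Per the advice above, a higher value should keep more reads and a lower value should shorten overlap-error correction.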