cschin / Peregrine

Peregrine: Fast Genome Assembler Using SHIMMER Index

Human genome assembly parameters #11

Open kautto opened 5 years ago

kautto commented 5 years ago

I'm trying to assemble a human genome at ~30x coverage from Nanopore reads. We primarily run r5.24xlarge instances on Amazon, which have 96 cores and 768 GB of memory.

For the 48-CPU/384 GB instances, it looks like 24 chunks and 24 CPUs was the recommendation - but since I'm not really sure how chunks correlate to CPUs, I'm not sure what the best parameters would be for 96 cores/768 GB. Would you recommend just using 48 for all of them, or is there a combination that balances memory and CPU usage better?

Sorry if it's a dumb question! Thanks!

cschin commented 5 years ago

@kautto The assembler is designed for accurate long reads (~10 kb length, < 1% error). Do you have such reads from Nanopore? If so, it should work. If not, we may need to wait a bit for Nanopore to come up with a protocol that generates such data.

kautto commented 5 years ago

Hi @cschin, the ONT data is obviously still high-error, even with the "high-accuracy" model I used for basecalling. Our mean read length is ~20 kb, so I had planned to try it regardless and see whether it could still give me results with the noisier data. Even if the results are of poorer quality and/or the assembly takes longer, do you think it should work on noisier data in principle? I'm having issues getting it to run, but those might be completely unrelated to the data itself (and are probably best left for a separate issue).

cschin commented 5 years ago

@kautto Well, the current method won't work well for lower-accuracy reads. I can go over some of the details when there is a chance. We may support lower-accuracy reads in the future. If you can't get it to run, that is a different issue and we should probably move to a separate ticket for it.

kautto commented 5 years ago

Sounds good! We're also planning to test this on Sequel CCS reads - would you recommend any specific parameters for those? The chunk/memory-to-CPU correlation is what I'm unclear on, so I'm not sure how to optimize the numbers.

RE: issues running the software, I'll troubleshoot further and open another issue if needed.

cschin commented 5 years ago

For r5.24xlarge instances on Amazon, you can use the defaults from the README:

This example will work well for about 30x human genome on r5.24xlarge:

find /wd/fastq/ -name "*.fastq" | sort > /wd/seqdata.lst

docker run -it -v /wd:/wd --user $(id -u):$(id -g) cschin/peregrine:0.1.5.3 asm \
    /wd/seqdata.lst 24 24 24 24 24 24 24 24 24 \
    --with-consensus --shimmer-r 3 --best_n_ovlp 8 \
    --output /wd/asm-r3
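
Roughly speaking, the nine positional numbers are (number of chunks, number of processes) pairs for the indexing, overlapping, mapping, and consensus stages, followed by the process count for the final sorting step. The annotation below follows my reading of the pg_run.py usage string, so treat the stage names as a guide:

# Layout of the nine positional arguments (stage names are an annotation
# assumed from the pg_run.py usage, not stated in this thread):
#   24 24   index:     index_nchunk  index_nproc
#   24 24   overlap:   ovlp_nchunk   ovlp_nproc
#   24 24   mapping:   mapc_nchunk   mapc_nproc
#   24 24   consensus: cns_nchunk    cns_nproc
#   24      sort:      sort_nproc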

It will use up to 24 cores concurrently for most steps, except the overlaps-to-graph step, which is a single-core operation and will likely consume about 200 GB to 300 GB of memory (about 90 GB of that from the cached sequence data). Perhaps you can push to 32 cores with around 400 GB of RAM.
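
As a minimal sketch only, scaling every slot up to 32 on your 96-core/768 GB machine would look like the following (32 across the board is an extrapolation from the note above, not a tested recommendation):

# Sketch: same pipeline, 32 chunks/processes per stage (untested extrapolation)
find /wd/fastq/ -name "*.fastq" | sort > /wd/seqdata.lst

docker run -it -v /wd:/wd --user $(id -u):$(id -g) cschin/peregrine:0.1.5.3 asm \
    /wd/seqdata.lst 32 32 32 32 32 32 32 32 32 \
    --with-consensus --shimmer-r 3 --best_n_ovlp 8 \
    --output /wd/asm-r3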