dahak-metagenomics / dahak

benchmarking and containerization of tools for analysis of complex non-clinical metagenomes.
https://dahak-metagenomics.github.io/dahak
BSD 3-Clause "New" or "Revised" License

Add assembly Snakefile + singularity #91

Closed charlesreid1 closed 6 years ago

charlesreid1 commented 6 years ago

This pull request builds on #83 (add read filtering and taxonomic classification workflows), so the two should be merged together.

The changes implemented in this pull request are documented in the assembly documentation, which is added in #92 (add assembly documentation), so this PR should also be merged together with #92.

Merge Checklist

charlesreid1 commented 6 years ago

There's been a bit of a lull in the commits on this PR because testing assembly takes a really long time, but things are still going okay so far with both the megahit and metaspades workflows.

UPDATE: The metaspades test failed because it ran out of memory on an m5.2xlarge (8 vCPUs, 32 GB of memory). At one point the program reported that it would require 28 GB, but it's not clear whether it ran out of memory because that was just an estimate, or because a later step increased the memory requirements. In any case, we are re-running metaspades on an m5.4xlarge (16 vCPUs, 64 GB of memory). That's pretty beefy.
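
For reference, here's a minimal sketch of how the metaspades rule can pass an explicit thread and memory budget through to metaspades.py via its -t and -m flags, so the limit matches the instance size. The file paths, sample wildcard, and container tag below are placeholders, not the exact rule in this PR's Snakefile:

```
rule metaspades_assembly:
    # Hypothetical paths/container; the real rule lives in the Snakefile on this branch.
    input:
        fwd="data/{sample}_1.trim.fq.gz",
        rev="data/{sample}_2.trim.fq.gz"
    output:
        contigs="assembly/metaspades/{sample}.contigs.fa"
    threads: 16
    resources:
        mem_gb=64
    singularity:
        "docker://quay.io/biocontainers/spades:3.12.0--1"   # tag is an assumption
    shell:
        # -t caps the thread count and -m sets SPAdes' memory limit in GB,
        # so the assembler's budget matches what the instance actually has.
        "metaspades.py -1 {input.fwd} -2 {input.rev} "
        "-t {threads} -m {resources.mem_gb} "
        "-o assembly/metaspades/{wildcards.sample} && "
        "cp assembly/metaspades/{wildcards.sample}/contigs.fasta {output.contigs}"
```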

charlesreid1 commented 6 years ago

UPDATE: The metaspades test failed again on an m5.4xlarge (16 vCPUs, 64 GB of memory) due to the same problem, raising an exception about memory allocation. This time it took ~24 hours to hit the error. The confusing part is that metaspades claims it only needs about 28 GB of memory. I plan to re-run this job on a high-memory AWS instance so we can crank up the amount of memory available.

brooksph commented 6 years ago

Which data set(s) are you running this on? The complete data or a subset? I think we can test on the 10% subset.
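
For the record, here's a minimal sketch of how a 10% subsample could be generated with seqtk (file names and the seed are placeholders); using the same seed on both mates keeps the read pairs in sync:

```
rule subsample_reads:
    # Hypothetical paths; sketch of generating the 10% test subset with seqtk.
    input:
        fwd="data/{sample}_1.fq.gz",
        rev="data/{sample}_2.fq.gz"
    output:
        fwd="data/{sample}_1.sub10.fq.gz",
        rev="data/{sample}_2.sub10.fq.gz"
    shell:
        # The same -s seed on both files keeps the read pairs matched.
        "seqtk sample -s 42 {input.fwd} 0.1 | gzip > {output.fwd} && "
        "seqtk sample -s 42 {input.rev} 0.1 | gzip > {output.rev}"
```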

charlesreid1 commented 6 years ago

Per a Slack conversation with @brooksph: I was using the full reads instead of the subsampled reads. 🤦‍♂️

charlesreid1 commented 6 years ago

Tests (megahit and metaspades) were both successful. This PR is ready to merge.

brooksph commented 6 years ago

Assembly with MEGAHIT is good to go. SPAdes is still running.
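
For reference, a minimal sketch of what a containerized MEGAHIT rule might look like (paths and container tag are assumptions, not the exact rule on this branch). One quirk worth noting: MEGAHIT refuses to run if its output directory already exists, so the rule clears out any partial run first.

```
rule megahit_assembly:
    # Hypothetical paths/container; the real rule lives in the Snakefile on this branch.
    input:
        fwd="data/{sample}_1.trim.fq.gz",
        rev="data/{sample}_2.trim.fq.gz"
    output:
        contigs="assembly/megahit/{sample}.contigs.fa"
    threads: 8
    singularity:
        "docker://quay.io/biocontainers/megahit:1.1.2--py35_0"   # tag is an assumption
    shell:
        # MEGAHIT errors out if -o points at an existing directory,
        # so remove any leftovers from a partial run before starting.
        "rm -rf assembly/megahit/{wildcards.sample} && "
        "megahit -1 {input.fwd} -2 {input.rev} -t {threads} "
        "-o assembly/megahit/{wildcards.sample} && "
        "cp assembly/megahit/{wildcards.sample}/final.contigs.fa {output.contigs}"
```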

brooksph commented 6 years ago

SPAdes is also ready, so both read filtering and assembly are good to go, having completed successfully using the scripts on this branch. We also have approval to store the kaiju databases on S3, which should resolve our taxonomic classification workflow issue.

charlesreid1 commented 6 years ago

I'll go ahead and add the kaiju database to an S3 bucket now...
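
Once the database is in the bucket, the taxonomic classification rules can fetch it instead of rebuilding it locally. A minimal sketch, with a hypothetical bucket name and archive layout (a kaiju index is typically the .fmi file plus nodes.dmp and names.dmp):

```
rule download_kaiju_db:
    # Hypothetical bucket/archive names; sketch of fetching the pre-built kaiju
    # database over public HTTPS so worker nodes don't need AWS credentials.
    output:
        db="data/kaiju/kaiju_db.fmi",
        nodes="data/kaiju/nodes.dmp",
        names="data/kaiju/names.dmp"
    shell:
        "mkdir -p data/kaiju && "
        "curl -L https://s3.amazonaws.com/example-dahak-bucket/kaiju_index.tar.gz "
        "| tar -xz -C data/kaiju"
```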

brooksph commented 6 years ago

Closed in favor of https://github.com/dahak-metagenomics/dahak/pull/95