Chunk taxonomy table for species assignment

biobakery / biobakery_workflows

bioBakery workflows is a collection of workflows and tasks for executing common microbial community analyses using standardized, validated tools and parameters.

http://huttenhower.sph.harvard.edu/biobakery_workflows


Chunk taxonomy table for species assignment #35

Closed: tkuntz-hsph closed this 3 months ago

tkuntz-hsph commented 3 months ago

DADA2 has a known issue in some versions of R where garbage collection does not run correctly during species assignment, leading to large memory requirements. This change avoids the problem by splitting the table into chunks and processing them in sequence.
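For readers who want the general pattern, here is a minimal sketch in Python. It is not the R code from this pull request. The names `assign_in_chunks` and `assign_fn` are hypothetical; `assign_fn` stands in for whatever per-chunk species assignment call is used. The default chunk size of 4000 is taken from the discussion in this thread.

```python
import gc


def assign_in_chunks(sequences, assign_fn, chunk_size=4000):
    """Apply assign_fn to consecutive slices of sequences, one slice at a time.

    assign_fn is a hypothetical callable that takes a list of sequences and
    returns one result per sequence. Results are concatenated in input order.
    """
    results = []
    for start in range(0, len(sequences), chunk_size):
        chunk = sequences[start:start + chunk_size]
        results.extend(assign_fn(chunk))
        # Explicitly request garbage collection so memory held by the
        # finished chunk can be released before the next chunk starts.
        gc.collect()
    return results
```

Because only one chunk is in flight at a time, peak memory should scale with the chunk size rather than with the full table, assuming the per-chunk call releases its working memory.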

ljmciver commented 3 months ago

Thank you @tkuntz-hsph ! Looks great.

Would we ever have a case where there are fewer than 4000 items to chunk? If so, would a single chunk smaller than 4000 items be handled correctly? I am just wondering whether we need to add some code to account for that case.

tkuntz-hsph commented 3 months ago

This will result in one chunk if there are fewer than 4K representative sequences, so it'll run correctly. It also accounts for the case where the number of sequences is only slightly larger than the chunk size, and will split the sequences into two roughly equal chunks.
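The behavior described above can be sketched as follows. This is an illustrative Python version, not the R code in this pull request, and `balanced_chunk_bounds` is a hypothetical helper name. It assumes the chunk count is the ceiling of the item count divided by the maximum chunk size, with items spread as evenly as possible across the chunks.

```python
import math


def balanced_chunk_bounds(n_items, max_chunk_size=4000):
    """Return (start, end) index pairs for roughly equal chunks.

    No chunk exceeds max_chunk_size, and chunk sizes differ by at most one.
    """
    if n_items <= 0:
        return []
    # Smallest number of chunks that keeps every chunk within the limit.
    n_chunks = math.ceil(n_items / max_chunk_size)
    # Spread the remainder over the first `extra` chunks.
    base, extra = divmod(n_items, n_chunks)
    bounds = []
    start = 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds
```

With 3000 items this yields a single chunk covering everything. With 4001 items it yields two chunks of 2001 and 2000 items, rather than one chunk of 4000 followed by a chunk of 1.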

ljmciver commented 3 months ago

Fantastic! Thank you!