a-h-b / dadasnake

Amplicon sequencing workflow heavily using DADA2 and implemented in snakemake
GNU General Public License v3.0

Rule `dada_dadaSingle` works only in a single-thread #6

Closed (vmikk closed this issue 3 years ago)

vmikk commented 3 years ago

Hello!

We have 10 sequencing runs, each with ~100 samples. We run Dadasnake on a desktop (in -l mode). Dadasnake works well in the first stages (filtering and error estimation), but when it reaches the dada_dadaSingle rule, it switches to processing samples sequentially instead of in parallel. If we terminate and resume the workflow, Snakemake starts 8 processes at first, but after they finish it proceeds in single-thread mode only (one sample at a time). However, there should be enough resources to keep all 8 cores busy.

The command we are using:

dadasnake -t 8 -l config.pacbioCCS_vm.yaml

with

big_data: false
dada:
  pool: false

so the main sub-workflow is dada.single.smk.

I've tried removing the resources section from the rules and decreasing NORMAL_MEM_EACH to 3G in VARIABLE_CONFIG, but neither helps. Could you please tell us where the problem might be?

With kind regards, Vladimir

a-h-b commented 3 years ago

Hi Vladimir - thanks for your question. I am not quite sure I understand the situation yet, so let me recap what I understood: you run dadasnake, giving it 8 threads, and at the step where it's supposed to make the ASVs you would expect it to run 8 samples at any time, but it only runs 1 sample? To be honest, if that's true, I don't really know why it's happening.

Just a few checks, to make sure we're on the same page: You have the latest dadasnake version? You haven't limited the number of threads in the config file or the VARIABLE_CONFIG? And you have the samples and runs defined in the sample table as described, so there are approx. 1000 samples and each belongs to one of the 10 runs?

If all that is correct, then my guess would be that there is some other rule running in addition to the 1 sample. Do you think you could run a test with 8 samples and post the output it prints here? In any case, the changes you've made to the memory resources will not have any effect on the execution of the rules; these settings are only passed on to a scheduler. Best wishes - Anna

vmikk commented 3 years ago

Hello Anna! Thank you for the fast response! Yes, you understood me correctly.

I'm using the latest dadasnake (git commit 8405a39)

I've limited the number of threads in VARIABLE_CONFIG to 12:

SNAKEMAKE_VIA_CONDA true
LOADING_MODULES 
SUBMIT_COMMAND  
SCHEDULER   uge
MAX_THREADS 12
BIGMEM_CORES    12  
BIGMEM_MEM_EACH 30G
NORMAL_MEM_EACH 8G
LOCK_SETTINGS   true    

In htop I see that there is only one active process running (dada_dadaReads.single.R). However, there are other sleeping Snakemake processes (green in the screenshot), which remain after their child process dada_dadaReads.single.R has finished.

[Screenshot: htop]

I didn't notice them at first because they are at the end of the htop list (no CPU activity).
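The same check can be scripted instead of read off htop. Below is a minimal Python sketch, assuming a Linux system where `ps` accepts the `stat` output column; the helper names `parse_ps` and `count_states` are made up for illustration and are not part of dadasnake or Snakemake.

```python
import subprocess
from collections import Counter


def parse_ps(ps_output, pattern):
    """Count process states for command lines that contain `pattern`.

    Expects the text printed by `ps -A -o pid,stat,pcpu,args`.
    Only the first letter of STAT is kept (e.g. R = running, S = sleeping).
    """
    states = Counter()
    for line in ps_output.strip().splitlines()[1:]:  # skip the header line
        fields = line.split(None, 3)  # pid, stat, %cpu, full command line
        if len(fields) == 4 and pattern in fields[3]:
            states[fields[1][0]] += 1
    return states


def count_states(pattern):
    """Run ps once and summarise the states of matching processes."""
    result = subprocess.run(
        ["ps", "-A", "-o", "pid,stat,pcpu,args"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ps(result.stdout, pattern)
```

For example, `count_states("snakemake")` returning mostly `S` while `count_states("dada_dadaReads.single.R")` shows a single `R` would match what htop displays here.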

The most puzzling part is that the previous steps (e.g., filtering) ran perfectly in parallel.

So it's probably not a dadasnake issue, but something related to Snakemake. I will try to figure out what's going on.

With kind regards, Vladimir

a-h-b commented 3 years ago

Dear Vladimir - yes, it looks like a snakemake problem. Please let me know if you find something. Best wishes - Anna

vmikk commented 3 years ago

Hello Anna!

It seems that this is an old and unsolved problem in Snakemake (e.g., mentioned on StackOverflow here). The reason is probably that Snakemake checks the successfully completed jobs before continuing to the next batch of jobs, and when there are a lot of tasks to be done, this phase can be quite slow.
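To illustrate why slow bookkeeping could have this effect, here is a toy Python model. It is not Snakemake's actual scheduler: it simply assumes a single scheduler that needs a fixed, serial `overhead` per job before it can launch that job, and all names and numbers are hypothetical.

```python
import heapq


def simulate(n_jobs, cores, job_time, overhead):
    """Toy model: return (makespan, average number of busy cores).

    Assumption (hypothetical): one scheduler spends `overhead` time units
    of serial work per job, and it only starts that work once it is idle
    and a core is free to take the job.
    """
    core_free = [0.0] * cores  # min-heap of times at which each core is free
    heapq.heapify(core_free)
    scheduler_free = 0.0
    makespan = 0.0
    for _ in range(n_jobs):
        earliest_core = heapq.heappop(core_free)
        start_dispatch = max(scheduler_free, earliest_core)
        start = start_dispatch + overhead  # job launches after bookkeeping
        scheduler_free = start
        end = start + job_time
        heapq.heappush(core_free, end)
        makespan = max(makespan, end)
    avg_busy = n_jobs * job_time / makespan
    return makespan, avg_busy


# Made-up numbers: 16 jobs of 10 time units each on 8 cores.
for overhead in (0, 1, 10):
    makespan, avg_busy = simulate(16, 8, 10, overhead)
    print(f"overhead={overhead}: makespan={makespan:.0f}, busy cores={avg_busy:.2f}")
```

With these made-up numbers, zero overhead keeps all 8 cores busy, whereas an overhead equal to the job duration leaves on average less than one core busy, no matter how many cores are allowed.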

I've also tried the updated Snakemake v5.30.1 (the changelog mentions that the scheduler was improved), but the problem remains.

With kind regards, Vladimir

a-h-b commented 3 years ago

Cool, thanks for checking this out, Vladimir!