Hi @AGI-chandler , thanks for opening the issue. I've never seen that issue, and I suspect it's a behaviour change in the new version of Nextflow. Can you try downgrading Nextflow to version 22.04 and see if this still happens?
And yes, if you're using Docker, `-profile docker` should be in the Nextflow command. It's not really needed except in `nextflow run`. I'll make it clearer in the documentation in the future. Thanks.
@AGI-chandler Have a look into the nextflow.config file. I think this is because every time the nextflow pipeline is called, the program will try to create a new folder named report_results, and since you had run the pipeline the first time with the `--download_db` parameter, this folder already exists.
A simple workaround is to delete or rename this report_results folder, and your subsequent command with the `--help` parameter should work.
@proteinosome It seems the pipeline creates a few output folders: one that stores the status of the Nextflow runs, another that stores the actual results, and a work folder that stores the intermediates. I think there is a possibility to streamline this output process.
Thanks @kevlim83, yes, indeed, that's what the error suggests. However, with my version of Nextflow (22.04), I actually can't reproduce it. The `nextflow run main.nf --help` command on my end does not create a report folder, so I think it's a recent change in behaviour in Nextflow. Hence my suggestion for @AGI-chandler to try an older version.
The Nextflow output structure is such that the results folder contains symlinks into the work folder. There's a way to make real copies instead of symlinks, but then you end up with duplicates in the "work" folder. Of course, I can also make it delete the work folder after everything is completed, and that's something I may implement in the future, but at the beta stage right now I prefer to keep it that way for troubleshooting purposes. Finally, the "report" folder contains timings and resource usage of the steps, not actual results.
I hear you, though, and will think about how best to make it more streamlined.
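For context, the copy-vs-symlink behaviour comes from Nextflow's publishDir directive. A minimal sketch, assuming a process that publishes into params.outdir (the process name here is made up, not the pipeline's actual code):

process example_step {
  // default mode 'symlink': files in the results folder point into work/
  publishDir params.outdir, mode: 'symlink'
  // switching to mode: 'copy' would write real files instead,
  // duplicating whatever already sits in work/

  output:
  path "out.txt"

  script:
  """
  echo done > out.txt
  """
}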
Thanks @proteinosome and @kevlim83
@proteinosome that old version of Nextflow only supports up to Java 17, but we have 19 installed with the `java-latest-openjdk-headless-fastdebug` package, so I don't really want to roll back so many packages.
It sounds like this folder creation is not such a big deal and I can delete it. After doing that, `--help` runs fine. I could also set the `dag.overwrite` option that is suggested; does that go in `nextflow.config` inside the `dag` settings:
dag {
  enabled = true
  file = "report_$params.outdir/dag.html"
  overwrite = true  // inside the dag { } scope the option is just 'overwrite'
}
?
I will move on to testing then, with our Slurm. I updated the `process` settings:
process {
  executor = 'slurm'
  queue = 'defq'
}
but what else should be updated? Suppose we want to utilize our entire cluster, which has 5 nodes totaling 1152 CPUs and 4.5 TB of memory. Should I update any of the other settings?
Also, this is a new and interesting way to run something... we have some real data to run after testing is complete. `nextflow` is in my path already, so I can run it from our project directory, for example. Does Nextflow have a path too, so it could find your `main.nf` when I run `nextflow run main.nf ...`? Or do I need to specify something like `nextflow run ~/.local/git/pb-16S-nf/main.nf ...`? Or do I need to run `nextflow` in the `pb-16S-nf` git dir and then specify the other options with our project dir, e.g. `nextflow run main.nf --input <projDir>/sample.tsv --metadata <projDir>/metadata.tsv -profile docker --outdir <projDir>/results`?
Thanks
Hi @AGI-chandler , gotcha. If that's not an issue for you then it should be fine. I'll make a note in the next release about this.
To your question: you have to specify the path of `main.nf`, so `nextflow run ~/.local/git/pb-16S-nf/main.nf ...` is the correct way to use it. You don't need to run the workflow inside the repo directory; just run it in your project directory. By default, Nextflow uses the `nextflow.config` file in the repo directory, but if you copy that file and put it in your project directory, it'll override the one in the repo directory. See the priority of the config here: https://www.nextflow.io/docs/latest/config.html
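For example, a minimal project-local nextflow.config might look like this (a sketch based on your Slurm settings above; the outdir line is just an illustration, and command-line flags still take priority):

// project-local nextflow.config: settings here override
// the config shipped in the repo directory
process {
  executor = 'slurm'
  queue = 'defq'
}
params {
  outdir = 'results'  // hypothetical default; --outdir on the command line wins
}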
As for maximizing the cores: Nextflow handles the submission of each job according to the workflow, so e.g. if you have 99 samples, it'll submit 99 concurrent demultiplexing jobs. Some of the steps that make use of merged channels (or merged inputs, if that's more intuitive for you) will only run as a single job, and you can set the CPUs for that depending on how many CPUs can be allocated to any one job on your cluster. E.g. the VSEARCH CPUs can be set with `--vsearch_cpu` (see help). Right now, the DADA2 denoise step (`--dada2_cpu`) is the bottleneck step, since it uses all the samples for denoising to maximize sensitivity, so that is where I would advise using as much CPU as you can. The other steps are pretty quick, actually. I'm working on a version update that allows you to group the samples into similar types and denoise each group separately to speed it up, but that might take a while.
In `nextflow.config` you can also find a block:
process {
  withLabel: cpu_def {
    cpus = 4
    memory = 16.GB
  }
  withLabel: cpu8 {
    cpus = 8
    memory = 32.GB
  }
  withLabel: cpu32 {
    cpus = 32
    memory = 128.GB
  }
}
These are the default CPUs used by the different processes defined in `main.nf`, and you can increase them by changing e.g. `cpus = 32` to `cpus = 64` (don't change the label names, as `main.nf` refers to the processes via these labels).
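To illustrate how the two files connect, a process in `main.nf` picks those values up by declaring the label; a simplified sketch with a made-up process name, not the pipeline's exact code:

process example_qc {
  label 'cpu8'  // pulls cpus = 8 / memory = 32.GB from the withLabel: cpu8 block

  script:
  """
  echo "running with ${task.cpus} CPUs"
  """
}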
Hope that's clear, let me know if you run into any other trouble. Thanks!
Well, we don't have that many samples! At least for our first run: only 15, since we had trouble with 1. I guess I'm still not clear on how the jobs for different steps are executed. For example, with SMRT Link I think there are basically 2 settings, nprocs and chunks or something, and we have nprocs=16 and chunks=72. So when a CCS job is run, for example, it'll get broken into 72 chunks (which are the subjobs submitted to Slurm), each one using 16 CPUs, which uses all the resources possible.
The config here is different, though... when you say "how much CPU can be allocated to any job on your cluster", are you talking about a single job submitted to Slurm? In that case I believe it's 256, since that's the max CPUs per node. So would `--vsearch_cpu 256` make sense then? And the same with `--dada2_cpu 256`? But if only 1 job is running, then that means the other 1152-256=896 CPUs are not being used? Or do multiple VSEARCH and DADA2 subjobs get launched? If each one uses 256 CPUs, then only 4 subjobs could run together, and since we limit our 5th node to 128 CPUs, none of its CPUs would get used. So would `--vsearch_cpu 128` and `--dada2_cpu 128` make more sense? Then 9 subjobs could run together and all our resources would be used.
Likewise, I'm still not clear whether I should increase the `cpu32` label to `cpus = 256`, or more, or less, but I assume I should scale `cpu_def` and `cpu8` by the same factor, meaning `cpus = 32` and `cpus = 64`, respectively? I'm guessing `memory` should also be adjusted accordingly?
Sorry for all the questions! I'll probably disappear though once we get it tuned right... Thanks
Hi @AGI-chandler , there's no "chunking" implemented in this pipeline, only the number of CPUs for each job.
Nextflow distributes work on a per-job basis. E.g. if you have 2 samples and they both need to be demultiplexed, Nextflow will submit 2 jobs in parallel, one to demultiplex each of the two samples. Say you now have a next step that requires merging the 2 demultiplexed samples: it will have to wait until the demux jobs finish, then it'll submit a new job that takes the outputs of those two demux jobs as input.
The number of CPUs for each job is controlled via the config file or a command-line parameter. When you specify `--dada2_cpu 256`, the DADA2 job will request 256 CPUs. All the other jobs are controlled by the number of CPUs set in the label section I mentioned, so some of the jobs will use 8 CPUs by default, some 16, and some 32.
And yes, since there's no chunking, you will not be able to maximize your cluster CPUs. To your question on whether using 128 CPUs would make more sense: it probably doesn't help, because the VSEARCH and DADA2 steps only run a single job at any time. So even if you submit a set of 384 samples, when it reaches the DADA2 step there will only be one job analyzing all 384 samples. This is something I'm working to optimize, but even then, you should be able to analyze most data within a reasonable time with the default CPU allocation. On the GitHub main page you will see some benchmark timings I put down for your reference.
Yes, scaling the CPUs by the same factor should work well. And adjust the memory accordingly. I usually like 4 GB per CPU, but I really have not tested whether lower would be fine.
Just give it a shot with a test run and see what timing you get, then maybe double everything to see if it helps? Honestly, most steps are very, very fast, so I usually only adjust `--dada2_cpu` and leave the rest at the default. That is the bottleneck step.
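If you'd rather not pass it on every run, the same parameter can also be pinned in nextflow.config; a sketch (64 is just an example value, and the command-line flag still overrides it):

params {
  dada2_cpu = 64  // assumed bump over the default; --dada2_cpu on the CLI wins
}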
Ok, thanks. Well, might as well try maximizing the usage as much as possible. I'll try with this in `nextflow.config`:
process {
  withLabel: cpu_def {
    cpus = 32
    memory = 128.GB
  }
  withLabel: cpu8 {
    cpus = 64
    memory = 256.GB
  }
  withLabel: cpu32 {
    cpus = 256
    memory = 1024.GB
  }
}
and with 256 CPUs for the 3 command-line options.
$ nextflow run main.nf --input test_data/test_sample.tsv --metadata test_data/test_metadata.tsv --dada2_cpu 256 --vsearch_cpu 256 --cutadapt_cpu 256 -profile docker --outdir results
Unfortunately, I can't get the test completed due to Docker errors. First, I did not have Docker installed on the compute nodes. Then, I did not have the docker-rootless scripts installed on the compute nodes. But I'm still getting this error: `docker: Cannot connect to the Docker daemon at unix:///run/user/10063/docker.sock. Is the docker daemon running?`
YES, isn't that what the rootless scripts are for? `systemctl --user status docker` returns green marks and `active (running)` on the head node and all compute nodes! Plus, the main system dockerd is running on the head node and all compute nodes!
I attached the full output of the above command. This time it suggests checking `.command.out` (it's empty), but in the past it has suggested checking `.command.run` or `.command.sh` too (are these just random suggestions?). Either way, I checked all the `.command*` files and didn't notice anything yet that might resolve this... I've been trying to get it to run for a couple of hours now. I'll keep working on it, but maybe you have some insights?
Well, I'm still stumped and have been working on this most of the day. Unfortunately, it's pretty important that we get this working ASAP.
Does dockerd need to be running on the compute nodes? This seems to cause some interference, because I couldn't even `docker run hello-world` from the head node until I ran `systemctl --user restart docker`. Then `hello-world` works, but `pb-16S-nf` is still failing, this time on the `pb16S:QC_fastq` command; output below. Does that mean `pb16S:QC_fastq` and others are being run on a compute node? Maybe there are some special Docker config steps I need to take so they can all run together on the cluster?
Error executing process > 'pb16S:QC_fastq (1)'
Caused by:
Process `pb16S:QC_fastq (1)` terminated with an error exit status (125)
Command executed:
seqkit fx2tab -j 64 -q --gc -l -H -n -i test_1000_reads.fastq.gz | csvtk mutate2 -C '%' -t -n sample -e '"test_data"' > test_data.seqkit.readstats.tsv
seqkit stats -T -j 64 -a test_1000_reads.fastq.gz | csvtk mutate2 -C '%' -t -n sample -e '"test_data"' > test_data.seqkit.summarystats.tsv
seqkit seq -j 64 --min-qual 20 test_1000_reads.fastq.gz --out-file test_data.filterQ20.fastq.gz
echo -e "test_data "$PWD"/test_data.filterQ20.fastq.gz" >> test_data_filtered.tsv
Command exit status:
125
Command output:
(empty)
Command error:
docker: Cannot connect to the Docker daemon at unix:///run/user/10063/docker.sock. Is the docker daemon running?.
See 'docker run --help'.
Work dir:
~/.local/src/pb-16S-nf/work/31/6a3d74907f3ba8e291049c340ff490
Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
Hi @AGI-chandler , yes, `dockerd` needs to run on the compute nodes, because Nextflow uses Docker to run the jobs on the compute nodes.
Ok, thanks, that's part of the problem. I figured out how to run `dockerd` in the foreground, and for some reason it starts fine on the head node and 1 compute node but not on the 3 remaining compute nodes. Strange, because the compute nodes are all booted from a single image... so I'll continue to troubleshoot that and possibly post on the Docker forums if I get stuck. Thanks for your patience!
@proteinosome Here are my latest observations. So far, I'm under the impression `dockerd` can't run in a clustered environment, but this is what I'll be looking into next... have you ever run this software on a cluster before? Maybe we should be using one of the other installation options? Like just the normal way, without Docker or Singularity?
Docker swarm mode? Maybe that's what we need. Looking into this...
Docker swarm doesn't seem right either... it seems like a replacement for Slurm, so not right for this application.
I thought Docker was a new and improved way of running apps, but it seems to just complicate things in our case.
I went ahead and just removed `-profile docker` from the command and let it use conda...
...and after getting conda set up and configured on the compute nodes... this worked! ✅ Pipeline Completed with the test data.
Duration : 9m 46s
CPU hours : 24.0
Succeeded : 20
I did get these warnings, though. Which log file is it talking about? I couldn't find any...
WARN: Failed to render execution report -- see the log file for details
WARN: Failed to render execution timeline -- see the log file for details
Now to test with our first real data...
Hi @AGI-chandler , we have Slurm clusters running the Docker daemon on all nodes at two internal sites. All I did was install Docker on all nodes and get the Docker daemon running on all of them, and that's about it. Docker and Singularity are both very popular, and many HPCs run either one of them.
As for the warnings about the execution report and timeline, I suspect it again has something to do with a behaviour change in a newer version of Nextflow. In the run directory there's usually a hidden `.nextflow.log`; maybe look into that and see if you can find any error messages. Nonetheless, those are just post-run reports and will not affect the results, so I wouldn't worry too much about it.
I see... then it must be a bug/limitation of the Docker rootless installation. I'm not sure why that installation method was chosen, since it's been some years now... but I think it was so users could run Docker apps themselves, since I can't be involved every time a user wants to run an app. I guess your app is a different use of Docker; I'm not yet aware of all the different ways Docker can be used. Supposing I could uninstall Docker rootless and go back to standard Docker root mode, what are the advantages of using that over the default conda execution of pb-16S-nf that I've got working?
Thanks, yes, I see the hidden `.nextflow.log`... the warnings were due to the reports already existing, probably from previous failed runs, so nothing to worry about.
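For future reference, it looks like these reports have overwrite switches that can be set in nextflow.config; a sketch (I believe availability depends on the Nextflow version):

report {
  overwrite = true  // re-render the execution report over an existing file
}
timeline {
  overwrite = true  // likewise for the timeline HTML
}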
You might be right. The rootless installation had some limitations when I tried it a long time ago, so I gave up and just installed it with root. I think in some HPCs there's always concern about the security of allowing users to run Docker installed via root, so that might be the reason your cluster has rootless installed?
Either way, to your question: no, there's no difference in output between the `docker` mode and the `conda` mode. In fact, I provided these different options precisely because different users run into different issues. I've had many users who ran into issues with Conda (see the other open issue, for example), and users like you who have difficulty getting the Docker mode to run. The Docker container contains the same environment and software versions provisioned from the Conda environment `yml` file, so you can choose whichever works for you.
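Conceptually, the two modes are just alternative profiles in nextflow.config; a simplified sketch of how such profiles are typically declared (the pipeline's actual definitions may differ, and conda.enabled requires a fairly recent Nextflow):

profiles {
  docker {
    docker.enabled = true  // run each process inside the Docker container
  }
  conda {
    conda.enabled = true   // build and use the conda environment instead
  }
}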
The pipeline was successful with our first set of real data! 🎉 🥳
Parameters set for pb-16S-nf pipeline for PacBio HiFi 16S
=========================================================
Number of samples in samples TSV: 15
Filter input reads above Q: 20
Trim primers with cutadapt: Yes
Forward primer: AGRGTTYGATYMTGGCTCAG
Reverse primer: AAGTCGTAACAAGGTARCY
Minimum amplicon length filtered in DADA2: 1000
Maximum amplicon length filtered in DADA2: 1600
maxEE parameter for DADA2 filterAndTrim: 2
minQ parameter for DADA2 filterAndTrim: 0
Pooling method for DADA2 denoise process: pseudo
Minimum number of samples required to keep any ASV: 1
Minimum number of reads required to keep any ASV: 5
Taxonomy sequence database for VSEARCH: pb-16S-nf/databases/GTDB_ssu_all_r207.qza
Taxonomy annotation database for VSEARCH: pb-16S-nf/databases/GTDB_ssu_all_r207.taxonomy.qza
Skip Naive Bayes classification: false
SILVA database for Naive Bayes classifier: pb-16S-nf/databases/silva_nr99_v138.1_wSpecies_train_set.fa.gz
GTDB database for Naive Bayes classifier: pb-16S-nf/databases/GTDB_bac120_arc53_ssu_r207_fullTaxo.fa.gz
RefSeq + RDP database for Naive Bayes classifier: pb-16S-nf/databases/RefSeq_16S_6-11-20_RDPv16_fullTaxo.fa.gz
VSEARCH maxreject: 100
VSEARCH maxaccept: 100
VSEARCH perc-identity: 0.97
QIIME 2 rarefaction curve sampling depth: null
Number of threads specified for cutadapt: 256
Number of threads specified for DADA2: 256
Number of threads specified for VSEARCH: 256
Script location for HTML report generation: pb-16S-nf/scripts/visualize_biom.Rmd
Container enabled via docker/singularity: false
Version of Nextflow pipeline: 0.4
Time 1d 6h 20m 21s
CPU-Hours 6,478.5
Will close the issue then.
Congrats! That takes really long for just 15 samples, though. Are these environmental samples? Do you know on average how many reads there are per sample (these should be in the final HTML report in the results folder)? Those are the ones that usually cause very long run times. I am wondering if using 256 CPUs somehow caused one of the steps to use too much memory and it started swapping. If you have some time, perhaps try the same set of samples with 32/64 CPUs and see if that changes anything.
Not sure what you mean by "environmental samples", but I'll find out.
As far as the reads go, here are the stats:
To be honest, I had the pipeline running in the foreground for about 4 hours, I want to say, and then my desktop session crashed, which disconnected me from the server and killed the pipeline... it was running the DADA2 step. I'm not sure if that gets resumed in the middle or if it started from the beginning of that step again, but that might somewhat explain the long run time.
The results don't indicate much memory was used for that step, and there is 1 TiB available... dada2_denoise median physical memory usage = 32G, median virtual memory usage = 307G.
Hello, @AGI-chandler and I are working on this dataset. Thank you both for getting the pipeline to run on our cluster!
The data we ran was from a single SMRT Cell 8M, which yielded 5.7 million reads (8.3 Gb). There were 15 samples from 4 different projects, 3 or 4 samples each. Two projects were from environmental samples (soil and some kind of swamp, I think); the other two may have been human microbiomes, but I am not sure. Do you think that when we have e.g. 96 samples in one SMRT Cell (so fewer reads per sample) the DADA2 step will take less time? This run was just to test infrastructure and software and to see the type of output. Next we will run the 4 projects separately. Thanks, Dario
I've had success with the installation up to this point. Nextflow-edge was installed by simply downloading and running the script, since we have the newer Java 19 on the system.
`nextflow run main.nf --download_db -profile docker` was successful, but then `nextflow run main.nf --help` was not, as you can see below. I tried adding `-profile docker` too, since I think I'm supposed to use that in every command now, right? The instructions were unclear. Either way, the error was the same.