Closed: @Guy2Horev closed this issue 4 months ago.
Hi @Guy2Horev, thanks for using the pipeline.
I think your problem might be related to memory, combined with the fact that you have HiFi reads.
I will first suggest a command line using the current released version, v3.1.4, so we can assess whether the problem is indeed memory. If that does not work, I can advise another command line using the current dev branch, where I have been adding modifications so the pipeline properly handles HiFi reads, plus fixes for some other bugs. That way, if the released branch does not work, we can try the dev branch and check whether the next release is ready.
For that, here are a few pieces of advice and some questions:
Testing case so I can also try
Can you point me to a related public dataset in NCBI that would make a good test case for your data? Then I can try assembling it and assess errors at the same time you try with your own dataset.
The skip parameters and assemblers to try
In your command line you wrote skeep instead of skip. Since your data is HiFi (as per the read name), I would recommend running only canu and flye, which are the assemblers with options for higher-quality reads. Thus I would recommend using:
--skip_wtdbg2 --skip_unicycler
Because you have hifi
Version v3.1.4 of the pipeline unfortunately does not have options for HiFi reads; these are being added in the current dev branch. However, it does have options for "corrected" reads, which at least make canu and flye use better assembly settings and skip read correction, which would negatively affect high-quality reads. For that, in version v3.1.4, we can use --corrected_long_reads.
Modifying memory for bigger genomes
The pipeline's memory defaults are conservative. You can see in the base.config file that the first attempt uses 20.GB of memory and runs for 24.h. You can override this by passing a custom config. For example, a custom.config file with the following contents:
process {
    withLabel:process_assembly {
        cpus   = 20
        memory = '40 GB'
        time   = '72 h'
    }
    // Quast sometimes can take too long
    withName:quast {
        cpus   = 10
        memory = '20 GB'
        time   = '72 h'
    }
}
With this config, all assembly steps would be allowed to run for up to 72 hours, using 40 GB of memory and 20 CPUs; the same idea applies to Quast. You can adjust these values to whatever is feasible for you.
If you do not use this config, that is fine: on the second attempt, the pipeline will try to max out the execution using as much as you have allowed it via --max_cpus and --max_memory.
Finally, the command line
Your command line would look like the following. Note that the custom config is optional.
nextflow \
run fmalmeida/mpgap \
-r v3.1.4 -latest \
--output _ASSEMBLY \
--max_cpus 20 \
--skip_wtdbg2 \
--skip_unicycler \
--genome_size 800m \
--corrected_long_reads \
--input MPGAP_samplesheet1.yml \
-profile docker \
-c custom_config_for_resources.config # optional
Please let me know how it goes, so we can properly assess whether there is any bug, or whether we should try the current dev branch, which already has some bug fixes and where I have started testing parameters for HiFi reads (it requires proper testing before release).
Let me know if you have a public dataset similar to yours that I can try.
Best, Felipe.
Hi Felipe,
Thank you very much for the detailed response. I tried running the pipeline with your suggestions (I had to add --skip_raven too). I set 24 CPUs and 100 GB of memory, but flye still fails with the following error:
[2024-01-01 08:52:21] DEBUG: Sorting k-mer index
[2024-01-01 08:52:34] root: ERROR: Looks like the system ran out of memory
[2024-01-01 08:52:34] root: ERROR: Command '['flye-modules', 'assemble', '--reads', '/mnt/data/guyh/Trifolium/Revio/work/75/275ec1e2e21737fab67ebcc97b2822/HMW_DNA_m84126_231020_112323_s3.hifi_reads.fastq.gz', '--out-asm', '/mnt/data/guyh/Trifolium/Revio/work/75/275ec1e2e21737fab67ebcc97b2822/flye/00-assembly/draft_assembly.fasta', '--config', '/opt/conda/envs/mpgap-3.1/lib/python3.6/site-packages/flye/config/bin_cfg/asm_corrected_reads.cfg', '--log', '/mnt/data/guyh/Trifolium/Revio/work/75/275ec1e2e21737fab67ebcc97b2822/flye/flye.log', '--threads', '24', '--genome-size', '800000000', '--min-ovlp', '10000']' died with <Signals.SIGKILL: 9>.
I am trying to run only canu to check if it works.
Hi @Guy2Horev ,
The error message still complains about memory. I would recommend trying again with more memory, or trying the dev branch of the pipeline, where you can use a parameter for HiFi reads as described in another issue under testing: #52.
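The SIGKILL in the flye log is the typical signature of the Linux kernel's out-of-memory killer, and the kernel log records which process it killed. The sketch below filters for the relevant lines; the sample log line is hypothetical so the filter can be demonstrated offline, and the real check is the commented dmesg command (which assumes you can read the kernel log on that machine):

```shell
# Signal 9 (SIGKILL) with no traceback usually means the kernel OOM killer
# terminated the process. The sample line below is hypothetical, mimicking
# a typical kernel OOM message, so the filter can be shown without root access.
sample="Out of memory: Killed process 12345 (flye-modules) total-vm:104857600kB"
matches=$(printf '%s\n' "$sample" | grep -ciE 'out of memory|killed process')
echo "$matches"   # 1 matching line found
# On the real machine you would run instead:
#   dmesg -T | grep -iE 'out of memory|killed process' | tail -n 5
```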
Please let me know, so we can get it to work and also, if necessary, properly test this mentioned dev branch for release 😄
Also, when increasing memory, it is good to keep the number of CPUs stable or low, so that each thread gets more memory.
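The arithmetic behind "fewer CPUs means more memory per thread" can be sketched quickly. The 100 GB / 24 CPUs figures mirror the failed run in this thread; the 8-CPU retry value is an assumed example, not a recommendation from the pipeline docs:

```python
# With a fixed memory budget, every extra worker thread shrinks the
# per-thread share; lowering the CPU count raises it.
def memory_per_thread_gb(total_gb: float, cpus: int) -> float:
    """Even split of the job's memory budget across worker threads."""
    return total_gb / cpus

print(round(memory_per_thread_gb(100, 24), 2))  # ~4.17 GB per thread
print(round(memory_per_thread_gb(100, 8), 2))   # 12.5 GB per thread
```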
I added some new parameters in the latest release that let users quickly modify the amount of memory for the starting assembly processes, select different BUSCO databases, and state whether long reads are corrected or high quality.
https://github.com/fmalmeida/MpGAP/releases/tag/v3.2.0
Hope it helps.
If error persists, we can open a new ticket for tackling it.
[intergalactic_knuth] Nextflow Workflow Report.pdf
Hi,
I am trying to assemble a plant genome (~800 Mb) from PacBio Revio reads.
Here is the command I use:
nextflow -bg run fmalmeida/mpgap --output _ASSEMBLY --max_cpus 20 --skeep_wtdbg2 --genome_size 800m --input MPGAP_samplesheet1.yml -profile docker
Here are the contents of the yml file:
The process starts, but at some point I get error messages similar to the following for all the assemblers:
[Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 5; name: LONGREADS_ONLY:canu (sample_5); status: COMPLETED; exit: 1; error: -; workDir: /mnt/data/guyh/Trifolium/Revio/work/a8/a8499e26430751241cde25981ce53b]
A PDF version of the MpGAP report is attached.
Can you please advise?
Thank you in advance. Guy