fmalmeida / MpGAP

Multi-platform genome assembly pipeline for Illumina, Nanopore and PacBio reads
https://mpgap.readthedocs.io/en/latest/
GNU General Public License v3.0
53 stars 10 forks

problem with longreads_only assembly #59

Closed Guy2Horev closed 4 months ago

Guy2Horev commented 6 months ago

[intergalactic_knuth] Nextflow Workflow Report.pdf

Hi,

I am trying to assemble a plant genome (~800 Mb) from PacBio Revio reads.

Here is the command I use:

nextflow -bg run fmalmeida/mpgap --output _ASSEMBLY --max_cpus 20 --skeep_wtdbg2 --genome_size 800m --input MPGAP_samplesheet1.yml -profile docker

Here are the contents of the YAML file:

samplesheet:
  - id: sample_5
    pacbio: HMW_DNA_m84126_231020_112323_s3.hifi_reads.fastq.gz

The process started, but at some point I got error messages similar to the following for all the assemblers:

[Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 5; name: LONGREADS_ONLY:canu (sample_5); status: COMPLETED; exit: 1; error: -; workDir: /mnt/data/guyh/Trifolium/Revio/work/a8/a8499e26430751241cde25981ce53b]

A PDF version of the MpGAP report is attached.

Can you please advise?

Thank you in advance. Guy

fmalmeida commented 6 months ago

Hi @Guy2Horev, thanks for using the pipeline.

I think your problem might be related to memory, and to the fact that you have HiFi reads.

I will first suggest a command line using the current released version, v3.1.4, to assess whether the problem is indeed memory. If that does not work, I can advise you on another command line for the current dev branch, where I have been adding modifications so the pipeline works properly with HiFi reads, along with some other bug fixes. That way, if the released branch does not work, we can try the dev branch and at the same time check whether the next release is ready.

For that, here are a few pieces of advice and a few questions:

Testing case so I can also try: Can you point me to a comparable public dataset in NCBI that relates to your data, to serve as a good test case? That way I can try assembling it and assess the errors at the same time you try with your own dataset.

The skip parameters and assemblers to try: In your command line, you wrote skeep instead of skip. Since your data is HiFi, as per the read names, I would recommend running only canu and flye, which are the assemblers that have options for higher-quality reads. Thus I would recommend using:

--skip_wtdbg2 --skip_unicycler

Because you have HiFi reads: Version v3.1.4 of the pipeline unfortunately does not have options specific to HiFi reads; these are being added in the current dev branch. However, it does have options for "corrected" reads, which at least make canu and flye use better-suited assembly settings and skip read correction, which would negatively affect high-quality reads.

For that, in version v3.1.4, we can use --corrected_long_reads.

Modifying memory for bigger genomes: The memory defaults in the pipeline are conservative. You can see in the base.config file that the first attempt uses 20.GB of memory and runs for 24.h.
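The attempt-scaled pattern in base.config typically looks something like the sketch below; this is the common Nextflow idiom, and the exact labels, exit codes, and values in MpGAP's file may differ:

    // Sketch of an attempt-scaled resource block (assumed pattern,
    // not necessarily MpGAP's exact base.config contents)
    process {
        errorStrategy = { task.exitStatus in [137, 140] ? 'retry' : 'finish' }
        maxRetries    = 1

        withLabel:process_assembly {
            memory = { 20.GB * task.attempt }  // 20 GB first, 40 GB on retry
            time   = { 24.h  * task.attempt }
        }
    }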

One can modify it by passing on a custom config. For example, if you have a custom.config file with the following:

process {
    withLabel:process_assembly {
        cpus   = 20
        memory = '40 GB'
        time   = '72 h'
    }

    // Quast sometimes can take too long
    withName:quast {
        cpus   = 10
        memory = '20 GB'
        time   = '72 h'
    }
}

With this config, every assembly step may run for up to 72 hours, using 40 GB of memory and 20 CPUs. The same idea applies to Quast. You can adjust these values to whatever you think is feasible.

If you do not use this config, that is fine: on the second attempt, the pipeline will try to max out the execution using as much as you have allowed it with --max_cpus and --max_memory.

Finally, the command line: Your command line would look like this. Beware that the use of the custom config is optional.

nextflow \
    run fmalmeida/mpgap \
    -r v3.1.4 -latest \
    --output _ASSEMBLY \
    --max_cpus 20 \
    --skip_wtdbg2 \
    --skip_unicycler \
    --genome_size 800m \
    --corrected_long_reads \
    --input MPGAP_samplesheet1.yml \
    -profile docker \
    -c custom_config_for_resources.config # optional
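If a run fails partway and you then raise the resource limits, Nextflow's task cache means you usually do not have to redo completed steps: re-run the same command with the -resume flag appended. A sketch, reusing the file names from the command above:

    nextflow run fmalmeida/mpgap \
        -r v3.1.4 -latest \
        --output _ASSEMBLY \
        --input MPGAP_samplesheet1.yml \
        -profile docker \
        -c custom_config_for_resources.config \
        -resume   # reuse cached results from the previous run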

Please let me know how it goes, so we can properly assess whether there is a bug, or, for example, try the current dev branch, which already has some bug fixes and where I started testing some parameters for HiFi reads; it requires proper testing before release.

Let me know if you have a public dataset similar to yours which I can try.

Best, Felipe.

Guy2Horev commented 6 months ago

Hi Felipe,

Thank you very much for the detailed response. I tried to run the pipeline with your suggestions (I had to add --skip_raven too). I defined 24 CPUs and 100 GB of memory, but flye still fails with the following error:

[2024-01-01 08:52:21] DEBUG: Sorting k-mer index
[2024-01-01 08:52:34] root: ERROR: Looks like the system ran out of memory
[2024-01-01 08:52:34] root: ERROR: Command '['flye-modules', 'assemble', '--reads', '/mnt/data/guyh/Trifolium/Revio/work/75/275ec1e2e21737fab67ebcc97b2822/HMW_DNA_m84126_231020_112323_s3.hifi_reads.fastq.gz', '--out-asm', '/mnt/data/guyh/Trifolium/Revio/work/75/275ec1e2e21737fab67ebcc97b2822/flye/00-assembly/draft_assembly.fasta', '--config', '/opt/conda/envs/mpgap-3.1/lib/python3.6/site-packages/flye/config/bin_cfg/asm_corrected_reads.cfg', '--log', '/mnt/data/guyh/Trifolium/Revio/work/75/275ec1e2e21737fab67ebcc97b2822/flye/flye.log', '--threads', '24', '--genome-size', '800000000', '--min-ovlp', '10000']' died with <Signals.SIGKILL: 9>.

I am trying to run only canu to check if it works.

fmalmeida commented 6 months ago

Hi @Guy2Horev ,

The error message still complains about memory. I would recommend trying again with more memory, or trying the dev branch of the pipeline, where you can use a parameter for HiFi reads as described in another issue under testing, #52.

Please let me know, so we can get it to work and also, if necessary, properly test the mentioned dev branch for release 😄

Also, when increasing memory, it is good to keep the number of CPUs stable or low, so that you have more memory per thread. For example, 100 GB over 24 CPUs is only about 4 GB per thread, while 100 GB over 10 CPUs is 10 GB per thread.
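If flye is the only assembler hitting the limit, a process-specific override in the custom config is another thing to try. A sketch; the process name flye here is an assumption, so check the actual name in your execution log or trace file first:

    process {
        // Assumed process name; verify it against your execution log
        withName:flye {
            cpus   = 12        // fewer threads -> more memory per thread
            memory = '150 GB'
            time   = '96 h'
        }
    }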

fmalmeida commented 4 months ago

The latest release adds some new parameters that allow users to quickly modify the amount of memory used to start assemblies, select different BUSCO dbs, and flag long reads as corrected or high quality.

https://github.com/fmalmeida/MpGAP/releases/tag/v3.2.0

Hope it helps.

If error persists, we can open a new ticket for tackling it.