a-ludi / dentist

Close assembly gaps using long-reads at high accuracy.
https://a-ludi.github.io/dentist/
MIT License

Getting started #12

Closed by A-J-F-Mackintosh 3 years ago

A-J-F-Mackintosh commented 3 years ago

Hi,

I am trying to use dentist for the first time but am having some trouble getting started. I am running dentist using singularity and have snakemake version 6.0.0 installed.

I downloaded the dentist.json and snakemake.yml files and edited them to include the relevant paths and also some options mentioned in the README (see below).

I first tried to validate the config files using the recommended command.

snakemake --configfile=snakemake.yml --use-singularity --cores=32 -f -- validate_dentist_config

INFO:    Convert SIF file to sandbox...
INFO:    Cleaning up image...
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 32
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1       validate_dentist_config
    1
Select jobs to execute...

[Mon Mar  8 15:52:21 2021]
localrule validate_dentist_config:
    input: dentist.json
    jobid: 0

INFO:    Using cached SIF image
INFO:    Convert SIF file to sandbox...
INFO:    Cleaning up image...
Job counts:
    count   jobs
    1       validate_dentist_config
    1
[Mon Mar  8 15:52:31 2021]
Finished job 0.
1 of 1 steps (100%) done
Complete log: /scratch/amackintosh/DENTIST_02/.snakemake/log/2021-03-08T155213.682050.snakemake.log

All seemed to work fine, so I then tried to run it.

snakemake --configfile=snakemake.yml --use-singularity --cores=32

INFO:    Convert SIF file to sandbox...
INFO:    Cleaning up image...
Building DAG of jobs...
MissingInputException in line 1091 of /scratch/amackintosh/DENTIST_02/Snakefile:
Missing input files for rule ref_vs_reads_alignment_block:
/scratch/amackintosh/DENTIST_02/.brenthis_ino.SP_BI_364.v1_1.contigs.dentist-self.data
/scratch/amackintosh/DENTIST_02/.brenthis_ino.SP_BI_364.v1_1.contigs.dust.anno
/scratch/amackintosh/DENTIST_02/.brenthis_ino.SP_BI_364.v1_1.contigs.dentist-self.anno
/scratch/amackintosh/DENTIST_02/.brenthis_ino.SP_BI_364.v1_1.contigs.tan.data
/scratch/amackintosh/DENTIST_02/.brenthis_ino.SP_BI_364.v1_1.contigs.tan.anno
scratch/amackintosh/DENTIST_02/.brenthis_ino.SP_BI_364.v1_1.contigs.dust.data

I am not used to using snakemake, but I assume the input files are missing because a preceding process could not be executed. Is it possible that the problem lies in how I filled out the json and yaml files? The part of the json I edited the most looks like this (below). Could any of these options be causing problems?

    "// This is a comment and will be ignored": [
    "You must set at least either `ploidy` and `read-coverage`",
    "or `max-coverage-reads` and `min-coverage-reads`."
    ],
    "__default__": {
        "read-coverage": 66.9,
        "min-reads-per-pile-up": 3,
        "min-spanning-reads": 3,
        "join-policy": "contigs",
        "ploidy": 2,
        "max-coverage-self": 3,
        "verbose": 2,
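
The comment at the top of the fragment states which coverage settings must be present. A minimal Python sketch of that rule, as a quick sanity check on the `__default__` block, might look like this. The function name `has_coverage_settings` is hypothetical, and this is not DENTIST's own validation (that is what `validate_dentist_config` is for).

```python
from typing import Mapping


def has_coverage_settings(defaults: Mapping[str, object]) -> bool:
    """Mirror the rule from the config comment: either `ploidy` and
    `read-coverage`, or `max-coverage-reads` and `min-coverage-reads`."""
    by_ploidy = "ploidy" in defaults and "read-coverage" in defaults
    by_bounds = (
        "max-coverage-reads" in defaults and "min-coverage-reads" in defaults
    )
    return by_ploidy or by_bounds


# The options quoted above, as a Python dict.
defaults = {
    "read-coverage": 66.9,
    "min-reads-per-pile-up": 3,
    "min-spanning-reads": 3,
    "join-policy": "contigs",
    "ploidy": 2,
    "max-coverage-self": 3,
    "verbose": 2,
}
print(has_coverage_settings(defaults))
```

By this rule the quoted options are sufficient, since both `ploidy` and `read-coverage` are set.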

Any help would be really appreciated.

Best,

Alex

a-ludi commented 3 years ago

Hi Alex,

please try removing all dots (.) from the name of the assembly file (brenthis_ino.SP_BI_364.v1_1.contigs.fasta). The dots are used as a separator in the hidden .anno and .data files and may confuse the workflow.

I will see if I can fix the workflow so it works with dots in the FASTA file names.
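
One way to get a dot-free name without touching the original file is a symlink. The following is only a sketch under assumptions: the helper names `dot_free_name` and `link_without_dots` are hypothetical, and replacing dots with underscores is just one possible naming choice. It drops the extension entirely, in the same way as the extension-less symlink names used later in this thread.

```python
from pathlib import Path


def dot_free_name(fasta_name: str) -> str:
    """Drop the final extension and replace any remaining dots with underscores."""
    return Path(fasta_name).stem.replace(".", "_")


def link_without_dots(source: Path, workdir: Path) -> Path:
    """Create a symlink to `source` inside `workdir` under a dot-free name."""
    target = workdir / dot_free_name(source.name)
    if not target.is_symlink() and not target.exists():
        target.symlink_to(source.resolve())
    return target
```

For the file in question, `dot_free_name("brenthis_ino.SP_BI_364.v1_1.contigs.fasta")` gives `brenthis_ino_SP_BI_364_v1_1_contigs`, which can then be entered as `reference` in snakemake.yml.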

Should you have more issues, you may try running the small example before continuing with your real data. If you need more help, please do not hesitate to ask here.

-- Arne

A-J-F-Mackintosh commented 3 years ago

Hi Arne,

Many thanks for the speedy reply.

I ran the example and it finished without any problems.

I then changed the paths (which are symlinks) in the snakemake.yml file so that they do not contain any dots.

inputs:
    # The reference assembly where gaps should be closed
    reference:          brenthis_ino_assembly
    # The set of long reads used for gap closing
    reads:              brenthis_ino_reads
    # Type of reads. Use `PACBIO_SMRT` or `OXFORD_NANOPORE`. See README for
    # more details on the subject.
    reads_type:         PACBIO_SMRT

outputs:
    # The gap-closed reference assembly
    output_assembly:    brenthis_ino_dentist_assembly

This produced a new error message from snakemake.

[Mon Mar  8 21:33:04 2021]
Error in rule reference2dam:
    jobid: 2
    output: /scratch/amackintosh/DENTIST_02/brenthis_ino_assembly.dam,
/scratch/amackintosh/DENTIST_02/.brenthis_ino_assembly.bps, /scratch/amackintosh/DENTIST_02/.brenthis_ino_assembly.hdr, 
/scratch/amackintosh/DENTIST_02/.brenthis_ino_assembly.idx
    shell:
        fasta2DAM /scratch/amackintosh/DENTIST_02/brenthis_ino_assembly.dam brenthis_ino_assembly && DBsplit -x1000 -a
-s200 /scratch/amackintosh/DENTIST_02/brenthis_ino_assembly.dam
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
[Mon Mar  8 21:33:04 2021]
Error in rule reads2db:
    jobid: 8
    output: /scratch/amackintosh/DENTIST_02/brenthis_ino_reads.db,
/scratch/amackintosh/DENTIST_02/.brenthis_ino_reads.bps, /scratch/amackintosh/DENTIST_02/.brenthis_ino_reads.idx
    shell:
        fasta2DB /scratch/amackintosh/DENTIST_02/brenthis_ino_reads.db brenthis_ino_reads && DBsplit -x1000 -a 
-s200 /scratch/amackintosh/DENTIST_02/brenthis_ino_reads.db
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /scratch/amackintosh/DENTIST_02/.snakemake/log/2021-03-08T213253.788128.snakemake.log

As I said before, I do not understand snakemake very well, so I am not sure exactly what the error message means. Is it a problem with fasta2DAM and fasta2DB?

Best,

Alex

a-ludi commented 3 years ago

Hmm, there is no specific error message in the log. Probably it was issued a bit earlier.

The easiest way of fixing things will probably be to rm -rf workdir. At this point we don't lose anything substantial.
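
When the Snakemake output is long, it can help to list the failed rules and any per-rule log files before scrolling back for the actual error. The sketch below assumes only the `Error in rule <name>:` and `log: <path>` line formats visible in the output quoted in this thread. The function names are hypothetical and nothing here is part of Snakemake itself.

```python
import re
from typing import List

# Patterns based on the Snakemake output quoted in this thread.
ERROR_HEADER = re.compile(r"^Error in rule (\S+):", re.MULTILINE)
LOG_LINE = re.compile(r"^[ \t]*log:[ \t]+(\S+)", re.MULTILINE)


def failed_rules(log_text: str) -> List[str]:
    """Return the names of all rules reported as failed, in order."""
    return ERROR_HEADER.findall(log_text)


def rule_log_files(log_text: str) -> List[str]:
    """Return the per-rule log file paths mentioned in the output."""
    return LOG_LINE.findall(log_text)
```

Reading the Snakemake log file into a string and passing it to both functions gives a short list of places to look first.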

A-J-F-Mackintosh commented 3 years ago

Hi,

I managed to fix the above issues. One problem was that the sequences in the assembly.fasta must be multi-line rather than single-line. The other problem was that the gzipped reads cannot be read without .gz in the filename, but the .gz causes issues because of the extra dot, so I had to unzip them.
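
Both fixes can be scripted. What follows is a rough sketch, not part of DENTIST: the helper names are hypothetical, and the line width of 80 is an assumption, since the thread only says the sequences must be multi-line.

```python
import gzip
import shutil
from pathlib import Path
from typing import Iterable, Iterator


def _chunks(sequence: str, width: int) -> Iterator[str]:
    """Yield consecutive slices of `sequence` that are at most `width` long."""
    for start in range(0, len(sequence), width):
        yield sequence[start:start + width]


def wrap_fasta(lines: Iterable[str], width: int = 80) -> Iterator[str]:
    """Yield FASTA lines with every sequence re-wrapped to `width` columns."""
    pieces = []  # sequence fragments of the current record
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            # A new header: emit the previous record's sequence first.
            yield from _chunks("".join(pieces), width)
            pieces = []
            yield line
        elif line:
            pieces.append(line)
    yield from _chunks("".join(pieces), width)


def gunzip_to(source: Path, target: Path) -> Path:
    """Decompress a gzipped file to `target` (for example a dot-free name)."""
    with gzip.open(source, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

`wrap_fasta` reads line by line, so it can be fed an open file handle and its output written straight to a new file. Note that it holds one full sequence in memory at a time.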

I have now managed to run dentist successfully with a small subset of the reads (<1%).

I then tried to run dentist with the whole read set; however, this causes damapper to error. The error persists when using either docker://aludi/dentist:v1.0.1 or docker://aludi/dentist:stable.

Error in rule ref_vs_reads_alignment_block:
    jobid: 415
    output: /scratch/amackintosh/DENTIST_02/brenthis_ino_assembly_wrapped.brenthis_ino_reads.114.las,
    /scratch/amackintosh/DENTIST_02/brenthis_ino_reads.114.brenthis_ino_assembly_wrapped.las
    log: /scratch/amackintosh/DENTIST_02/ref-vs-reads-alignment.114.log (check log file(s) for error message)
    shell:

        {
        cd /scratch/amackintosh/DENTIST_02/
        damapper -C '-T32' -e0.7 -mdust -mdentist-self -mtan brenthis_ino_assembly_wrapped brenthis_ino_reads.114
        LAcheck -v brenthis_ino_assembly_wrapped brenthis_ino_reads
        brenthis_ino_assembly_wrapped.brenthis_ino_reads.114.las || { echo 'Check failed. Possible solutions:

Duplicate LAs: can be fixed by LAsort from 2020-03-22 or later.

In order to ignore checks entirely you may use the environment variable SKIP_LACHECK=1. Use only if you are positive the
files are in fact OK!'; (( ${SKIP_LACHECK:-0} != 0 )); }
    LAcheck -v brenthis_ino_reads brenthis_ino_assembly_wrapped brenthis_ino_reads.114.brenthis_ino_assembly_wrapped.las ||
    { echo 'Check failed. Possible solutions:
Duplicate LAs: can be fixed by LAsort from 2020-03-22 or later.

In order to ignore checks entirely you may use the environment variable SKIP_LACHECK=1. Use only if you are positive the
files are in fact OK!'; (( ${SKIP_LACHECK:-0} != 0 )); }
    } &> /scratch/amackintosh/DENTIST_02/ref-vs-reads-alignment.114.log

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

It looks like I could either change my damapper/LAsort version (not sure how), or pass the environment variable SKIP_LACHECK=1 (through the snakemake.yml?). What would you recommend?

Best,

Alex

a-ludi commented 3 years ago

I am glad that you could solve the issues. I will definitely try to build in some checks and better handling for gzipped input files.

Regarding your last error: please try with SKIP_LACHECK=1 by passing it like this:

SKIP_LACHECK=1 snakemake --configfile=snakemake.yml --use-singularity --cores=32

It is most likely the source of the error, though I cannot tell for sure because the log does not contain the error message itself, just the command that would issue the message. I will also try to remove the error message from the shell command so as to avoid confusion.
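
If the workflow is started from a script rather than an interactive shell, the same variable can be set in the child process environment. This is a minimal sketch using Python's standard `subprocess` module. The function names are hypothetical, and the command-line flags are the ones already used in this thread.

```python
import os
import subprocess
from typing import Dict, List, Optional


def snakemake_command(configfile: str = "snakemake.yml", cores: int = 32) -> List[str]:
    """Build the snakemake invocation used in this thread."""
    return [
        "snakemake",
        f"--configfile={configfile}",
        "--use-singularity",
        f"--cores={cores}",
    ]


def env_with_skip_lacheck(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and set SKIP_LACHECK=1 in the copy."""
    env = dict(os.environ if base is None else base)
    env["SKIP_LACHECK"] = "1"
    return env


def run_snakemake() -> None:
    """Run snakemake with SKIP_LACHECK=1; raises if it exits non-zero."""
    subprocess.run(snakemake_command(), env=env_with_skip_lacheck(), check=True)
```

Calling `run_snakemake()` from the workflow directory is intended to behave like the one-line shell command above. As with that command, the variable should only be set when the alignment files are known to be fine.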

A-J-F-Mackintosh commented 3 years ago

Hi,

SKIP_LACHECK=1 allowed the analysis to complete without any problems, many thanks.

I will now start playing around with parameters to see how dentist can improve my assembly!

Thanks again for all your help,

Alex