a-ludi / dentist

Close assembly gaps using long-reads at high accuracy.
https://a-ludi.github.io/dentist/
MIT License
47 stars 6 forks source link

Setting dentist parameters #22

Open Adamtaranto opened 3 years ago

Adamtaranto commented 3 years ago

When setting the parameters below, do these need to be included in the dentist.json config file? and if so in which section?

--max-insertion-error --min-anchor-length --min-reads-per-pile-up --min-spanning-reads --allow-single-reads --join-policy

a-ludi commented 3 years ago

Hi,

yes, dentist.json is the place. You can place them all under __default__ which means that they apply to every stage. Please remove the leading dashes (--), e.g.:

{
    "__default__":  {
        "max-insertion-error": 0.15,
        "min-anchor-length": 1500,
        "min-reads-per-pile-up": 1,
        "min-spanning-reads": 1,
        "allow-single-reads": true,
        "join-policy": "contigs"
    }
}

It seems like you want to create a more greedy configuration. Please be aware that you should not further reduce min-anchor-length as it is already rather sensitive to repeat masking errors – it may even lead to reduced number of closed gaps because more scaffolding conflicts appear. Finally, here is an example for a greedy configuration I am intending to put into the next release:

{
    "// WARNING": [
        "Use with care!",
        "Always validate the closed gaps (e.g. manual inspection)."
    ],
    "// NOTE": [
        "The workflow creates an intermediate assembly",
        "`workdir/{output_assembly}-preliminary.fasta` that contains all",
        "closed gaps, i.e. before validation. It is accompanied by an AGP",
        "and BED file. You may inspect these file for maximum sensitivity."
    ],
    "__default__": {
        "verbose": 2,
        "agp": "true",
        "allow-single-reads": true,
        "best-pile-up-margin": 1.5,
        "existing-gap-bonus": 3.0,
        "join-policy": "contigs",
        "min-reads-per-pile-up": 1,
        "min-spanning-reads": 1,
        "proper-alignment-allowance": 500
    },
    "// Uncomment the following block if": [
        "the default value for min-coverage-reads",
        "(0.5 * --read-coverage/--ploidy) is suboptimal. It is important to",
        "revert --read-coverage and --ploidy because they are mutually exclusive",
        "with --min-coverage-reads."
    ]
}
a-ludi commented 3 years ago

@Adamtaranto Did this help? BTW you should consider updating DENTIST to v2.0.0. :wink:

oushujun commented 2 years ago

hello @a-ludi,

I am trying to run DENTIST with the greedy mode using the example files. I used the dentist.greedy.yml configuration to replace the original config:

cat dentist.greedy.yml
__default__:
    verbose: 2
    allow-single-reads: true
    best-pile-up-margin: 1.5
    existing-gap-bonus: 3.0
    join-policy: contigs
    min-reads-per-pile-up: 1
    min-spanning-reads: 1
    proper-alignment-allowance: 500

This is the snakemake config:

cat snakemake.yml
full_validation: true
dentist_container: dentist_3.0.0.sif
dentist_env: envs/dentist_v3.yml
dentist_config:         dentist.greedy.yml
inputs:
    reference:          reference.fasta
    reads:              reads.fasta
    reads_type:         PACBIO_SMRT
outputs:
    output_assembly:    gap-closed.fasta
reference_dbsplit:
    - -x1000
    - -a
    - -s50
reads_dbsplit:
    - -x1000
    - -a
    - -s50
workdir:            workdir_greedy_test
logdir:             logs
threads_per_process:  20
propagate_batch_size: 14
batch_size:         50
validation_blocks: 2

I then run it under the local mode but encounter the following errors:

PATH="$PWD/bin:$PATH" snakemake --configfile=snakemake.yml --cores=all
Exception in line 434 of /home/sou6/bin/dentist.v3.0.0.x86_64/dentist-example/Snakefile:
must specify either --read-coverage or --max-coverage-reads for command `mask-repetitive-regions`; must specify either --read-coverage and --ploidy or --min-coverage-reads for command `validate-regions
  File "/home/sou6/bin/dentist.v3.0.0.x86_64/dentist-example/Snakefile", line 796, in <module>
  File "/home/sou6/bin/dentist.v3.0.0.x86_64/dentist-example/Snakefile", line 434, in full_validate_dentist_config

Can you give me some help? Thank you very much!

Best, Shujun

a-ludi commented 2 years ago

Hi Shujun,

just add a line with the proper read coverage below __default__ to dentist.greedy.yml:

    read-coverage: {coverage}

See https://github.com/a-ludi/dentist#how-to-choose-dentist-parameters for more details.

Cheers,

Arne

oushujun commented 2 years ago

Dear Arne @a-ludi,

Thank you for your suggestion. I added read-coverage: 1 and encountered the following errors:

$ PATH="$PWD/bin:$PATH" snakemake --configfile=snakemake.yml --cores=all Exception in line 434 of /data/mschatz1/oushujun/dentist-example/Snakefile: must specify either --read-coverage and --ploidy or --min-coverage-reads for command `validate-regions File "/data/mschatz1/oushujun/dentist-example/Snakefile", line 796, in File "/data/mschatz1/oushujun/dentist-example/Snakefile", line 434, in full_validate_dentist_config

I then added min-coverage-reads: 1 to dentist.greedy.yml and encountered the following errors:

Building DAG of jobs... MissingInputException in line 1133 of /data/mschatz1/oushujun/dentist-example/Snakefile: Missing input files for rule ref_vs_reads_alignment_block: output: workdir/M82.bac.210614.mod.m82_ont_polished.1.las, workdir/m82_ont_polished.1.M82.bac.210614.mod.las wildcards: block_reads=1 affected files: workdir/.M82.bac.210614.mod.dentist-self.anno workdir/.M82.bac.210614.mod.dentist-self.data workdir/.M82.bac.210614.mod.dust.data workdir/.M82.bac.210614.mod.tan.anno workdir/.M82.bac.210614.mod.dust.anno workdir/.M82.bac.210614.mod.tan.data

Do you have any ideas to move forward?

Thank you, Shujun

a-ludi commented 2 years ago

Hi Shujun,

I forgot you need to add ploidy: 2 , too. Typically, this is 2 please adjust to your genome assembly. Please remove min-coverage-reads to avoid further error messages.

Cheers, Arne

oushujun commented 2 years ago

HI Arne,

Thanks for your instant response. I removed min-coverage-reads and added ploidy: 2, it has the exact same errors as the one with min-coverage-reads. Did the configure file skip any critical steps of dentist?

Best, Shujun

a-ludi commented 2 years ago

Please try starting over from the beginning by removing all outputs with rm -r workdir.

oushujun commented 2 years ago

I removed the workdir and also .snakemake and rerun the command PATH="$PWD/bin:$PATH" snakemake --configfile=snakemake.yml --cores=all, still having the same error. Any ideas?

a-ludi commented 2 years ago

Oh, yes. You have got dots in your filenames which confuse some tools in the workflow. Please replace all dots (except the suffix .fastao. course) by something else, eg. an underscore. That should do the trick.

oushujun commented 2 years ago

Thank you for this trick! Yes, it's running now.

oushujun commented 2 years ago

Hi Arne,

Thank you for your guidance. I have finished DENTIST on the contig mode (patching scaffolds with assembled contigs), but no gaps were filled. Are there any log files I can check if DENTIST is running correctly? Or are there any parameters I can change to make it work better? Thanks!

Best, Shujun

a-ludi commented 2 years ago

Hi @oushujun,

please follow the discussion in #33 for information on that. I am going to post more detailed information about digging into non-closed gaps there.

oushujun commented 2 years ago

Hi @a-ludi,

Thank you for the instructions. Here are the log files you mentioned. logs.tar.gz.

The following are the configurations used:

snakemake.yml

full_validation: true
dentist_container: dentist_3.0.0.sif
dentist_env: envs/dentist_v3.yml
dentist_config:         dentist.greedy.yml
inputs:
    reference:          /home/sou6/oushujun/M82/M82_bac_210614_mod.fasta
    reads:              /home/sou6/oushujun/M82/m82_ont_polished.fasta
    reads_type:         OXFORD_NANOPORE
outputs:
    output_assembly:    gap-closed.fasta
reference_dbsplit:
    - -x1000
    - -a
    - -s50
reads_dbsplit:
    - -x1000
    - -a
    - -s50
workdir:            workdir
logdir:             logs
threads_per_process:  20
propagate_batch_size: 14
batch_size:         50
validation_blocks: 2

dentist.greedy.yml

__default__:
    verbose: 2
    read-coverage: 1
    ploidy: 2
    allow-single-reads: true
    best-pile-up-margin: 1.5
    existing-gap-bonus: 3.0
    join-policy: contigs
    min-reads-per-pile-up: 1
    min-spanning-reads: 1
    proper-alignment-allowance: 500

Command used to run DENTIST: PATH="$PWD/bin:$PATH" snakemake --configfile=snakemake.yml --cores=all

Please kindly let me know if you spot anything incorrectly specified or parameters I can used to improve. Thank you!

Best, Shujun

a-ludi commented 2 years ago

Hi Shujun,

I generated a report using the new Python script (see below). The main issue is "consensus alignment is invalid". That is not conclusive on its own but hints that there might be a bug in DENTIST. Could you please add the following lines (replace the path with something reasonable) to your dentist.yml and re-run?

process:
    tmpdir: /path/to/persistent/temp/directory
    keep-temp: True

Then I am interested in the contents of that tmpdir. Could you share that with me? If is it unreasonably big, share the log files first and I can tell you exactly which files I need.


In this run of DENTIST 108 potentially closable gaps were not closed. More details:

Hint: use DBshow -n workdir/[REFERENCE].dam | cat -n to translate contig numbers to FASTA coordinates.

oushujun commented 2 years ago

Hi Arne,

Thank you very much for checking my results and providing suggestions to debug. I am not very familar with snakemake, and I want to make sure I am doing it correctly. So I added the three lines to dentist.greedy.yml:

__default__:
    verbose: 2
    read-coverage: 1
    ploidy: 2
    allow-single-reads: true
    best-pile-up-margin: 1.5
    existing-gap-bonus: 3.0
    join-policy: contigs
    min-reads-per-pile-up: 1
    min-spanning-reads: 1
    proper-alignment-allowance: 500
process:
    tmpdir: /home/sou6/oushujun/M82/dentist-example/temp
    keep-temp: True

I rerun DENTIST and encounter the following error:

$ Exception in line 434 of /data/mschatz1/oushujun/projects/M82/dentist-example/Snakefile: Error: invalid key __process__ in config File "/data/mschatz1/oushujun/projects/M82/dentist-example/Snakefile", line 796, in File "/data/mschatz1/oushujun/projects/M82/dentist-example/Snakefile", line 434, in full_validate_dentist_config

Then I removed the process: line showing as:

__default__:
    verbose: 2
    read-coverage: 1
    ploidy: 2
    allow-single-reads: true
    best-pile-up-margin: 1.5
    existing-gap-bonus: 3.0
    join-policy: contigs
    min-reads-per-pile-up: 1
    min-spanning-reads: 1
    proper-alignment-allowance: 500
    tmpdir: /home/sou6/oushujun/M82/dentist-example/temp
    keep-temp: True

DENTIST can be run with this yml file, the run is on going (84/457 steps), but the /temp directory is still empty. Is this the right way to do it?

Thanks, Shujun

a-ludi commented 2 years ago

Hi Shujun,

that first error was my mistake, sorry. The command names in the config file cannot be abbreviated so it requires process-pile-ups instead of just process.

It is expected that the temp directory remains empty for quite a while because the first steps in the workflow do not involve the dentist executable. Just let it continue.

You can just send me the directories /home/sou6/oushujun/M82/dentist-example/temp/dentist-process-pile-ups-*. The rest is irrelevant for now.

oushujun commented 2 years ago

HI Arne,

Thanks for the extra information. I added the process-pile-ups lines to dentist.greedy.yml as follows:

__default__:
    verbose: 2
    read-coverage: 1
    ploidy: 2
    allow-single-reads: true
    best-pile-up-margin: 1.5
    existing-gap-bonus: 3.0
    join-policy: contigs
    min-reads-per-pile-up: 1
    min-spanning-reads: 1
    proper-alignment-allowance: 500
process-pile-ups:
    tmpdir: /data/mschatz1/oushujun/projects/M82/dentist-example/temp/
    keep-temp: True

I finished rerunning DENTIST without errors, however, the /temp/ directory is empty. Could it be due to rerunning DENTIST in the same folder as previous runs? I then remove the hidden .snakemake folder and rerun it, but the screen output still suggests not using the /temp/ folder. For example:

[Fri Jul 1 18:04:54 2022] rule mask_tandem_block: input: workdir/M82_bac_210614_mod.dam, workdir/.M82_bac_210614_mod.bps, workdir/.M82_bac_210614_mod.hdr, workdir/.M82_bac_210614_mod.idx, workdir/.assembly.M82_bac_210614_mod, workdir/TAN.M82_bac_210614_mod.12.las output: workdir/.M82_bac_210614_mod.12.tan.anno, workdir/.M82_bac_210614_mod.12.tan.data log: logs/mask-tandem.M82_bac_210614_mod.12.log jobid: 39 wildcards: dam=M82_bac_210614_mod, block=12 resources: tmpdir=/tmp

Did I set something wrong? Thanks!

Shujun

a-ludi commented 2 years ago

Hi Shujun,

did Snakemake execute the process rules? Check with grep -A10 'rule process:' .snakemake/logs/[insert log file name].log. Snakemake reports the log file at the end of every run.

Theoretically, you should be able to just issue the same command as before and it should recompute mostly everything because many rules depend on the DENTIST config file.

Cheers!

oushujun commented 2 years ago

Hi Arne,

I think snakemake execute the process rules:

$ grep -A10 'rule process:' .snakemake/log/2022-07-04T122*
.snakemake/log/2022-07-04T122919.244906.snakemake.log:rule process:
.snakemake/log/2022-07-04T122919.244906.snakemake.log-    input: workdir2/.M82_bac_210614_mod.dentist-self-H.anno, workdir2/.M82_bac_210614_mod.dentist-self-H.data, workdir2/.M82_bac_210614_mod.tan-H.anno, workdir2/.M82_bac_210614_mod.tan-H.data, workdir2/.M82_bac_210614_mod.dentist-reads-H.anno, workdir2/.M82_bac_210614_mod.dentist-reads-H.data, dentist.greedy.yml, workdir2/M82_bac_210614_mod.dam, workdir2/.M82_bac_210614_mod.bps, workdir2/.M82_bac_210614_mod.hdr, workdir2/.M82_bac_210614_mod.idx, workdir2/m82_ont_polished.dam, workdir2/.m82_ont_polished.bps, workdir2/.m82_ont_polished.hdr, workdir2/.m82_ont_polished.idx, workdir2/pile-ups.db
.snakemake/log/2022-07-04T122919.244906.snakemake.log-    output: workdir2/insertions/batch.0.db
.snakemake/log/2022-07-04T122919.244906.snakemake.log-    log: logs2/process.0.log
.snakemake/log/2022-07-04T122919.244906.snakemake.log-    jobid: 30
.snakemake/log/2022-07-04T122919.244906.snakemake.log-    wildcards: batch_id=0
.snakemake/log/2022-07-04T122919.244906.snakemake.log-    threads: 20
.snakemake/log/2022-07-04T122919.244906.snakemake.log-    resources: tmpdir=/tmp
.snakemake/log/2022-07-04T122919.244906.snakemake.log-
.snakemake/log/2022-07-04T122919.244906.snakemake.log-
.snakemake/log/2022-07-04T122919.244906.snakemake.log-[Mon Jul  4 14:39:28 2022]
.snakemake/log/2022-07-04T122919.244906.snakemake.log:rule process:
.snakemake/log/2022-07-04T122919.244906.snakemake.log-    input: workdir2/.M82_bac_210614_mod.dentist-self-H.anno, workdir2/.M82_bac_210614_mod.dentist-self-H.data, workdir2/.M82_bac_210614_mod.tan-H.anno, workdir2/.M82_bac_210614_mod.tan-H.data, workdir2/.M82_bac_210614_mod.dentist-reads-H.anno, workdir2/.M82_bac_210614_mod.dentist-reads-H.data, dentist.greedy.yml, workdir2/M82_bac_210614_mod.dam, workdir2/.M82_bac_210614_mod.bps, workdir2/.M82_bac_210614_mod.hdr, workdir2/.M82_bac_210614_mod.idx, workdir2/m82_ont_polished.dam, workdir2/.m82_ont_polished.bps, workdir2/.m82_ont_polished.hdr, workdir2/.m82_ont_polished.idx, workdir2/pile-ups.db
.snakemake/log/2022-07-04T122919.244906.snakemake.log-    output: workdir2/insertions/batch.2.db
.snakemake/log/2022-07-04T122919.244906.snakemake.log-    log: logs2/process.2.log
.snakemake/log/2022-07-04T122919.244906.snakemake.log-    jobid: 421
.snakemake/log/2022-07-04T122919.244906.snakemake.log-    wildcards: batch_id=2
.snakemake/log/2022-07-04T122919.244906.snakemake.log-    threads: 20
.snakemake/log/2022-07-04T122919.244906.snakemake.log-    resources: tmpdir=/tmp
.snakemake/log/2022-07-04T122919.244906.snakemake.log-
.snakemake/log/2022-07-04T122919.244906.snakemake.log-[Mon Jul  4 14:39:29 2022]
.snakemake/log/2022-07-04T122919.244906.snakemake.log-Finished job 30.
--
.snakemake/log/2022-07-04T122919.244906.snakemake.log:rule process:
.snakemake/log/2022-07-04T122919.244906.snakemake.log-    input: workdir2/.M82_bac_210614_mod.dentist-self-H.anno, workdir2/.M82_bac_210614_mod.dentist-self-H.data, workdir2/.M82_bac_210614_mod.tan-H.anno, workdir2/.M82_bac_210614_mod.tan-H.data, workdir2/.M82_bac_210614_mod.dentist-reads-H.anno, workdir2/.M82_bac_210614_mod.dentist-reads-H.data, dentist.greedy.yml, workdir2/M82_bac_210614_mod.dam, workdir2/.M82_bac_210614_mod.bps, workdir2/.M82_bac_210614_mod.hdr, workdir2/.M82_bac_210614_mod.idx, workdir2/m82_ont_polished.dam, workdir2/.m82_ont_polished.bps, workdir2/.m82_ont_polished.hdr, workdir2/.m82_ont_polished.idx, workdir2/pile-ups.db
.snakemake/log/2022-07-04T122919.244906.snakemake.log-    output: workdir2/insertions/batch.1.db
.snakemake/log/2022-07-04T122919.244906.snakemake.log-    log: logs2/process.1.log
.snakemake/log/2022-07-04T122919.244906.snakemake.log-    jobid: 420
.snakemake/log/2022-07-04T122919.244906.snakemake.log-    wildcards: batch_id=1
.snakemake/log/2022-07-04T122919.244906.snakemake.log-    threads: 20
.snakemake/log/2022-07-04T122919.244906.snakemake.log-    resources: tmpdir=/tmp
.snakemake/log/2022-07-04T122919.244906.snakemake.log-
.snakemake/log/2022-07-04T122919.244906.snakemake.log-[Mon Jul  4 14:39:29 2022]
.snakemake/log/2022-07-04T122919.244906.snakemake.log-Finished job 421.

But the specified /temp/ directory is still empty. I have specified new directories for results and log files, and remove previous .snakemake directories before running DENTIST. From the above logs, the temporaroy directory is tmpdir=/tmp not /temp/, does it means the specified path is not recognized?

Best, Shujun

a-ludi commented 2 years ago

Probably, the actual DENTIST config file was not updated properly. If feasible, please delete rm -rf ./workdir and run again. That should make sure that everything is run with the new config values.