Open Adamtaranto opened 3 years ago
Hi,
yes, dentist.json is the place. You can place them all under __default__, which means that they apply to every stage. Please remove the leading dashes (--), e.g.:
{
    "__default__": {
        "max-insertion-error": 0.15,
        "min-anchor-length": 1500,
        "min-reads-per-pile-up": 1,
        "min-spanning-reads": 1,
        "allow-single-reads": true,
        "join-policy": "contigs"
    }
}
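The conversion described above (drop the leading dashes, put everything under __default__) can be sketched in a few lines of Python. This is only an illustration: the helper name flags_to_config is made up for this example and is not part of DENTIST.

```python
import json


def flags_to_config(flags):
    """Turn command-line style options into a dentist.json-style dict.

    `flags` maps option names such as "--min-anchor-length" to values.
    The helper name is hypothetical; it only mirrors the advice above.
    """
    # Strip the leading dashes from every option name.
    options = {name.lstrip("-"): value for name, value in flags.items()}
    # Options under "__default__" apply to every stage.
    return {"__default__": options}


cli_flags = {
    "--max-insertion-error": 0.15,
    "--min-anchor-length": 1500,
    "--min-reads-per-pile-up": 1,
    "--min-spanning-reads": 1,
    "--allow-single-reads": True,
    "--join-policy": "contigs",
}
print(json.dumps(flags_to_config(cli_flags), indent=4))
```

Writing the printed text to dentist.json should give a file equivalent to the example above.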
It seems like you want to create a more greedy configuration. Please be aware that you should not further reduce min-anchor-length, as it is already rather sensitive to repeat masking errors; it may even lead to a reduced number of closed gaps because more scaffolding conflicts appear. Finally, here is an example of a greedy configuration I intend to put into the next release:
{
    "// WARNING": [
        "Use with care!",
        "Always validate the closed gaps (e.g. manual inspection)."
    ],
    "// NOTE": [
        "The workflow creates an intermediate assembly",
        "`workdir/{output_assembly}-preliminary.fasta` that contains all",
        "closed gaps, i.e. before validation. It is accompanied by an AGP",
        "and BED file. You may inspect these files for maximum sensitivity."
    ],
    "__default__": {
        "verbose": 2,
        "agp": "true",
        "allow-single-reads": true,
        "best-pile-up-margin": 1.5,
        "existing-gap-bonus": 3.0,
        "join-policy": "contigs",
        "min-reads-per-pile-up": 1,
        "min-spanning-reads": 1,
        "proper-alignment-allowance": 500
    },
    "// Uncomment the following block if": [
        "the default value for min-coverage-reads",
        "(0.5 * --read-coverage/--ploidy) is suboptimal. It is important to",
        "revert --read-coverage and --ploidy because they are mutually exclusive",
        "with --min-coverage-reads."
    ]
}
@Adamtaranto Did this help? BTW you should consider updating DENTIST to v2.0.0. :wink:
hello @a-ludi,
I am trying to run DENTIST in greedy mode using the example files. I used the dentist.greedy.yml configuration to replace the original config:
cat dentist.greedy.yml
__default__:
  verbose: 2
  allow-single-reads: true
  best-pile-up-margin: 1.5
  existing-gap-bonus: 3.0
  join-policy: contigs
  min-reads-per-pile-up: 1
  min-spanning-reads: 1
  proper-alignment-allowance: 500
This is the snakemake config:
cat snakemake.yml
full_validation: true
dentist_container: dentist_3.0.0.sif
dentist_env: envs/dentist_v3.yml
dentist_config: dentist.greedy.yml
inputs:
  reference: reference.fasta
  reads: reads.fasta
  reads_type: PACBIO_SMRT
outputs:
  output_assembly: gap-closed.fasta
reference_dbsplit:
  - -x1000
  - -a
  - -s50
reads_dbsplit:
  - -x1000
  - -a
  - -s50
workdir: workdir_greedy_test
logdir: logs
threads_per_process: 20
propagate_batch_size: 14
batch_size: 50
validation_blocks: 2
I then ran it in local mode but encountered the following error:
PATH="$PWD/bin:$PATH" snakemake --configfile=snakemake.yml --cores=all
Exception in line 434 of /home/sou6/bin/dentist.v3.0.0.x86_64/dentist-example/Snakefile:
must specify either --read-coverage or --max-coverage-reads for command `mask-repetitive-regions`; must specify either --read-coverage and --ploidy or --min-coverage-reads for command `validate-regions`
File "/home/sou6/bin/dentist.v3.0.0.x86_64/dentist-example/Snakefile", line 796, in <module>
File "/home/sou6/bin/dentist.v3.0.0.x86_64/dentist-example/Snakefile", line 434, in full_validate_dentist_config
Can you give me some help? Thank you very much!
Best, Shujun
Hi Shujun,
just add a line with the proper read coverage below __default__ in dentist.greedy.yml:
  read-coverage: {coverage}
See https://github.com/a-ludi/dentist#how-to-choose-dentist-parameters for more details.
Cheers,
Arne
Dear Arne @a-ludi,
Thank you for your suggestion. I added read-coverage: 1 and encountered the following error:
$ PATH="$PWD/bin:$PATH" snakemake --configfile=snakemake.yml --cores=all
Exception in line 434 of /data/mschatz1/oushujun/dentist-example/Snakefile:
must specify either --read-coverage and --ploidy or --min-coverage-reads for command `validate-regions`
  File "/data/mschatz1/oushujun/dentist-example/Snakefile", line 796, in
  File "/data/mschatz1/oushujun/dentist-example/Snakefile", line 434, in full_validate_dentist_config
I then added min-coverage-reads: 1 to dentist.greedy.yml and encountered the following error:
Building DAG of jobs...
MissingInputException in line 1133 of /data/mschatz1/oushujun/dentist-example/Snakefile:
Missing input files for rule ref_vs_reads_alignment_block:
    output: workdir/M82.bac.210614.mod.m82_ont_polished.1.las, workdir/m82_ont_polished.1.M82.bac.210614.mod.las
    wildcards: block_reads=1
    affected files:
        workdir/.M82.bac.210614.mod.dentist-self.anno
        workdir/.M82.bac.210614.mod.dentist-self.data
        workdir/.M82.bac.210614.mod.dust.data
        workdir/.M82.bac.210614.mod.tan.anno
        workdir/.M82.bac.210614.mod.dust.anno
        workdir/.M82.bac.210614.mod.tan.data
Do you have any ideas to move forward?
Thank you, Shujun
Hi Shujun,
I forgot that you need to add ploidy: 2, too. Typically, this is 2; please adjust it to your genome assembly. Please remove min-coverage-reads to avoid further error messages.
Cheers, Arne
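The coverage requirements quoted in the error messages above can be checked before launching Snakemake. The following is a minimal sketch that only restates those messages and the "mutually exclusive" note from the greedy config; the function coverage_problems is hypothetical and is not how DENTIST itself validates its config.

```python
def coverage_problems(options):
    """Return a list of problems with the coverage-related options.

    `options` is the dict found under __default__ in the DENTIST config.
    The rules are taken from the error messages in this thread only.
    """
    has_coverage = "read-coverage" in options
    has_ploidy = "ploidy" in options
    has_min = "min-coverage-reads" in options
    has_max = "max-coverage-reads" in options

    problems = []
    # mask-repetitive-regions: either read-coverage or max-coverage-reads.
    if not (has_coverage or has_max):
        problems.append(
            "mask-repetitive-regions: need read-coverage or max-coverage-reads"
        )
    # validate-regions: either (read-coverage and ploidy) or min-coverage-reads.
    if not ((has_coverage and has_ploidy) or has_min):
        problems.append(
            "validate-regions: need read-coverage and ploidy, or min-coverage-reads"
        )
    # The greedy config notes that these options exclude each other.
    if has_min and (has_coverage or has_ploidy):
        problems.append(
            "min-coverage-reads is mutually exclusive with read-coverage/ploidy"
        )
    return problems


print(coverage_problems({"read-coverage": 1}))
print(coverage_problems({"read-coverage": 1, "ploidy": 2}))
```

With only read-coverage set, the sketch reports the same validate-regions complaint as above; adding ploidy clears it.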
Hi Arne,
Thanks for your instant response. I removed min-coverage-reads and added ploidy: 2, but I get exactly the same error as with min-coverage-reads.
Did the config file skip any critical steps of DENTIST?
Best, Shujun
Please try starting over from the beginning by removing all outputs with rm -r workdir.
I removed the workdir and also .snakemake and reran the command PATH="$PWD/bin:$PATH" snakemake --configfile=snakemake.yml --cores=all, but I am still getting the same error. Any ideas?
Oh, yes. You have got dots in your filenames, which confuse some tools in the workflow. Please replace all dots (except the suffix .fasta, of course) by something else, e.g. an underscore. That should do the trick.
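A small Python sketch of that renaming rule, assuming the only dot to keep is the one before the final suffix. The helper dot_free_name is made up for this illustration.

```python
from pathlib import Path


def dot_free_name(filename, replacement="_"):
    """Replace every dot in a file name except the one before the suffix."""
    path = Path(filename)
    # Path.stem drops only the last suffix, so inner dots are still present.
    stem = path.stem.replace(".", replacement)
    # Directory components are left untouched; only the file name changes.
    return str(path.with_name(stem + path.suffix))


print(dot_free_name("M82.bac.210614.mod.fasta"))
```

The actual rename can then be done with Path(old).rename(dot_free_name(old)); remember to update the paths in snakemake.yml accordingly.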
Thank you for this trick! Yes, it's running now.
Hi Arne,
Thank you for your guidance. I have finished running DENTIST in contig mode (patching scaffolds with assembled contigs), but no gaps were filled. Are there any log files I can check to see whether DENTIST is running correctly? Or are there any parameters I can change to make it work better? Thanks!
Best, Shujun
Hi @oushujun,
please follow the discussion in #33 for information on that. I am going to post more detailed information about digging into non-closed gaps there.
Hi @a-ludi,
Thank you for the instructions. Here are the log files you mentioned. logs.tar.gz.
The following are the configurations used:
snakemake.yml
full_validation: true
dentist_container: dentist_3.0.0.sif
dentist_env: envs/dentist_v3.yml
dentist_config: dentist.greedy.yml
inputs:
  reference: /home/sou6/oushujun/M82/M82_bac_210614_mod.fasta
  reads: /home/sou6/oushujun/M82/m82_ont_polished.fasta
  reads_type: OXFORD_NANOPORE
outputs:
  output_assembly: gap-closed.fasta
reference_dbsplit:
  - -x1000
  - -a
  - -s50
reads_dbsplit:
  - -x1000
  - -a
  - -s50
workdir: workdir
logdir: logs
threads_per_process: 20
propagate_batch_size: 14
batch_size: 50
validation_blocks: 2
dentist.greedy.yml
__default__:
  verbose: 2
  read-coverage: 1
  ploidy: 2
  allow-single-reads: true
  best-pile-up-margin: 1.5
  existing-gap-bonus: 3.0
  join-policy: contigs
  min-reads-per-pile-up: 1
  min-spanning-reads: 1
  proper-alignment-allowance: 500
Command used to run DENTIST:
PATH="$PWD/bin:$PATH" snakemake --configfile=snakemake.yml --cores=all
Please let me know if you spot anything specified incorrectly or any parameters I can use to improve the results. Thank you!
Best, Shujun
Hi Shujun,
I generated a report using the new Python script (see below). The main issue is "consensus alignment is invalid". That is not conclusive on its own but hints that there might be a bug in DENTIST. Could you please add the following lines (replace the path with something reasonable) to your dentist.yml and re-run?
process:
  tmpdir: /path/to/persistent/temp/directory
  keep-temp: True
Then I am interested in the contents of that tmpdir. Could you share that with me? If it is unreasonably big, share the log files first and I can tell you exactly which files I need.
In this run of DENTIST 108 potentially closable gaps were not closed. More details:
Hint: use DBshow -n workdir/[REFERENCE].dam | cat -n to translate contig numbers to FASTA coordinates.
collect phase: --min-spanning-reads=1
process phase: --only=spanning
output phase: --max-insertion-error=0.1, --join-policy=contigs, --min-extension-length=100
Hi Arne,
Thank you very much for checking my results and providing suggestions for debugging. I am not very familiar with Snakemake, and I want to make sure I am doing it correctly.
So I added the three lines to dentist.greedy.yml:
__default__:
  verbose: 2
  read-coverage: 1
  ploidy: 2
  allow-single-reads: true
  best-pile-up-margin: 1.5
  existing-gap-bonus: 3.0
  join-policy: contigs
  min-reads-per-pile-up: 1
  min-spanning-reads: 1
  proper-alignment-allowance: 500
process:
  tmpdir: /home/sou6/oushujun/M82/dentist-example/temp
  keep-temp: True
I reran DENTIST and encountered the following error:
Exception in line 434 of /data/mschatz1/oushujun/projects/M82/dentist-example/Snakefile:
Error: invalid key __process__ in config
  File "/data/mschatz1/oushujun/projects/M82/dentist-example/Snakefile", line 796, in
  File "/data/mschatz1/oushujun/projects/M82/dentist-example/Snakefile", line 434, in full_validate_dentist_config
Then I removed the process: line, so the file looks like this:
__default__:
  verbose: 2
  read-coverage: 1
  ploidy: 2
  allow-single-reads: true
  best-pile-up-margin: 1.5
  existing-gap-bonus: 3.0
  join-policy: contigs
  min-reads-per-pile-up: 1
  min-spanning-reads: 1
  proper-alignment-allowance: 500
  tmpdir: /home/sou6/oushujun/M82/dentist-example/temp
  keep-temp: True
DENTIST runs with this yml file and the run is ongoing (84/457 steps), but the /temp directory is still empty. Is this the right way to do it?
Thanks, Shujun
Hi Shujun,
that first error was my mistake, sorry. The command names in the config file cannot be abbreviated, so it requires process-pile-ups instead of just process.
It is expected that the temp directory remains empty for quite a while because the first steps in the workflow do not involve the dentist executable. Just let it continue.
You can just send me the directories /home/sou6/oushujun/M82/dentist-example/temp/dentist-process-pile-ups-*. The rest is irrelevant for now.
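To see whether any of those directories have appeared yet, a short Python check along these lines may help. It assumes only the dentist-process-pile-ups-* naming pattern mentioned above; the helper pile_up_tmpdirs is made up for this example.

```python
from pathlib import Path


def pile_up_tmpdirs(tmpdir):
    """List kept temp directories of the process-pile-ups step, sorted by path."""
    # Match the naming pattern from the message above and skip plain files.
    candidates = Path(tmpdir).glob("dentist-process-pile-ups-*")
    return sorted(path for path in candidates if path.is_dir())
```

For instance, pile_up_tmpdirs("/home/sou6/oushujun/M82/dentist-example/temp") returns an empty list as long as no such directory has been written.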
Hi Arne,
Thanks for the extra information. I added the process-pile-ups lines to dentist.greedy.yml as follows:
__default__:
  verbose: 2
  read-coverage: 1
  ploidy: 2
  allow-single-reads: true
  best-pile-up-margin: 1.5
  existing-gap-bonus: 3.0
  join-policy: contigs
  min-reads-per-pile-up: 1
  min-spanning-reads: 1
  proper-alignment-allowance: 500
process-pile-ups:
  tmpdir: /data/mschatz1/oushujun/projects/M82/dentist-example/temp/
  keep-temp: True
I finished rerunning DENTIST without errors; however, the /temp/ directory is empty. Could it be due to rerunning DENTIST in the same folder as previous runs? I then removed the hidden .snakemake folder and reran it, but the screen output still suggests that the /temp/ folder is not being used. For example:
[Fri Jul 1 18:04:54 2022]
rule mask_tandem_block:
    input: workdir/M82_bac_210614_mod.dam, workdir/.M82_bac_210614_mod.bps, workdir/.M82_bac_210614_mod.hdr, workdir/.M82_bac_210614_mod.idx, workdir/.assembly.M82_bac_210614_mod, workdir/TAN.M82_bac_210614_mod.12.las
    output: workdir/.M82_bac_210614_mod.12.tan.anno, workdir/.M82_bac_210614_mod.12.tan.data
    log: logs/mask-tandem.M82_bac_210614_mod.12.log
    jobid: 39
    wildcards: dam=M82_bac_210614_mod, block=12
    resources: tmpdir=/tmp
Did I set something wrong? Thanks!
Shujun
Hi Shujun,
did Snakemake execute the process rules? Check with grep -A10 'rule process:' .snakemake/log/[insert log file name].log. Snakemake reports the log file at the end of every run.
Theoretically, you should be able to just issue the same command as before, and it should recompute almost everything because many rules depend on the DENTIST config file.
Cheers!
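As an alternative to the grep above, the same check can be sketched in Python. It assumes the Snakemake log layout where a "rule process:" line is followed by indented "log:" and other fields; the function process_rule_logs is hypothetical.

```python
def process_rule_logs(log_text):
    """Return the `log:` entry of every `rule process:` block in a Snakemake log."""
    logs = []
    in_process_rule = False
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("rule "):
            # A new rule block starts; remember whether it is the process rule.
            in_process_rule = stripped == "rule process:"
        elif in_process_rule and stripped.startswith("log:"):
            logs.append(stripped[len("log:"):].strip())
            in_process_rule = False
    return logs
```

Passing the text of a log file from .snakemake/log/ yields one entry per executed process job; an empty list suggests that the process rule did not run.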
Hi Arne,
I think Snakemake executed the process rules:
$ grep -A10 'rule process:' .snakemake/log/2022-07-04T122*
.snakemake/log/2022-07-04T122919.244906.snakemake.log:rule process:
.snakemake/log/2022-07-04T122919.244906.snakemake.log- input: workdir2/.M82_bac_210614_mod.dentist-self-H.anno, workdir2/.M82_bac_210614_mod.dentist-self-H.data, workdir2/.M82_bac_210614_mod.tan-H.anno, workdir2/.M82_bac_210614_mod.tan-H.data, workdir2/.M82_bac_210614_mod.dentist-reads-H.anno, workdir2/.M82_bac_210614_mod.dentist-reads-H.data, dentist.greedy.yml, workdir2/M82_bac_210614_mod.dam, workdir2/.M82_bac_210614_mod.bps, workdir2/.M82_bac_210614_mod.hdr, workdir2/.M82_bac_210614_mod.idx, workdir2/m82_ont_polished.dam, workdir2/.m82_ont_polished.bps, workdir2/.m82_ont_polished.hdr, workdir2/.m82_ont_polished.idx, workdir2/pile-ups.db
.snakemake/log/2022-07-04T122919.244906.snakemake.log- output: workdir2/insertions/batch.0.db
.snakemake/log/2022-07-04T122919.244906.snakemake.log- log: logs2/process.0.log
.snakemake/log/2022-07-04T122919.244906.snakemake.log- jobid: 30
.snakemake/log/2022-07-04T122919.244906.snakemake.log- wildcards: batch_id=0
.snakemake/log/2022-07-04T122919.244906.snakemake.log- threads: 20
.snakemake/log/2022-07-04T122919.244906.snakemake.log- resources: tmpdir=/tmp
.snakemake/log/2022-07-04T122919.244906.snakemake.log-
.snakemake/log/2022-07-04T122919.244906.snakemake.log-
.snakemake/log/2022-07-04T122919.244906.snakemake.log-[Mon Jul 4 14:39:28 2022]
.snakemake/log/2022-07-04T122919.244906.snakemake.log:rule process:
.snakemake/log/2022-07-04T122919.244906.snakemake.log- input: workdir2/.M82_bac_210614_mod.dentist-self-H.anno, workdir2/.M82_bac_210614_mod.dentist-self-H.data, workdir2/.M82_bac_210614_mod.tan-H.anno, workdir2/.M82_bac_210614_mod.tan-H.data, workdir2/.M82_bac_210614_mod.dentist-reads-H.anno, workdir2/.M82_bac_210614_mod.dentist-reads-H.data, dentist.greedy.yml, workdir2/M82_bac_210614_mod.dam, workdir2/.M82_bac_210614_mod.bps, workdir2/.M82_bac_210614_mod.hdr, workdir2/.M82_bac_210614_mod.idx, workdir2/m82_ont_polished.dam, workdir2/.m82_ont_polished.bps, workdir2/.m82_ont_polished.hdr, workdir2/.m82_ont_polished.idx, workdir2/pile-ups.db
.snakemake/log/2022-07-04T122919.244906.snakemake.log- output: workdir2/insertions/batch.2.db
.snakemake/log/2022-07-04T122919.244906.snakemake.log- log: logs2/process.2.log
.snakemake/log/2022-07-04T122919.244906.snakemake.log- jobid: 421
.snakemake/log/2022-07-04T122919.244906.snakemake.log- wildcards: batch_id=2
.snakemake/log/2022-07-04T122919.244906.snakemake.log- threads: 20
.snakemake/log/2022-07-04T122919.244906.snakemake.log- resources: tmpdir=/tmp
.snakemake/log/2022-07-04T122919.244906.snakemake.log-
.snakemake/log/2022-07-04T122919.244906.snakemake.log-[Mon Jul 4 14:39:29 2022]
.snakemake/log/2022-07-04T122919.244906.snakemake.log-Finished job 30.
--
.snakemake/log/2022-07-04T122919.244906.snakemake.log:rule process:
.snakemake/log/2022-07-04T122919.244906.snakemake.log- input: workdir2/.M82_bac_210614_mod.dentist-self-H.anno, workdir2/.M82_bac_210614_mod.dentist-self-H.data, workdir2/.M82_bac_210614_mod.tan-H.anno, workdir2/.M82_bac_210614_mod.tan-H.data, workdir2/.M82_bac_210614_mod.dentist-reads-H.anno, workdir2/.M82_bac_210614_mod.dentist-reads-H.data, dentist.greedy.yml, workdir2/M82_bac_210614_mod.dam, workdir2/.M82_bac_210614_mod.bps, workdir2/.M82_bac_210614_mod.hdr, workdir2/.M82_bac_210614_mod.idx, workdir2/m82_ont_polished.dam, workdir2/.m82_ont_polished.bps, workdir2/.m82_ont_polished.hdr, workdir2/.m82_ont_polished.idx, workdir2/pile-ups.db
.snakemake/log/2022-07-04T122919.244906.snakemake.log- output: workdir2/insertions/batch.1.db
.snakemake/log/2022-07-04T122919.244906.snakemake.log- log: logs2/process.1.log
.snakemake/log/2022-07-04T122919.244906.snakemake.log- jobid: 420
.snakemake/log/2022-07-04T122919.244906.snakemake.log- wildcards: batch_id=1
.snakemake/log/2022-07-04T122919.244906.snakemake.log- threads: 20
.snakemake/log/2022-07-04T122919.244906.snakemake.log- resources: tmpdir=/tmp
.snakemake/log/2022-07-04T122919.244906.snakemake.log-
.snakemake/log/2022-07-04T122919.244906.snakemake.log-[Mon Jul 4 14:39:29 2022]
.snakemake/log/2022-07-04T122919.244906.snakemake.log-Finished job 421.
But the specified /temp/ directory is still empty. I specified new directories for results and log files, and removed the previous .snakemake directories before running DENTIST. In the above logs, the temporary directory is tmpdir=/tmp, not /temp/. Does this mean the specified path is not recognized?
Best, Shujun
Probably, the actual DENTIST config file was not updated properly. If feasible, please delete the working directory with rm -rf ./workdir and run again. That should make sure that everything is run with the new config values.
When setting the parameters below, do these need to be included in the dentist.json config file? And if so, in which section?
--max-insertion-error --min-anchor-length --min-reads-per-pile-up --min-spanning-reads --allow-single-reads --join-policy