Missing half of output files from test run

a-ludi / dentist

Close assembly gaps using long-reads at high accuracy.

https://a-ludi.github.io/dentist/

MIT License

47 stars 6 forks source link

Missing half of output files from test run #19

Closed rotoke closed 3 years ago

rotoke commented 3 years ago

Dear Arne,

I'm having some trouble getting dentist to run on our university cluster with the example dataset. I haven't used snakemake before, so please excuse if I missed something very basic. Here is what I've tried so far:

The symlinks to the .yml configuration files in the v.1.0.2-2 example folder appear broken on my system, and v.1.0.1 only has a drmaa config file. I thus took the ones from the pre-built binaries and adjusted them as follows:

Fixed some inconsistencies with file endings .yml / .yaml
Added -A PROJECT_NAME to the profile-slurm.submit-async.yml file (requirement on this cluster) and copied to $HOME/.config/snakemake/slurm/config.yaml
Changed partition from batch to the actual cluster partition name in the cluster.yml file
Changed file paths and increased max thread number to 32 in the snakemake.yml file
Changed read coverage and ploidy in the dentist.json file according to the one in v.1.0.1

Running the pipeline using snakemake v.6.4.0 and singularity v.3.7.1 finished in ~5.5h without any obvious error in the log files. However, there is no output and md5sum -c checksum.md5shows half of the files as missing:

gap-closed.fasta: FAILED open or read
workdir/.assembly-test.bps: OK
workdir/.assembly-test.dentist-reads.anno: OK
workdir/.assembly-test.dentist-reads.data: OK
workdir/.assembly-test.dentist-self.anno: OK
workdir/.assembly-test.dentist-self.data: OK
workdir/.assembly-test.dust.anno: OK
workdir/.assembly-test.dust.data: OK
workdir/.assembly-test.hdr: OK
workdir/.assembly-test.idx: OK
workdir/.assembly-test.tan.anno: OK
workdir/.assembly-test.tan.data: OK
md5sum: workdir/.gap-closed-preliminary.bps: No such file or directory
workdir/.gap-closed-preliminary.bps: FAILED open or read
md5sum: workdir/.gap-closed-preliminary.dentist-self.anno: No such file or directory
workdir/.gap-closed-preliminary.dentist-self.anno: FAILED open or read
md5sum: workdir/.gap-closed-preliminary.dentist-self.data: No such file or directory
workdir/.gap-closed-preliminary.dentist-self.data: FAILED open or read
md5sum: workdir/.gap-closed-preliminary.dentist-weak-coverage.anno: No such file or directory
workdir/.gap-closed-preliminary.dentist-weak-coverage.anno: FAILED open or read
md5sum: workdir/.gap-closed-preliminary.dentist-weak-coverage.data: No such file or directory
workdir/.gap-closed-preliminary.dentist-weak-coverage.data: FAILED open or read
md5sum: workdir/.gap-closed-preliminary.dust.anno: No such file or directory
workdir/.gap-closed-preliminary.dust.anno: FAILED open or read
md5sum: workdir/.gap-closed-preliminary.dust.data: No such file or directory
workdir/.gap-closed-preliminary.dust.data: FAILED open or read
md5sum: workdir/.gap-closed-preliminary.hdr: No such file or directory
workdir/.gap-closed-preliminary.hdr: FAILED open or read
md5sum: workdir/.gap-closed-preliminary.idx: No such file or directory
workdir/.gap-closed-preliminary.idx: FAILED open or read
md5sum: workdir/.gap-closed-preliminary.tan.anno: No such file or directory
workdir/.gap-closed-preliminary.tan.anno: FAILED open or read
md5sum: workdir/.gap-closed-preliminary.tan.data: No such file or directory
workdir/.gap-closed-preliminary.tan.data: FAILED open or read
workdir/.reads.bps: OK
workdir/.reads.idx: OK
workdir/assembly-test.assembly-test.las: OK
workdir/assembly-test.dam: OK
workdir/assembly-test.reads.las: OK
md5sum: workdir/gap-closed-preliminary.dam: No such file or directory
workdir/gap-closed-preliminary.dam: FAILED open or read
md5sum: workdir/gap-closed-preliminary.fasta: No such file or directory
workdir/gap-closed-preliminary.fasta: FAILED open or read
md5sum: workdir/gap-closed-preliminary.gap-closed-preliminary.las: No such file or directory
workdir/gap-closed-preliminary.gap-closed-preliminary.las: FAILED open or read
md5sum: workdir/gap-closed-preliminary.reads.las: No such file or directory
workdir/gap-closed-preliminary.reads.las: FAILED open or read
workdir/reads.db: OK
md5sum: WARNING: 16 listed files could not be read

Here some system information:

Distributor ID: Scientific
Description:    Scientific Linux release 7.9 (Nitrogen)
Release:    7.9
Codename:   Nitrogen

And here the slurm submission script:


#SBATCH -J gorteria_dentist
#SBATCH -A PROJECT_NAME
#SBATCH --output=dentist_%A.out
#SBATCH --error=dentist_%A.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=6:00:00
#SBATCH --mail-type=ALL
#SBATCH -p PARTITION

. /etc/profile.d/modules.sh
module purge
module load rhel7/default-peta4
module load use.own
module load dentist
module load mambaforge
source activate /PATH/mambaforge/envs/snakemake

snakemake --configfile=snakemake.yaml --use-singularity --profile=slurm

Thank you very much for your help! Roman

a-ludi commented 3 years ago

Dear Roman,

thanks for considering DENTIST. I am preparing a new release right now and it will hopefully work better. For the time begin, here is some advice.

The symlinks to the .yml configuration files in the v.1.0.2-2 example folder appear broken on my system [...].

I am aware of that and fixed the problem.

Fixed some inconsistencies with file endings .yml / .yaml

I am aware of that and fixed the problem.

Added -A PROJECT_NAME to the profile-slurm.submit-async.yml file (requirement on this cluster) and copied to $HOME/.config/snakemake/slurm/config.yaml

Changed partition from batch to the actual cluster partition name in the cluster.yml file

Changed file paths and increased max thread number to 32 in the snakemake.yml file

Perfect!

Changed read coverage and ploidy in the dentist.json file according to the one in v.1.0.1

I was not aware of this. Thanks for pointing it out!

Running the pipeline using snakemake v.6.4.0 and singularity v.3.7.1 finished in ~5.5h without any obvious error in the log files. However, there is no output and md5sum -c checksum.md5shows half of the files as missing:

That's quite long actually. Is it because the jobs were queued for a long time? On my laptop with a rather weak 2 core CPU it only takes 10 minutes.

The problem is probably that Snakemake stopped execution after the collect rule which is incorrect and a bug in Snakemake. Please just try chmod -R +w workdir and execute Snakemake with the same command as before. I think then it should finish just fine. The *.las files may have bad checksums but that is nothing to worry about.

Best, Arne

rotoke commented 3 years ago

Dear Arne,

Thank you for the swift and helpful reply. I restarted the pipeline as suggested and got all OK the second try (the first time I tried I hit the Docker hub rate limit and thus had to wait until the next day).

That's quite long actually. Is it because the jobs were queued for a long time? On my laptop with a rather weak 2 core CPU it only takes 10 minutes.

Yes, the jobs spent several hours in the queue, which may be problematic for my actual assembly data as the maximum runtime is only 36h on this cluster. Is there any way to tune the performance a bit for the actual assembly besides increasing the CPU number in the snakemake.yml file? We have nodes with 32 CPUs and 32 x 12 = 384 GB RAM.

However, I assume it shouldn't be an issue in the end as the pipeline can be resumed any time?

Best, Roman

a-ludi commented 3 years ago

Awesome!

Yes, the jobs spent several hours in the queue, which may be problematic for my actual assembly data as the maximum runtime is only 36h on this cluster.

Usually, the runtime counter starts when the job gets executed so the queuing time is not included. The DENTIST's jobs nevery took that long for me.

For performance adjustments, take a look into the example snakemake.yml. You should not crank the number of CPUs too much because the performance does not grow linear. If you have a large assembly or huge read set you should adjust the -s parameter in reference_dbsplit and reads_dbsplit, respectively. It means that the assembly/read set is split into chunks of the given number of Mb. The number of chunks dictates the number of cluster jobs and higher number of jobs means more overhead. The same argument holds for propagate_batch_size and batch_size: higher number, less jobs.

If you you like you can report your assembly and read set size and I will advice on a specific choice of parameters.

However, I assume it shouldn't be an issue in the end as the pipeline can be resumed any time?

Yes.

rotoke commented 3 years ago

Super, thank you!

If you you like you can report your assembly and read set size and I will advice on a specific choice of parameters.

I have a ca. 1 Gbp Hi-C scaffolded PacBio assembly with 2826 gaps (ca. 2.1 Gbp if I add all the haplotigs back - see #20 )
The PacBio read set contains 18 million reads with an N50 of ca. 8.8 Kbp

a-ludi commented 3 years ago

Then you are probably good with the default parameters. How much read coverage do you have, or, how many base pairs are in the reads? From our experiments, we see that 30X are already sufficient. I would definitely reduce the read set if it has more than 50X – just take the first X Gb of reads.

rotoke commented 3 years ago

Hi! There are around 128 G base pairs in the raw PacBio reads, which theoretically gives 55 - 60 x coverage (assuming a 2n genome size of 2.1-2.2 Gbp). Real coverage must be much more uneven though due to the high heterozygosity and haplotig purging.

a-ludi commented 3 years ago

In that case, I would just use the whole read set for simplicity. But you may want to increase the block size for the reads, i.e. in reads_dbsplit set -s400. This should nearly halve the number of jobs. Let me know if that makes anything crash.

rotoke commented 3 years ago

Dear Arne,

Thank you for your recommendations. I started a run with the full read set, reference_dbsplitset to -s 200, and reads_dbsplitset to -s 400. Unfortunately this crashed with a bunch of errors:

First one:Error in rule mask_dust- I didn't see anything in the log files that would explain why:

INFO:    Using cached SIF image
Building DAG of jobs...
Updating job mask_dust.
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
        1   mask_dust
        1
Select jobs to execute...

[Mon Jun 14 16:57:38 2021]
rule mask_dust:
    input: workdir/gorteria_dovetail.dam, workdir/.gorteria_dovetail.bps, workdir/.gorteria_dovetail.hdr, workdir/.gorteria_dovetail.idx, workdir/.assembly.gorteri$
    output: workdir/.gorteria_dovetail.dust.anno, workdir/.gorteria_dovetail.dust.data
    jobid: 0
    wildcards: dam=gorteria_dovetail

Activating singularity image [...]
[Mon Jun 14 16:58:25 2021]
Finished job 0.
1 of 1 steps (100%) done

Second one: Error in rule self_alignment_block. This happened to all 60 self-alignment jobs. Looking at the logs and the cluster job record, it seems like they all ran out of time. Would it help to increase the allocated time in the cluster.yml file?
In addition, some of the last jobs submitted again hit the Docker hub rate limit. I doubt I can make the cluster admins subscribe to this, so probably no solution here.

FATAL:   Unable to handle docker://aludi/dentist:stable uri: failed to get checksum for docker://aludi/dentist:stable: Error reading manifest stable in docker.io/aludi/dentist: toomanyrequests: You have reached your pull rate limit.
Building DAG of jobs...
Updating job mask_dust.
Updating job tandem_alignment_block.
Updating job mask_tandem_block.
Updating job tandem_alignment_block.
Updating job mask_tandem_block.
Updating job tandem_alignment_block.
Updating job mask_tandem_block.
Updating job tandem_alignment_block.
Updating job mask_tandem_block.
Updating job tandem_alignment_block.
Updating job mask_tandem_block.
Updating job mask_tandem.
Updating job self_alignment_block.
Using shell: /usr/bin/bash
Provided cores: 32
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
        1   self_alignment_block
        1
Select jobs to execute...

[Mon Jun 14 19:07:37 2021]
rule self_alignment_block:
    input: dentist.json, workdir/gorteria_dovetail.dam, workdir/.gorteria_dovetail.bps, workdir/.gorteria_dovetail.hdr, workdir/.gorteria_dovetail.idx, workdir/.as$
    output: workdir/gorteria_dovetail.5.gorteria_dovetail.5.las, workdir/gorteria_dovetail.5.gorteria_dovetail.5.las
    log: logs/self-alignment.gorteria_dovetail.5.5.log
    jobid: 0
    wildcards: dam=gorteria_dovetail, block_ref=5, block_reads=5
    threads: 32

slurmstepd: error: *** JOB 42142017 ON cpu-e-385 CANCELLED AT 2021-06-14T19:08:01 DUE TO TIME LIMIT ***
slurmstepd: error: task_p_post_term: rmdir(/sys/fs/cgroup/cpuset/slurm42142017/slurm42142017.4294967294_0) failed Device or resource busy

Best, Roman

a-ludi commented 3 years ago

Hi Roman,

Regarding the Docker rate problem:

Download the container image using singularity pull docker://aludi/dentist:stable and adjust snakemake.yml to make use of the downloaded image:

dentist_container: "./dentist_stable.sif"

The next release will have this built into the workflow, so I will not create a separate issue for this.

Regarding the crashes:

You will need to look into the log files under ./logs, e.g. logs/self-alignment.gorteria_dovetail.5.5.log. Also check the status report of the cluster, it should report resource problems, such as timeout, distinctly.

I have one observation, which is that the parameters to DBsplit do allow for a space between the option and its value, meaning you have to specify -s200 and -s400, respectively.

rotoke commented 3 years ago

Hi Arne,

Thanks a lot for your workaround, which solved the docker rate problem. The other issues were not connected to DBsplit (I simply typed it wrong in my comment above - sorry for that), but to insufficient time and memory in the cluster.yml file.

I now managed to run almost the entire pipeline with my data, but unfortunately it failed again at the collectstep. The log files say the following:

Error: dentist.commandline.CLIException@source/dentist/commandline.d(3238): invalid argument <in:reference>: expected .dam or .db but got w

I'm not sure to which reference file this refers to, but I have an assembly.dam, a reads.db, and a pile-ups.db file in my workdir. I had to restart the pipeline many times to crank up time / RAM for individual steps, so maybe something went wrong along the way?

Thank you! Roman

a-ludi commented 3 years ago

Can you please run the pipeline once more with the additional snakemake parameter -p and look for checkpoint collect: in the log? Snakemake will print the exact command it issues after that and that's where something went wrong.

a-ludi commented 3 years ago

PS: after you have made it through the entire workflow (sorry for the hazzle!), could you please share your cluster.yml so I can increase the default value? It would be nice if most users don't have to care for that.

rotoke commented 3 years ago

Hi and thanks for the reply. I ran the pipeline again with -p and weirdly enough collect finished without further errors.

However, further down the pipeline preliminary_output stopped due to ProtectedOutputException. Executing chmod -R +w ./workdir makes workdir/insertions.db writeable again, but when I restart the run this is reset and the error appears again.

Do I need to wrap the command into the while loop as explained for single-machine execution, or is there another workaround for this?

I'm very sorry for all the questions, and happy to share my cluster.ymlfile after completion.

a-ludi commented 3 years ago

@rotoke I found a bug in the Snakemake workflow that may cause your current issue. Could you please update to DENTIST v2.0.0 and see if the problem stays? This easiest update path is to start from scratch. You could copy the workdir/*.las files to the new workdir to avoid recomputing them.

rotoke commented 3 years ago

Hi Arne,

Thank you very much for the update. I took dentist_v2.0.0.sif and the scripts from the new example folder, changed all the parameters (BTW: the snakemake.yml file still points to dentist_v1.0.2.sif), and managed to run dentist to completion with my data (the pipeline still stopped after collect, but could be restarted after chmod -R +w ./workdir):

Looking at the bam file, dentist closed 10 of the 2830 gaps in the assembly. I assume that this is expected for a de-novo assembly, and while other software may fill more gaps, the filled-in sequences are most likely erroneous?
I noticed that all scaffolds got renamed in the gap-closed assembly even though I did not enable the option to merge scaffolds. Is there any option to easily restore the original names?
Also, the entire assembly got converted to lowercase instead of a mixture of upper/lowercase letters to denote sequence quality (Not a big issue though as subsequent polishing will re-establish this information).
Counting gaps before and after I realised that there are not 10 but 12 gaps less in the gap-filled assembly. I was wondering why two of them did not appear in the bam file, and fished out the corresponding sequences:

Gap 1 before filling. This one is in a repetitive region at the very beginning of a scaffold:

>Scaffold_4__2_contigs__length_148605524
cgacttccggaacattttatcggtaatattgaaactactagatgatgttcggttacccgtttgggttttggactcgatat
ttttgccgtcccgttgcggcaccggctaattttttttcgatatcgcgttttgggcatctgcgggcttcaattcgacttcc
ggaacattttatcggtaatattgaaactactagatgatgttcggttacccggtttgggttttggactcgattatattttt
cgtcgcgttgccccaggcagccgcctacNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAGACCCTAGACCCTATACCATATACCCT
AAACCATAGACCCTAGACCGTAGACCCTAGACCGTAGGCCCTAGACCCCAGACCCTAGACCCAAGACCCTAGACCGTAAA
CCCTAGACCGTATATTGTAGACCATAGACCGTATAACCTAAACCGTAGACCCTAGATCCTAGACCCTAGACCGTAGACCG
TAGACCGTAAACCGTAGACCCTAGACCGTAGACCCTAGAACCAAGACCGTAGACCGTAGACCCTAGACCCTAGACCCGAG

Gap 1 after filling. It looks like the whole gap and sequence upstream got deleted:

>scaffold-1645
ccctagaccctagaccctataccatataccctaaaccatagaccctagaccgtagaccctagaccgtaggccctagaccc
cagaccctagacccaagaccctagaccgtaaaccctagaccgtatattgtagaccatagaccgtataacctaaaccgtag
accctagatcctagaccctagaccgtagaccgtagaccgtaaaccgtagaccctagaccgtagaccctagaaccaagacc
gtagaccgtagaccctagaccctagacccgagaccctagaccgtagaccctagaccaatgaccctagaccttagacccta
gaccatagaccctagaccctagaccctagaccttagcacctagaccctagaccgtagaccctagaccgtagaccctagac
cgtagaccctagaccgtagaccctagaccgtagaccttagaccctagacaataaaccctagaccgtaaaccctagaccgt
agaccctagaccctagaccgtaaaccttagaccgtagacccaagaccgtaaaccctagaccgtagaccctagacccaaga
ccttagagcctagaccctagaccgtagaccctagaccgttgacctaagaccctagaccgtagaccctagaccgtagatcc

Gap 2 before. Here, two small gaps are close to one another:

>Scaffold_3__2_contigs__length_195774257:125991876-125992176
TCCTCGTGATTTGATGAAATTGCCGAATTATTTCTATGGAATCACAAAGTTGCTCATATNNNNNNNNNNNNNNNNNNNNN
NNttactatgtaacaaatacatacaagtccctatactattacaatttcttggtcttacaactttctgctgcctcggtagt
ttgctcaaggtctgagggttgagtatctactgcttcttcagtagttttctcaacaatattttcttcagcagttacgattt
cctcagacgaaatctttttaacttgtttcttttgttgctttccttctttcttctttgaaggtaaccttgagggagaggtt
gcaatggttttttgcttctttgaagaaacaacagctgttgaagagtcgccttctatgattttgacatNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNaaagcacaagaagccaaagagggtttcatggccgagaaggatcattgagttaaaggaagagagagcaaatttt
tgcctcatgggtaaaatagaagaatca

Gap 2 after. It looks like instead of being filled, the two gaps got merged into one big one:

scaffold-980
tcctcgtgatttgatgaaattgccgaattatttctatggaatcacaaagttgctcatatnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnaaagcacaagaagccaaagagggtttcatggccgagaaggatcattgagttaaaggaagagagagcaaatttt
tgcctcatgggtaaaatagaagaatca

Is this expected behaviour, or did something go wrong? (I still had a few errors during the run, which seemed to complete just fine after restarting).

Thank you! Roman

rotoke commented 3 years ago

PS: As promised here my cluster.yml file. This worked for a 1 Gbp Hi-C scaffolded PacBio assembly with 2826 gaps, and a PacBio CLR read set with 18 million reads and an N50 of ca. 8.8 Kbp. Note that I gave a lot of time to collect; this timed out after 8h initially, and I just wanted to make sure it completes the second try as we are charged per CPUh on this cluster.

__default__:
    jobname:    dentist.{rule}
    partition:  partition
    nCPUs:  "{threads}"

mask_dust:
    jobname:    dentist.{rule}.{wildcards.dam}
    memory:     4096
    time:   "01:00"

tandem_alignment_block:
    jobname:    dentist.{rule}.{wildcards.dam}.{wildcards.block}
    memory:     32768
    time:   "00:30"

tandem_alignment:
    jobname:    dentist.{rule}.{wildcards.dam}
    memory:     1024
    time:   "00:30"

mask_tandem_block:
    jobname:    dentist.{rule}.{wildcards.dam}.{wildcards.block}
    memory:     32768
    time:   "00:30"

self_alignment_block:
    jobname:    dentist.{rule}.{wildcards.dam}.{wildcards.block_ref}.{wildcards.block_reads}
    memory:     32768
    time:   "04:00"

self_alignment:
    jobname:    dentist.{rule}.{wildcards.dam}
    memory:     8192
    time:   "02:00"

mask_self:
    jobname:    dentist.{rule}.{wildcards.dam}
    memory:     32768
    time:   "24:00"

ref_vs_reads_alignment_block:
    jobname:    dentist.{rule}.{wildcards.block_reads}
    memory:     32768
    time:   "02:00"

ref_vs_reads_alignment:
    memory:     8192
    time:   "08:00"

reads_vs_ref_alignment:
    memory:     1024
    time:   "01:00"

mask_reads:
    memory:     8192
    time:   "48:00"

propagate_mask:
    jobname:    dentist.propagate_mask
    memory:     32768
    time:   "60:00"
    nCPUs:  8

collect:
    memory:     32768
    time:   "24:00:00"

process:
    jobname:    dentist.{rule}.{wildcards.batch_id}
    memory:     16384
    time:   "24:00"

preliminary_output:
    memory:     16384
    time:   "16:00"

preliminary_gap_closed_vs_reads_alignment_block:
    jobname:    dentist.{rule}.{wildcards.block_reads}
    memory:     32768
    time:   "00:30"

preliminary_gap_closed_vs_reads_alignment:
    memory:     8192
    time:   "04:00"

validate_regions_block:
    jobname:    dentist.{rule}.{wildcards.block_ref}
    memory:     32768
    time:   "02:00"

unpurged_output:
    memory:     16384
    time:   "01:00"

purged_output:
    memory:     16384
    time:   "01:00"

a-ludi commented 3 years ago

(BTW: the snakemake.yml file still points to dentist_v1.0.2.sif)

I fixed that in the v2.0.0-2 release of the example. The tarball was not properly uploaded the first time and I fixed that now.

Looking at the bam file, dentist closed 10 of the 2830 gaps in the assembly. I assume that this is expected for a de-novo assembly, and while other software may fill more gaps, the filled-in sequences are most likely erroneous?

Yes, that fits my experience. In theory, there the assembler should have made every effort to build a contiguous assembly and any gap closer should not be able to do much more.

I noticed that all scaffolds got renamed in the gap-closed assembly even though I did not enable the option to merge scaffolds. Is there any option to easily restore the original names?

Currently, there is no easy option. However, the AGP could help after a small modification I have on my todo list anyway: the current AGP file lists the source coordinates in "Dazzler" coordinates. This is convenient for me because I work with the Dazzler formats most of the time but certainly unwieldy for the average user of DENTIST. After switching to FASTA coordinates the AGP file contains the new and original scaffold header of every scaffold like this:

scaffold-1  1   1841737 1   W   translocated_gaps_48    1   1841737 -   na
^^^^^^^^^^              ^   ^   ^
new scaffold name       |   |   old scaffold name
                    |   the first part is always a copy of the first
                    |   contig of the input scaffold
                    only look at the first part of each output scaffold

One can generate a translation table with awk -F'\t' -v'OFS=\t' '($4 == 1 && $5 == "W") { print $6, $1; }' gap-closed.agp.

I will also put a todo on my list to use the original scaffold names if in scaffoldGaps or scaffolds mode.

Also, the entire assembly got converted to lowercase instead of a mixture of upper/lowercase letters to denote sequence quality (Not a big issue though as subsequent polishing will re-establish this information).

DENTIST outputs all unmodified bases in lower case and inserted bases in upper case. It is a bit complicated to implement preserved casing because this information is discarded when converting the input assembly to Dazzler database. It is certainly possible and I am willing to implement it if the need is urgent.

Counting gaps before and after I realised that there are not 10 but 12 gaps less in the gap-filled assembly.

Yes, the reason is that very small contigs are discarded when converting to Dazzler database because DALIGNER cannot map reads to small contigs. I have this issue on my list but it is rather complicated to recapture them when assembling the gap-closed scaffolds, so I never got around fixing it. I will push it up in the priorities. :wink:

rotoke commented 3 years ago

Dear Arne,

Thank you very much again for your replies. I played around a bit with the software after completion and did a few more runs, which largely confirmed your predictions:

Yes, that fits my experience. In theory, there the assembler should have made every effort to build a contiguous assembly and any gap closer should not be able to do much more.

My assembly contains 6 chromosome-sized scaffolds and ca. 1730 short residual fragments. I ran dentist with "join-policy": "scaffolds"to see whether any of these would merge, which was not the case.

you can safely keep the haplotigs out of the assembly. DENTIST does not rely on having a single best mapping for every read but rather considers every chain of local alignments as a possibly true origin. Including the haplotigs could still avoid haplotype errors but at the expense of fewer closed gaps.

(from #20 ). I did another run with all haplotigs included, which reduced the number of closed gaps from 10 to 1.

In summary, default settings with the purged assembly seems to give the best results, so I will continue the assembly pipeline with this one.

Thank you again, and all the best, Roman

a-ludi commented 3 years ago

Thanks for sharing your experiences, Roman!

-- Arne