a-ludi / dentist

Close assembly gaps using long-reads at high accuracy.
https://a-ludi.github.io/dentist/
MIT License

md5checksum shows example dataset analysis fails #13

Open RishiDeKayne opened 3 years ago

RishiDeKayne commented 3 years ago

Hi, I've been trying to use DENTIST on the provided example dataset, but a number of the md5 checksums fail after the run finishes, with no other errors that I can find.

I installed snakemake v6.0.0 and singularity v3.6.3 through conda and ran through the example dataset as follows:

wget https://bds.mpi-cbg.de/hillerlab/DENTIST/dentist-example.v1.0.1.tar.gz
tar -xzf ./dentist-example.v1.0.1.tar.gz
cd dentist-example

# run the workflow
SKIP_LACHECK=1 snakemake --configfile=snakemake.yaml --use-singularity --cores=4 

# validate the files
md5sum -c checksum.md5
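
For reference, the conda environment behind these commands was set up roughly like this (a sketch; the channels and exact invocation are assumptions, only the tool versions come from the report above):

# hypothetical setup; versions as reported, channels assumed
conda create -n dentist-example -c conda-forge -c bioconda snakemake=6.0.0 singularity=3.6.3
conda activate dentist-example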

The checksum output, however, was as follows:

gap-closed.fasta: FAILED
workdir/.assembly-test.bps: OK
workdir/.assembly-test.dentist-reads.anno: OK
workdir/.assembly-test.dentist-reads.data: OK
workdir/.assembly-test.dentist-self.anno: OK
workdir/.assembly-test.dentist-self.data: OK
workdir/.assembly-test.dust.anno: OK
workdir/.assembly-test.dust.data: OK
workdir/.assembly-test.hdr: OK
workdir/.assembly-test.idx: OK
workdir/.assembly-test.tan.anno: OK
workdir/.assembly-test.tan.data: OK
workdir/.gap-closed-preliminary.bps: FAILED
workdir/.gap-closed-preliminary.dentist-self.anno: FAILED
workdir/.gap-closed-preliminary.dentist-self.data: FAILED
workdir/.gap-closed-preliminary.dentist-weak-coverage.anno: FAILED
workdir/.gap-closed-preliminary.dentist-weak-coverage.data: FAILED
workdir/.gap-closed-preliminary.dust.anno: FAILED
workdir/.gap-closed-preliminary.dust.data: FAILED
workdir/.gap-closed-preliminary.hdr: OK
workdir/.gap-closed-preliminary.idx: FAILED
workdir/.gap-closed-preliminary.tan.anno: FAILED
workdir/.gap-closed-preliminary.tan.data: FAILED
workdir/.reads.bps: OK
workdir/.reads.idx: OK
workdir/assembly-test.assembly-test.las: OK
workdir/assembly-test.dam: OK
workdir/assembly-test.reads.las: OK
workdir/gap-closed-preliminary.dam: FAILED
workdir/gap-closed-preliminary.fasta: FAILED
workdir/gap-closed-preliminary.gap-closed-preliminary.las: FAILED
workdir/gap-closed-preliminary.reads.las: FAILED
workdir/reads.db: OK
md5sum: WARNING: 15 computed checksums did NOT match

Any advice on how to get the example dataset running would be greatly appreciated. Thanks, Rishi

a-ludi commented 3 years ago

Hi Rishi, could you share one of the logs/process.*.log files? Somebody else experienced failing md5sums like yours, and the reason was that one of the auxiliary tools crashed in most of its calls for an as-yet-unknown reason. Could you also share some more information about your system?

lsb_release -a
free -h

RishiDeKayne commented 3 years ago

Sure, I have attached process.1.log and the system info is as follows:

lsb_release -a

output:

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.1 LTS
Release:        18.04
Codename:       bionic

free -h

output:

              total        used        free      shared  buff/cache   available
Mem:           995G        1.8G        229G        324K        764G        988G
Swap:          8.0G        1.5G        6.5G

process.1.log

a-ludi commented 3 years ago

As I suspected, it is the same memory-associated error:

$ jq 'select((.exitStatus // 0) != 0)' process.1.log | head -n50
{
  "thread": 140513968151344,
  "logLevel": "diagnostic",
  "state": "post",
  "command": [
    "computeintrinsicqv",
    "-d19",
    "/tmp/dentist-processPileUps-OeaddP/pileup-55b-56f.db",
    "/tmp/dentist-processPileUps-OeaddP/pileup-55b-56f.pileup-55b-56f-chained-filtered.las"
  ],
  "output": [
    "allocation failure: Invalid argument cachelinesize=0 requested size is 24",
    "AutoArray<unsigned long,alloc_type_memalign_cacheline> failed to allocate 3 elements (24 bytes)",
    "current total allocation 467987",
    "",
    ""
  ],
  "exitStatus": 1,
  "timestamp": 637514242850290800,
  "action": "execute",
  "type": "command"
}
... (many more instances with the same signature)
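
To gauge how widespread the failure is, the same filter can be aggregated by command name (a sketch; it assumes the log is JSON lines with the fields shown above):

# count failing invocations per tool (jq and GNU coreutils assumed)
$ jq -r 'select((.exitStatus // 0) != 0) | .command[0]' process.1.log | sort | uniq -c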

The problem is clearly not related to a lack of memory. Since I have no in-depth knowledge of computeintrinsicqv, I will ask the author for help.

In the meantime, you may try running it on a different machine.

a-ludi commented 3 years ago

Information from another user:

$ free -h
              total        used        free      shared  buff/cache   available
Mem:          1.0Ti        15Gi       2.5Gi       4.1Gi       989Gi       982Gi
Swap:            0B          0B          0B
$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="8 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="8"

RishiDeKayne commented 3 years ago

Hi again. Weirdly, I reran the example set on each of our computing nodes: it failed on every one of our big-memory machines but ran on our regular machines. I did the same system checks as above but can't find anything obviously different between the two, so I'm still not sure what could be causing it. In case it is helpful:

$ lsb_release -a

##WORKED - regular 
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.1 LTS
Release:        18.04
Codename:       bionic

##FAILED - big-memory 
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.1 LTS
Release:        18.04
Codename:       bionic

$ free -h

##WORKED - regular
              total        used        free      shared  buff/cache   available
Mem:           360G        5.2G        332G        4.4M         21G        354G
Swap:          8.0G        8.0G         88K

##FAILED - big memory  
              total        used        free      shared  buff/cache   available
Mem:           995G        1.8G        229G        324K        764G        988G
Swap:          8.0G        1.5G        6.5G

On the regular machines, all checksum outputs now say 'OK'.
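
For anyone repeating this machine-to-machine comparison, a small sketch to capture identical system snapshots for diffing (the node names are hypothetical):

# collect the same snapshot from each node, then compare
for node in regular01 bigmem01; do
    ssh "$node" 'lsb_release -ds; free -h' > "sysinfo.$node.txt"
done
diff sysinfo.regular01.txt sysinfo.bigmem01.txt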

a-ludi commented 3 years ago

Hmm, interesting. I will try running the example on a 1TB memory machine as well. Maybe there is some bug related to large pointers.

shri1984 commented 3 years ago

Hi, I have the same issue: md5sum -c checksum.md5 fails (15 cases). I am using a machine with 2 TB RAM (Ubuntu).

a-ludi commented 3 years ago

I tried it on one of our big memory machines and it worked as expected:

# submit job with 8 cores
$ sbatch -c8 -pbigmem --wrap='snakemake --configfile=snakemake.yaml --use-singularity --cores=$SLURM_JOB_CPUS_PER_NODE'
# memory information about the machine
$ ssh r01n03 free -h
              total        used        free      shared  buff/cache   available
Mem:           1.0T        964G         40G        1.6G        2.5G         39G
Swap:            0B          0B          0B
# OS information about the machine
$ ssh r01n03 lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.4.1708 (Core) 
Release:        7.4.1708
Codename:       Core

So I conjecture (:smile:) that it is not just the amount of total or available memory that causes the bug. But I still have no clue what's going on. Also, I have not heard anything from the author of daccord (see this issue). I will keep digging.

a-ludi commented 3 years ago

@shri1984 @RishiDeKayne I hope you are still interested in DENTIST after all this time but I think I have fixed the bug (25f96d2161e3345283553e51671b702fcf73ce45). I would be very happy if you could test the example again and see if it works.

The issue (likely) was that I used Alpine Linux in the container, which has its own libc implementation (musl) that is not 100% compatible with the glibc used in common distros like Ubuntu. I switched the base image to Ubuntu and the error went away on one of my machines.
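
To verify which libc a given container image ships, a check along these lines should work (a sketch; the image file name is an assumption):

# musl installs its dynamic linker under /lib/ld-musl-*; glibc does not
$ singularity exec dentist.sif sh -c 'ls /lib/ld-musl-* 2>/dev/null && echo "musl (Alpine-style)" || echo "glibc (or other)"'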

shri1984 commented 3 years ago

Thanks @a-ludi. The example dataset went fine, including the md5sum check; the latest version helped. I am now trying DENTIST on my Hi-C-scaffolded HiFi assembly and will post an update here.

lizhao007 commented 1 year ago

Thanks for your work, but I get the same issue with the example data using the latest version (v4.0.0): md5sum -c checksum.md5 fails (15 cases). The information about the machine is:

              total        used        free      shared  buff/cache   available
Mem:           2.0T        535G        1.4T         56M        5.7G        1.4T
Swap:          4.0G        2.1G        1.9G

LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.5.1804 (Core) 
Release:        7.5.1804
Codename:       Core

a-ludi commented 1 year ago

Hi @lizhao007,

could you please share the list of files that failed the checksum test? I need it to get an idea of what went wrong.
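
For reference, a quick way to print only the failing entries (assumes GNU coreutils md5sum, which writes per-file results to stdout and the summary warning to stderr):

$ md5sum -c checksum.md5 2>/dev/null | grep -v ': OK$'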