Open RishiDeKayne opened 3 years ago
Hi Rishi, could you share one of the logs/process.*.log files? Somebody else experienced failing md5sums like you do, and the reason was that one of the auxiliary tools crashed in most of the calls for an as yet unknown reason. Could you also share some more information about your system?
lsb_release -a
free -h
Sure, I have attached process.1.log and the system info is as follows:
lsb_release -a
output:
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.1 LTS
Release: 18.04
Codename: bionic
free -h
output:
total used free shared buff/cache available
Mem: 995G 1.8G 229G 324K 764G 988G
Swap: 8.0G 1.5G 6.5G
As I suspected, it is the same memory-associated error:
$ jq 'select((.exitStatus // 0) != 0)' process.1.log | head -n50
{
"thread": 140513968151344,
"logLevel": "diagnostic",
"state": "post",
"command": [
"computeintrinsicqv",
"-d19",
"/tmp/dentist-processPileUps-OeaddP/pileup-55b-56f.db",
"/tmp/dentist-processPileUps-OeaddP/pileup-55b-56f.pileup-55b-56f-chained-filtered.las"
],
"output": [
"allocation failure: Invalid argument cachelinesize=0 requested size is 24",
"AutoArray<unsigned long,alloc_type_memalign_cacheline> failed to allocate 3 elements (24 bytes)",
"current total allocation 467987",
"",
""
],
"exitStatus": 1,
"timestamp": 637514242850290800,
"action": "execute",
"type": "command"
}
... (many more instances with the same signature)
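For anyone who wants to check their own process.*.log without jq, here is a minimal Python sketch of the same filter. It assumes the log is a stream of JSON objects with the fields shown above (exitStatus, command, output); the function names are made up for illustration and are not part of DENTIST.

```python
import json
from collections import Counter


def iter_json_values(text):
    """Yield each JSON value from a stream of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        value, pos = decoder.raw_decode(text, pos)
        yield value


def failed_commands(text):
    """Mirror jq's select((.exitStatus // 0) != 0)."""
    return [
        entry
        for entry in iter_json_values(text)
        if isinstance(entry, dict) and (entry.get("exitStatus") or 0) != 0
    ]


def summarize(text):
    """Count failures by (tool name, first non-empty output line)."""
    counts = Counter()
    for entry in failed_commands(text):
        command = entry.get("command") or ["<unknown>"]
        first_line = next((line for line in entry.get("output", []) if line), "")
        counts[(command[0], first_line)] += 1
    return counts
```

Calling summarize(open("logs/process.1.log").read()) would then show at a glance whether all failures share one signature, as they do in the excerpt above.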
The problem is clearly not related to a lack of memory. Since I have no in-depth knowledge of computeintrinsicqv, I will ask the author for help.
In the meantime, you may try running it on a different machine.
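The log line reports cachelinesize=0 together with "Invalid argument". As a hedged diagnostic sketch: if the tool passes the detected cache line size as the alignment to an aligned allocator such as posix_memalign (an assumption, the thread does not confirm this), an alignment of 0 would be rejected regardless of how much memory is free. The snippet below reads the cache line size the Linux kernel exposes in sysfs and checks whether a value would be a valid posix_memalign alignment. Whether computeintrinsicqv obtains its value from this source is not known.

```python
from pathlib import Path

# Standard Linux sysfs location; may be absent in containers or on other systems.
DEFAULT_PATH = Path("/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size")


def read_cache_line_size(path=DEFAULT_PATH):
    """Return the cache line size in bytes, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def is_valid_alignment(alignment, pointer_size=8):
    """posix_memalign requires a power of two that is a multiple of sizeof(void *)."""
    if alignment <= 0:
        return False
    is_power_of_two = alignment & (alignment - 1) == 0
    return is_power_of_two and alignment % pointer_size == 0


if __name__ == "__main__":
    size = read_cache_line_size()
    print("cache line size:", size)
    print("usable as alignment:", size is not None and is_valid_alignment(size))
```

Running this on both a failing and a working node (inside and outside the container) is one way to see whether the reported cache line size differs between them.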
Information from other user:
$ free -h
total used free shared buff/cache available
Mem: 1.0Ti 15Gi 2.5Gi 4.1Gi 989Gi 982Gi
Swap: 0B 0B 0B
$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="8 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="8"
Hi again. Weirdly, I reran the example set on each of our computing nodes: it failed on every one of our big-memory machines but ran on our regular machines. I did the same system checks as above but can't find anything obviously different between the two, so I'm still not sure what could be causing it. In case it is helpful:
$ lsb_release -a
##WORKED - regular
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.1 LTS
Release: 18.04
Codename: bionic
##FAILED - big-memory
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.1 LTS
Release: 18.04
Codename: bionic
$ free -h
##WORKED - regular
total used free shared buff/cache available
Mem: 360G 5.2G 332G 4.4M 21G 354G
Swap: 8.0G 8.0G 88K
##FAILED - big memory
total used free shared buff/cache available
Mem: 995G 1.8G 229G 324K 764G 988G
Swap: 8.0G 1.5G 6.5G
On the regular machines, all checksum outputs now say 'OK'.
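To compare the free -h listings from a working and a failing node field by field, a small parser can help. This is only a sketch: it assumes the column layout shown above and treats both the K/M/G/T and Ki/Mi/Gi/Ti suffixes as powers of 1024. The function names are hypothetical.

```python
import re

_EXPONENT = {"B": 0, "K": 1, "M": 2, "G": 3, "T": 4, "P": 5}


def parse_size(token):
    """Convert a size such as '995G', '1.0Ti' or '0B' to bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([BKMGTP])i?", token)
    if match is None:
        raise ValueError(f"unrecognized size: {token!r}")
    number, unit = match.groups()
    return float(number) * 1024 ** _EXPONENT[unit]


def parse_free(text):
    """Parse `free -h` output into {row label: {column name: bytes}}."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    table = {}
    for line in lines[1:]:
        label, *values = line.split()
        # The Swap row has fewer columns; zip stops at the shorter sequence.
        table[label.rstrip(":")] = {
            name: parse_size(value) for name, value in zip(header, values)
        }
    return table


def differing_fields(a, b, row="Mem"):
    """Return {column: (value in a, value in b)} for columns that differ."""
    return {
        name: (a[row][name], b[row][name])
        for name in a[row]
        if name in b[row] and a[row][name] != b[row][name]
    }
```

Feeding the two listings above into parse_free and then differing_fields gives a side-by-side view of every memory figure that differs between the regular and the big-memory machine.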
Hmm, interesting. I will try running the example on a 1TB memory machine as well. Maybe there is some bug related to large pointers.
Hi, I have the same issue. md5sum -c checksum.md5 failed (15 cases). I am using a machine with 2 TB RAM (Ubuntu).
I tried it on one of our big memory machines and it worked as expected:
# submit job with 8 cores
$ sbatch -c8 -pbigmem --wrap='snakemake --configfile=snakemake.yaml --use-singularity --cores=$SLURM_JOB_CPUS_PER_NODE'
# memory information about the machine
$ ssh r01n03 free -h
total used free shared buff/cache available
Mem: 1.0T 964G 40G 1.6G 2.5G 39G
Swap: 0B 0B 0B
# OS information about the machine
$ ssh r01n03 lsb_release -a
LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.4.1708 (Core)
Release: 7.4.1708
Codename: Core
So I conjecture (:smile:) that it is not just the amount of total or available memory that causes the bug. But I still have no clue what's going on. Also, I have not heard anything from the author of daccord (see this issue). I will keep digging.
@shri1984 @RishiDeKayne I hope you are still interested in DENTIST after all this time but I think I have fixed the bug (25f96d2161e3345283553e51671b702fcf73ce45). I would be very happy if you could test the example again and see if it works.
The issue (likely) was that I used Alpine Linux in the container, which has its own libc implementation that is not 100% compatible with the glibc used in common distros like Ubuntu. I switched to Ubuntu and the error went away on one of my machines.
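To check which distribution (and therefore, most likely, which libc) a given container image is based on, one can parse its /etc/os-release, the same file quoted earlier in this thread. The sketch below is a heuristic only: the mapping from distribution ID to libc covers just the distributions mentioned here (Alpine ships musl; Ubuntu, Debian, CentOS, RHEL and Fedora ship glibc), and the function names are made up for illustration.

```python
import shlex


def parse_os_release(text):
    """Parse the KEY=value lines of an os-release file into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        parts = shlex.split(raw)  # strips the optional shell-style quotes
        info[key] = parts[0] if parts else ""
    return info


def likely_libc(info):
    """Guess the libc from ID / ID_LIKE (heuristic, not authoritative)."""
    ids = {info.get("ID", "")} | set(info.get("ID_LIKE", "").split())
    if "alpine" in ids:
        return "musl"
    if ids & {"ubuntu", "debian", "centos", "rhel", "fedora"}:
        return "glibc"
    return "unknown"
```

Inside the container, likely_libc(parse_os_release(open("/etc/os-release").read())) would report "musl" for an Alpine-based image and "glibc" for an Ubuntu-based one.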
Thanks @a-ludi. The example data set ran fine, including the md5sum check. The latest version helped. I am now trying DENTIST on my Hi-C scaffolded HiFi assembly and will post an update here.
Thanks for your work, but I get the same issue with the example data using the latest version (v4.0.0): md5sum -c checksum.md5 failed (15 cases). The information about the machine is:
total used free shared buff/cache available
Mem: 2.0T 535G 1.4T 56M 5.7G 1.4T
Swap: 4.0G 2.1G 1.9G
LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.5.1804 (Core)
Release: 7.5.1804
Codename: Core
Hi @lizhao007,
could you please share the list of files that failed the checksum test? I need it to get an idea of what went wrong.
Hi, I've been trying to use DENTIST on the provided example dataset, but a number of the md5 checksums fail after it finishes running, with no other errors that I can find.
I installed snakemake v6.0.0 and singularity v3.6.3 through conda and ran through the example dataset as follows:
but the checksum output was as follows:
Any advice on how to get the example dataset running would be greatly appreciated. Thanks, Rishi