GATB / bcalm

compacted de Bruijn graph construction in low memory
MIT License
99 stars 20 forks source link

some cDBG links seem not to be bidirectional #34

Closed ctb closed 6 years ago

ctb commented 6 years ago

in spacegraphcats, we are consuming the output of your excellent tool and doing (or attempting to do) Clever Graph Things with it. (thank you for bcalm!) however, we are running into a problem that in some cases there are nodes u, v where node u has an edge to v but node v does not have an edge to u.

more specifically, we see a number of examples like the following two headers in the unitigs output:

>206849 LN:i:41 KC:i:96 km:f:8.7  L:-:55026:- L:-:69342:-  L:+:103808:- L:+:203510:-
...
>69342 LN:i:134 KC:i:2260 km:f:21.7  L:+:69614:- L:+:155497:-  L:-:61581:+ L:-:167785:+
...

which show that 69342 is in the edge list of 206849, but 206849 is not in the edge list of 69342.

We have code to detect this here,

https://github.com/spacegraphcats/spacegraphcats/pull/190/commits/dc9a9cc8c2b07b27b2547341748f67b809e1025f

and I am happy to provide full execution instructions should you wish (should be: checkout that branch; pip install -r requirements.txt; make twofoo.fq.gz; conf/run twofoo build)

your thoughts appreciated!


output from latest bcalm master is here: https://osf.io/zgx3r/ (generated with bcalm built from commit c41f70fe0ba on my mac os x laptop)

input files are generated from the rules here,

briefly

curl -o akker-reads.abundtrim.gz -L https://osf.io/dk7nb/download
curl -L 'https://osf.io/7az9p/?action=download' > shew-reads.abundtrim.gz
gunzip -c shew-reads.abundtrim.gz akker-reads.abundtrim.gz | gzip -9c > twofoo.fq.gz

and then the command line execution is:

bcalm -in bcalm.twofoo.k31.inputlist.txt -out twofoo/bcalm.twofoo.k31 -kmer-size 31 -abundance-min 1

rchikhi commented 6 years ago

can you confirm that this bug also occurs on linux? I can't reproduce it.

continuing on the 'briefly' script:

wget https://github.com/GATB/bcalm/releases/download/v2.2.0/bcalm-binaries-v2.2.0-Linux.tar.gz
wget https://raw.githubusercontent.com/GATB/bcalm/master/scripts/convertToGFA.py
tar xf bcalm-binaries-v2.2.0-Linux.tar.gz
bcalm-binaries-v2.2.0-Linux/bin/bcalm -in twofoo.fq.gz -abundance-min 1
python convertToGFA.py twofoo.fq.unitigs.fa twofoo.fq.unitigs.gfa 31
python detect_non_reci_edges.py twofoo.fq.unitigs.gfa |wc -l

with detect_non_reci_edges.py containing:

import sys
edges = set()
for line in open(sys.argv[1]):
    if not line.startswith('L'): continue
    line = line.split()
    a,b = map(int,tuple((line[1],line[3])))
    edges.add((a,b))

for a,b in edges:
    if (b,a) not in edges:
        print(a,b)

but on the unitigs file there https://osf.io/zgx3r/:

python convertToGFA.py bcalm.twofoo.k31.unitigs.fa bcalm.twofoo.k31.unitigs.gfa 31
python detect_non_reci_edges.py bcalm.twofoo.k31.unitigs.gfa |wc -l

returns 20889, indicating that I can confirm that there is an issue somewhere.

ctb commented 6 years ago

got it

rchikhi commented 6 years ago

btw reload this page, I edited the comment before

ctb commented 6 years ago

for a fresh build on our HPC of v2.2.0,

% python convertToGFA.py bcalm.twofoo.k31.unitigs.fa twofoo.fq.unitigs.gfa 31
GFA file open
done
% python detect_non_reci_edges.py  twofoo.fq.unitigs.gfa | wc -l
13874
ctb commented 6 years ago

(I can't run the downloaded version of bcalm on my HPC for whatever reason - library incompat)

ctb commented 6 years ago

hi @rchikhi I do not see non-reciprocal edges on the downloaded version using binder so it seems like there is maybe a compile time issue that is biting me on both Linux and Mac OS X!?

Any suggestions on how to go about debugging? I'm not doing anything weird on my mac, and I'm using gcc 4.4.7 on the linux box...?

ctb commented 6 years ago

Perhaps the next step is to compile it on the binder box. That requires some reconfig tho.

ctb commented 6 years ago

ok, I ended up compiling v2.2.0 on an AWS instance running Ubuntu. You'll be relieved to know that it works fine (i.e. all edges are reciprocal):

python3 detect.py twofoo.fq.unitigs.gfa | wc -l
0

so both my mac and my linux HPC have issues!? Any ideas or thoughts?

rchikhi commented 6 years ago

hey, thanks for investigating. What are the compiler versions on those two machines that have issues?

ctb commented 6 years ago

Linux with problems:

gcc -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC) 

on Mac OS X with problems:

% /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc -v
Apple LLVM version 9.0.0 (clang-900.0.39.2)
Target: x86_64-apple-darwin16.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

(this is what cmake is using to build bcalm)


on the AWS instance, which works properly:

% gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 5.4.0-6ubuntu1~16.04.9' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9)
rchikhi commented 6 years ago

on the Linux with gcc 4.4.7, I'm not sure how you got it to run. I've just tried on a old CentOs box with the same gcc version, the pre-compiled binary complains about my libc version being too old (version GLIBC_2.14 not found) and the source won't compile with that gcc version.

rchikhi commented 6 years ago

I tested on a similar linux as yours (Linux dx405-s06 2.6.32-696.23.1.el6.x86_64 #1 SMP Tue Mar 13 22:44:18 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux) that had gcc 4.4.7 by default, but I could only compile bcalm with a newer gcc (took 7.2 from conda). Still couldn't reproduce the bug

ctb commented 6 years ago

don't know what to tell you, I did build it natively on that version :)

any ideas for how to narrow down what might be going on? I am happy to perform analyses of unitig and k-mer content etc., which will be my next step by default.

rchikhi commented 6 years ago

I'm not sure where to find a OSX system, so let's focus on how you got that EL6 Linux with gcc4.4.7 to compile Bcalm. Can you please paste the output of the cmake step?

ctb commented 6 years ago

hmm, looks like I made a mistake reporting on the gcc version -- I just ran cmake again, and decided to rerun gcc -v:

% gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/opt/software/GCC/6.2/libexec/gcc/x86_64-pc-linux-gnu/6.2.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-6.2.0/configure --disable-multilib --prefix=/opt/software/GCC/6.2
Thread model: posix
gcc version 6.2.0 (GCC)

apologies. But, in any case, here's cmake . output. cmake.txt

ctb commented 6 years ago

Some updates:

using sourmash, I calculated the Jaccard similarity of the two output files, one generated on my Mac (as above) and one generated on AWS (as above). They are identical in k-mer content. Hooray!

for those same files, I then redid the GFA and reciprocal analysis step via the scripts you provided. The Mac output version had 16795 non-reciprocal edges, while the AWS version had 0.

So whatever the problem is, it's in the FASTA headers rather than in the sequence content. Which seems quite good :)

I then dug into the output a bit, and saw the following difference:

aws (good):

Glue partition 128 has 4046 sequences 
Glue partition 479 has 3830 sequences 
Glue partition 37 has 3721 sequences 
Glue partition 308 has 3667 sequences 
Glue partition 434 has 3531 sequences 
Glue partition 67 has 3455 sequences 
Glue partition 60 has 3419 sequences 
Glue partition 358 has 3396 sequences 
Glue partition 250 has 3360 sequences 
Glue partition 383 has 3350 sequences 

Mac (bad):

Glue partition 10 has 11625 sequences 
Glue partition 85 has 11545 sequences 
Glue partition 110 has 11510 sequences 
Glue partition 39 has 11363 sequences 
Glue partition 74 has 11283 sequences 
Glue partition 24 has 11253 sequences 
Glue partition 55 has 11223 sequences 
Glue partition 41 has 11221 sequences 
Glue partition 96 has 11179 sequences 
Glue partition 9 has 11162 sequences 

Despite this, both unitig files have virtually same size (!?) - I'm not quite sure how that's possible given the apparent differences in FASTA headers...

I attach both sets of output.

bcalm-bad-mac.txt bcalm-good-aws.txt

rchikhi commented 6 years ago

I appreciate that you keep investigating. I finally got around an issue with running the compiled bcalm on CentOS6 with gcc 6.1 installed using conda. (It was /usr/lib64/libstdc++.so.6: version GLIBCXX_3.4.20 not found and using LD_PRELOAD=$HOME/miniconda2/envs/gcc-61/lib/libstdc++.so.6.0.24 ./bcalm did the trick)... and.. guess what... still no edge problem. So, on the same Linux distro and almost the same GCC version (6.1 versus 6.2), still can't reproduce it.

rchikhi commented 6 years ago

The reason why unitigs have the same size and different glue step, could be due to the fact that Bcalm autodetects the the number of files that can be opened simultaneously on your system. To have deterministic execution, you can use -nb-cores 1 -nb-glue-partitions 100.

rchikhi commented 6 years ago

A possible way to investigate further would be to compile bcalm in debug mode on OSX and run valgrind.

cd build
rm -Rf *
cmake -D CMAKE_BUILD_TYPE=Debug ..
make -j 8
valgrind ./bcalm  -in twofoo.fq.gz -abundance-min 1 >/dev/null
camillescott commented 6 years ago

To deepen the mystery, on my HPCC (same system as Titus) account with same gcc module:

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/opt/software/GCC/6.2/libexec/gcc/x86_64-pc-linux-gnu/6.2.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-6.2.0/configure --disable-multilib --prefix=/opt/software/GCC/6.2
Thread model: posix
gcc version 6.2.0 (GCC)

And the detection script from @rchikhi, I am unable to reproduce.

(space) welcherc@dev-intel16:~/src/spacegraphcats$ python detect_bad_edges.py twofoo/bcalm.twofoo.k31.unitigs.gfa | wc -l
0
camillescott commented 6 years ago

And ldd -v on the compiled binary for @ctb and me shows exact same ABI versions and paths for everything...

camillescott commented 6 years ago

Aaahh but @ctb is on bcalm c8ac60252fa0b2abf511f7363cff7c4342dac2ee (Dec 20) and I'm on c41f70fe0ba78d0491fd65c7618994b3a1f86d7c (most recent, April 12), might be that...

camillescott commented 6 years ago

Trying it out on the older commit also fails to reproduce, but double checking it appears I forgot to revert the gatb submodule, so for now I'm gonna say that's where to look. Off to see Infinity War, good luck ;)

rchikhi commented 6 years ago

Hmm.. what about with the released 2.2.0 version? https://github.com/GATB/bcalm/releases/download/v2.2.0/bcalm-binaries-v2.2.0-Linux.tar.gz

camillescott commented 6 years ago

No joy; the glibc version is too old:

bcalm: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by bcalm)
(space) welcherc@dev-intel14:~/src/spacegraphcats$ /lib64/libc.so.6
GNU C Library stable release version 2.12, by Roland McGrath et al.
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.4.7 20120313 (Red Hat 4.4.7-18).
Compiled on a Linux 2.6.32 system on 2017-06-20.
Available extensions:
    The C stubs add-on version 2.1.2.
    crypt add-on version 2.1 by Michael Glad and others
    GNU Libidn by Simon Josefsson
    Native POSIX Threads Library by Ulrich Drepper et al
    BIND-8.2.3-T5B
    RT using linux kernel aio
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.
camillescott commented 6 years ago

Also just tried recompiling from @ctb's version of bcalm with the matching gatb version; still couldn't reproduce. I'm at a loss, seems like a gremlin to me.

rchikhi commented 6 years ago

on the same machine as him?!

rchikhi commented 6 years ago

BTW if you have any idea how to make the binary generated by Travis more compatible to old glibc's, I'd gladly take it :)

asl commented 6 years ago

@rchikhi We're using https://github.com/phusion/holy-build-box for making statically linked SPAdes binaries. Though these days we're having custom docker image with gcc 5.3 inside (but still hbb-based).

rchikhi commented 6 years ago

thanks for the reminder about hbb Anton. We do use an old Centos to build Minia and GATB stuff, but I wanted to experiment with Travis for bcalm. Combinin hbb and Travis seems uncharted territory. Some people hacked their way through but looks like lot of hassle.

ctb commented 6 years ago

@camillescott and others, I've automated the whole thing here and provided the data set in the github repo; this is all by way of seeing if someone else can replicate the thing :)

https://github.com/ctb/2018-bcalm-debug

On my laptop with a build of v2.2.0 of bcalm:

% snakemake
...
RESULT: 812 non-reciprocal edges detected; should be 0
...

On AWS with a build of latest:

RESULT: 0 non-reciprocal edges detected; should be 0

etc.

for @camillescott there are instructions on how to execute it using my environment on HPC in the README.

ctb commented 6 years ago

OK, using the new automated workflow, and starting from a clean checkout ov v2.2.0, I no longer see the problem on any of my platforms --

 rm -fr bcalm
git clone --recursive https://github.com/GATB/bcalm -b v2.2.0
cd bcalm
cmake .
make -j 8
sudo cp bcalm /usr/local/bin/bcalm
cd ../2018-bcalm-debug/
snakemake clean
snakemake all

works on Mac OS X, HPC, and AWS.

My current guess is that somehow I got stuck with a version mismatch with the gatb-core, and that starting from a fresh checkout of bcalm as above is the right solution there.

Thanks for bearing with, all!

rchikhi commented 6 years ago

Do you still have a version that had the bug? If so you could get maybe peace of mind by doing git branch inside the gatb-core folder and see if it was maybe detached. Anyhow, appreciate that it got resolved!

ctb commented 6 years ago

If I find it happening again, I will do so :)