dnbaker / dashing

Fast and accurate genomic distances using HyperLogLog
GNU General Public License v3.0

Compiling from release - empty module dirs and running in Singularity container #35

Closed mihkelvaher closed 4 years ago

mihkelvaher commented 4 years ago

Hi,

I'm planning to install dashing into a Singularity container (CentOS) but tried to install it on a server first (also CentOS).

-bash-4.2$ wget https://github.com/dnbaker/dashing/archive/v0.4.2.tar.gz
-bash-4.2$ tar -zxvf v0.4.2.tar.gz
-bash-4.2$ make
fatal: Not a git repository (or any parent up to mount point /serverhome)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
make: *** No rule to make target `sketch/include/sketch/cbf.h', needed by `src/dashing.o'. Stop.
-bash-4.2$ ls bonsai/ | wc -l
0
-bash-4.2$ ls distmat/ | wc -l
0
-bash-4.2$ ls khset | wc -l
0
-bash-4.2$ ls sketch/ | wc -l
0

The server has old gcc (4.8.5) but this is probably not the issue because making from cloned master breaks far later.

Unrelated: if no temporary files are created while building the distance matrix, is everything held in memory? How much memory should I expect it to use when running on thousands of assembled bacterial genomes (~5MB each)? I'm asking so that I can request the right HPC resource allocation.

Regards, Mihkel

dnbaker commented 4 years ago

Using the release directly won't work because it doesn't include dependencies. This just isn't possible on github currently.

I recommend building from a specific commit instead via:

git clone --recursive --single-branch --branch v0.4.2 https://github.com/dnbaker/dashing

Would this work for you?
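A quick way to confirm that a source tree has its dependencies populated before running make is to look for empty dependency directories. The sketch below is only an illustration: the directory names are the four checked with `ls ... | wc -l` earlier in this thread, and the function name is hypothetical.

```python
from pathlib import Path

# Directory names taken from the `ls ... | wc -l` checks earlier in this thread.
SUBMODULE_DIRS = ["bonsai", "distmat", "khset", "sketch"]


def empty_submodules(source_root, names=SUBMODULE_DIRS):
    """Return the dependency directories that are missing or contain no entries."""
    root = Path(source_root)
    empty = []
    for name in names:
        directory = root / name
        # A release tarball leaves these directories present but empty.
        if not directory.is_dir() or not any(directory.iterdir()):
            empty.append(name)
    return empty
```

An empty list suggests the recursive clone populated everything; any names returned point at directories that still need to be fetched.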

Regarding your question about memory consumption: with n genomes of 5 MB each, sketch size p (log2 of the number of bytes per sketch), and t threads, Dashing will use approximately

• n * 2^p bytes (for sketches)
• t * next_power_of_two(5e6) bytes (for the buffers used to read files in)

For reference (on all of RefSeq), the distance figure in our paper shows 100MB-1.4GB of memory for sketches with p from 10 to 14, so I'd guess that the answer for you is probably somewhere between 100MB and 500MB.
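The two terms of the estimate can be evaluated with a few lines of Python. This is a back-of-the-envelope sketch under the reading that each sketch takes 2^p bytes and each thread holds one read buffer. The function names and the example values (5000 genomes, p = 14, 16 threads) are illustrative assumptions, not figures from Dashing itself.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    n = int(n)
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def estimate_bytes(n_genomes, p, threads, genome_bytes=5_000_000):
    """Rough RAM estimate: sketch storage plus per-thread read buffers."""
    sketches = n_genomes * (1 << p)                      # n * 2^p bytes
    buffers = threads * next_power_of_two(genome_bytes)  # t * next_power_of_two(5e6)
    return sketches, buffers, sketches + buffers


# Hypothetical workload: 5000 genomes, p = 14, 16 threads.
sketches, buffers, total = estimate_bytes(5000, 14, 16)
print(f"sketches: {sketches / 1e6:.1f} MB, "
      f"buffers: {buffers / 1e6:.1f} MB, total: {total / 1e6:.1f} MB")
```

For these example values the read buffers outweigh the sketches, so the thread count matters about as much as the number of genomes when requesting memory.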

mihkelvaher commented 4 years ago

Didn't think of cloning a release branch! Thanks!

After a couple of days of trying, I can't get it to compile in a Singularity container (CentOS base). There seems to be an issue with bonsai (the same issue occurs when making dashing).

git clone --recursive https://github.com/dnbaker/bonsai.git
cd bonsai/
make

gcc  -Iclhash/include -I. -I.. -Ilibpopcnt -I.. -Iinclude -Icircularqueue -Izstd/zlibWrapper -Izstd/lib/common -Izstd/lib  -Ihll/vec -Ihll -Ihll/include -Ipdqsort -Iinclude/bonsai -Iinclude -Ihll/vec/blaze -DNDEBUG -c klib/kthread.c -o klib/kthread.o -lz
gcc  -Iclhash/include -I. -I.. -Ilibpopcnt -I.. -Iinclude -Icircularqueue -Izstd/zlibWrapper -Izstd/lib/common -Izstd/lib  -Ihll/vec -Ihll -Ihll/include -Ipdqsort -Iinclude/bonsai -Iinclude -Ihll/vec/blaze -DNDEBUG -c klib/kstring.c -o klib/kstring.o -lz
ls clhash.o 2>/dev/null || mv clhash/clhash.o . 2>/dev/null || (cd clhash && git checkout master && make && cd .. && ln -s clhash/clhash.o .)
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
make[1]: Entering directory '/tmp/test/bonsai/clhash'
cc -fPIC -std=c99 -O3 -msse4.2 -mpclmul -march=native -funroll-loops -Wstrict-overflow -Wstrict-aliasing -Wall -Wextra -pedantic -Wshadow -c ./src/clhash.c -Iinclude
cc -fPIC -std=c99 -O3 -msse4.2 -mpclmul -march=native -funroll-loops -Wstrict-overflow -Wstrict-aliasing -Wall -Wextra -pedantic -Wshadow -o unit ./tests/unit.c -Iinclude  clhash.o
g++ -fPIC -std=c++11 -O3 -msse4.2 -mpclmul -march=native -funroll-loops -Wstrict-overflow -Wstrict-aliasing -Wall -Wextra -pedantic -Wshadow -o cppunit ./tests/cppunit.cpp -Iinclude  clhash.o
cc -fPIC -std=c99 -O3 -msse4.2 -mpclmul -march=native -funroll-loops -Wstrict-overflow -Wstrict-aliasing -Wall -Wextra -pedantic -Wshadow -o benchmark ./benchmarks/benchmark.c -Iinclude  clhash.o
cc -fPIC -std=c99 -O3 -msse4.2 -mpclmul -march=native -funroll-loops -Wstrict-overflow -Wstrict-aliasing -Wall -Wextra -pedantic -Wshadow -o example example.c -Iinclude  clhash.o
g++ -fPIC -std=c++11 -O3 -msse4.2 -mpclmul -march=native -funroll-loops -Wstrict-overflow -Wstrict-aliasing -Wall -Wextra -pedantic -Wshadow -o cppexample cppexample.cpp -Iinclude  clhash.o
make[1]: Leaving directory '/tmp/test/bonsai/clhash'
g++ -O3 -funroll-loops -pipe -fno-strict-aliasing -march=native -mpclmul   -fopenmp -fno-rtti -std=c++14 -Wall -Wextra -Wno-char-subscripts -Wpointer-arith -Wwrite-strings -Wdisabled-optimization -Wformat -Wcast-align -Wno-unused-function -Wno-unused-parameter -pedantic -DUSE_PDQSORT -Wunused-variable -Wno-attributes -Wno-cast-align -Wno-gnu-zero-variadic-macro-arguments -Wno-ignored-attributes -Wno-missing-braces -DBONSAI_VERSION=\"v0.2.4\" -DNDEBUG -Iclhash/include -I. -I.. -Ilibpopcnt -I.. -Iinclude -Icircularqueue -Izstd/zlibWrapper -Izstd/lib/common -Izstd/lib  -Ihll/vec -Ihll -Ihll/include -Ipdqsort -Iinclude/bonsai -Iinclude -Ihll/vec/blaze -L.  clhash.o klib/kthread.o -DNDEBUG bin/fahist.cpp -o bin/fahist -lz
bin/fahist.cpp:15:12: fatal error: zlib.h: No such file or directory
 #  include <zlib.h>
            ^~~~~~~~
compilation terminated.
make: *** [Makefile:125: bin/fahist] Error 1

But the library itself exists:

head zlib/zlib.h
/* zlib.h -- interface of the 'zlib' general purpose compression library
  version 1.2.11, January 15th, 2017

  Copyright (C) 1995-2017 Jean-loup Gailly and Mark Adler

  This software is provided 'as-is', without any express or implied
  warranty.  In no event will the authors be held liable for any damages
  arising from the use of this software.

  Permission is granted to anyone to use this software for any purpose,

Building with the same commands works with no problems on OSX and in an Ubuntu VirtualBox VM.

mihkelvaher commented 4 years ago

I asked the admin to compile dashing and add it to our HPC. While dashing is able to show the help, running dashing dist with input files results in an error:

dashing dist testgzs/*
Dashing version: v0.4.2
Illegal instruction

The .gz files with the same command work on osx. The same Illegal instruction comes up also with releases dashing_s128 and dashing_s256 while using some sample fastas and dist. dashing_s512 gives Illegal instruction instead of help. Does this Illegal instruction mean there's a compiling error? Or could it be some Debian/Red Hat issue?

Edit: creating an Ubuntu container and running a release dashing binary from there results in the same error. Suspecting that it has something to do with the listed SSE2, AVX2, and AVX512BW, I checked /proc/cpuinfo, which showed that sse2 is present.
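The /proc/cpuinfo check can be scripted. The sketch below parses the flags line and reports which release binaries the CPU could plausibly run. The pairing of dashing_s128, dashing_s256 and dashing_s512 with sse2, avx2 and avx512bw is an assumption inferred from the names in this thread, and the function names are hypothetical. Having the flag is a necessary condition only, since a binary may use other instructions as well.

```python
def cpu_flags(cpuinfo_text):
    """Return the flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


# Assumed pairing of release binaries with instruction sets (not confirmed
# by the Dashing documentation; inferred from the names in this thread).
REQUIRED_FLAG = {
    "dashing_s128": "sse2",
    "dashing_s256": "avx2",
    "dashing_s512": "avx512bw",
}


def candidate_binaries(flags):
    """List the binaries whose assumed required flag the CPU advertises."""
    return [name for name, flag in REQUIRED_FLAG.items() if flag in flags]


# On a Linux node:
# with open("/proc/cpuinfo") as fh:
#     print(candidate_binaries(cpu_flags(fh.read())))
```

Running this on both the build node and the compute node shows whether the two advertise different instruction sets.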

dnbaker commented 4 years ago

I don't really understand that. I would expect it to work regardless, either using the hardware features available on the node you're compiling on or falling back to SSE2. I test on CentOS personally and Travis checks Ubuntu, but I don't know about Debian/RedHat.

Sorry, I'm trying to catch a conference deadline and so I'm a bit slow to help this week.

Troubleshooting -- are you using the release/linux/*gz binaries, not the release/osx/*gz ones? I compiled those on CentOS.

mihkelvaher commented 4 years ago

I've finally managed to compile dashing the intended way and it seems to be working!

In the beginning, I tried to compile dashing in a Singularity container, which resulted in the bonsai issue described above. I do all of the container building on my OSX laptop because creating a container needs admin privileges. After the comment about "compiling on the node", I tried to simply make dashing on the HPC, but it had an older version of gcc. Since I was already using containers, I installed a newer version of gcc into the container and tried to compile both inside it and through it, but I always got the bonsai issue.

Finally, I remembered that the cluster offers multiple versions of programs. Loading gcc-9.1.0 and compiling with it solved everything.

For my part, the issue can be closed, though it is a bit odd that compiling in a container fails.

dnbaker commented 4 years ago

That's strange. I wonder: did you install zlib1g (or whatever the zlib package for your container is)?

Olga Botvinnik provided this Dockerfile a while back:

FROM ubuntu:16.04
MAINTAINER olga.botvinnik@czbiohub.org

WORKDIR /tmp

USER root

# Install basics
ENV PACKAGES git make ca-certificates zlib1g-dev build-essential curl wget cmake apt-utils

### don't modify things below here for version updates etc.

WORKDIR /home

RUN apt-get update && \
    apt-get install -y --no-install-recommends ${PACKAGES} && \
    apt-get clean

# Add add-apt-repository function
RUN apt-get update
RUN apt-get install -y software-properties-common

# Install gcc6 specifically
RUN add-apt-repository ppa:ubuntu-toolchain-r/test
RUN apt-get update && apt-get install -y g++-6
RUN g++ --version

# Install
RUN update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-6 60 --slave /usr/bin/g++ g++ /usr/bin/g++-6

WORKDIR /
RUN git clone https://github.com/dnbaker/dashing/
WORKDIR /dashing
RUN pwd
RUN make update dashing
RUN cp /dashing/dashing /bin

# Test that getting help on dashing command works
RUN dashing -h

WORKDIR /

I haven't personally used Singularity, but I wonder if it might contain any pointers.

mihkelvaher commented 4 years ago

This indeed gave the needed hint (though some other problem occurred).

I made a rookie mistake in thinking the problem lay somewhere other than a missing zlib, because 1) yum said that zlib was already installed and 2) bonsai had a zlib directory. While zlib itself existed, the devel package, which provides headers such as zlib.h, didn't. Running yum -y install zlib-devel.x86_64 did the trick.
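Why the bundled zlib/zlib.h did not help can be shown with a small lookup sketch. The failing command passes -I. and several other -I directories, but not the zlib directory itself, so `#include <zlib.h>` can only be satisfied by a system include directory. The code below is a simplified model of that search and not the compiler's full algorithm, and the function names are hypothetical.

```python
import shlex
from pathlib import Path


def include_dirs(compile_command):
    """Collect the -I directories from a compiler command line, in order."""
    tokens = shlex.split(compile_command)
    dirs = []
    for i, tok in enumerate(tokens):
        if tok == "-I" and i + 1 < len(tokens):
            dirs.append(tokens[i + 1])   # form: -I dir
        elif tok.startswith("-I") and len(tok) > 2:
            dirs.append(tok[2:])         # form: -Idir
    return dirs


def find_header(header, search_dirs, root="."):
    """Return the first search_dir (relative to root) containing header, else None."""
    for d in search_dirs:
        candidate = Path(root) / d / header
        if candidate.is_file():
            return candidate
    return None
```

With the directories from the log, `find_header("zlib.h", ...)` finds nothing until either a system directory that holds the devel headers or the bundled zlib directory is added to the search list.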

Dashing compiled and dist shows the help, but unfortunately, the input files are not recognized:

Dashing version: v0.4.2
terminate called after throwing an instance of 'std::runtime_error'
  what():  [bonsai/include/bonsai/encoder.h:void bns::Encoder<ScoreType>::for_each(const Functor&, const char*, kseq_t*) [with Functor = bns::dist_sketch_and_cmp(const std::vector<std::__cxx11::basic_string<char> >&, std::vector<sketch::hk::HeavyKeeper<6, 10, bns::SeededHash<sketch::hash::WangHash> > >&, bns::KSeqBufferHolder&, FILE*, FILE*, bns::Spacer, unsigned int, unsigned int, sketch::hll::EstimationMethod, sketch::hll::JointEstimationMethod, bool, bns::EmissionType, bns::EmissionFormat, bool, unsigned int, bool, std::__cxx11::string, std::__cxx11::string, bool, bool, std::__cxx11::string, std::size_t, bns::EncodingType) [with SketchType = sketch::hll::hllbase_t<>; FILE = _IO_FILE; std::__cxx11::string = std::__cxx11::basic_string<char>; std::size_t = long unsigned int]::<lambda(const char*)>::<lambda(bns::u64)>; ScoreType = bns::score::Lex]435] Could not open file at testfastas/131_Escherichia_coli_JJ1886_uid226103_NC_022648.fna. Abort!

Aborted (core dumped)

Same message with both .fna and .fna.gz.

Trying to be smarter this time, I created an Ubuntu container by translating the Dockerfile into a Singularity file so that no dependency would be left out. Same result : /

The good news is that Dashing works on the HPC. The problem was that it had initially been compiled on another node with different processors.

dnbaker commented 4 years ago

Great. Does it have permission to open that file? This error is thrown when it can't open a handle to the file.

mihkelvaher commented 4 years ago

Chmoding 777 all of the fastas, the fasta dir and even the container image still results in the same error. The idea might have some merit, because going into the container and creating some dummy fastas gives a result, but overall I think it's not worth exploring further.

As Dashing needs to be compiled on the same machine it will run on, containers have lost their point for me, because I can only build containers on my laptop and then run them on the server (which gives the Illegal instruction message). Containers would still help when there's a problem with compiling (for example, when a newer version of gcc can't be used on the host). I just tried out this approach and it works. For anyone interested in the "CompilerContainer", here's the Singularity recipe:
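For anyone who does want to dig further, it may help to separate "the path is not visible from inside the container" from "the path exists but cannot be read", since chmod only addresses the second case. The sketch below is a generic check meant to be run from inside the container. It does not use any Dashing or Singularity API, and the function name is hypothetical.

```python
import os


def diagnose(path):
    """Classify why a program might fail to open path for reading."""
    if not os.path.exists(path):
        return "missing"              # not visible from this environment
    if not os.path.isfile(path):
        return "not a regular file"
    if not os.access(path, os.R_OK):
        return "not readable"         # permission problem; chmod may help
    return "ok"
```

A result of "missing" for a file that exists on the host would point at how host directories are exposed to the container rather than at file permissions.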

Bootstrap: docker
From: centos
%post 
    yum -y groupinstall "Development Tools" 
    yum -y install git gcc-c++ zlib-devel.x86_64

    # Uncomment this if you want to install dashing into the container
    # mkdir -pv /usr/local/bin/build && cd /usr/local/bin/build && git clone --recursive --single-branch --branch v0.4.2 https://github.com/dnbaker/dashing && cd dashing && make dashing && mv -v dashing /usr/local/bin/
%environment
    #nothing here currently
%runscript
    echo "run specific command, nothing here"

Thanks for the help! The initial results look promising and there are a couple of questions but I'll create a separate issue for that.