amazon-archives / amazon-dsstne

Deep Scalable Sparse Tensor Network Engine (DSSTNE) is an Amazon developed library for building Deep Learning (DL) machine learning (ML) models
Apache License 2.0

build error due to mismatch libnetcdf-c++4 API #226

Open jeng1220 opened 4 years ago

jeng1220 commented 4 years ago

It threw:

/usr/include/ncException.h:27:7: note:   candidate expects 3 arguments, 4 provided
NNNetwork.cpp:4001:141: error: no matching function for call to 'netCDF::exceptions::NcException::NcException(const char [12], std::__cxx11::basic_string<char>, const char [14], int)'
                 throw NcException("NcException", "NNetwork::NNetwork: No weights supplied in NetCDF input file " + fname, __FILE__, __LINE__);

The available constructors of NcException in libnetcdf-c++4 are:

    class NcException : public std::exception {
    public:
      NcException(const char* complaint, const char* fileName, int lineNumber);
      NcException(int errorCode, const char* complaint, const char* fileName, int lineNumber);

There is no constructor that accepts arguments like (const char*, const std::string&, const char*, int).

scottlegrand commented 4 years ago

Yeah, netcdf is a bad actor: it arbitrarily changes API functionality and syntax and provides no macros to detect the version. I don't have a fix for this. The build remains similarly broken under Ubuntu 14 because of a netcdf bug, which may or may not be fixed by now, that forced a change in syntax for netcdf variables.

In this case, it looks like they changed the NcException API from the version that ships with Ubuntu 16.04. And that makes me wonder what else might have broken in the process.

scottlegrand commented 4 years ago

4.3 (which ships with 18.04) defines NcException as above, but 4.2 (which ships with 16.04) defines it as: NcException(const std::string& exceptionName,const std::string& complaint,const char* fileName,int lineNumber);

This is a trivial fix if we can find a way to detect the netcdf version at compile time. Googling so far has not been of much help.

spacelover1 commented 4 years ago

I don't know why, but the problem still exists and I'm getting the same error:

/usr/include/ncException.h:26:7: note: candidate: netCDF::exceptions::NcException::NcException(const string&, const string&, const char*, int)
       NcException(const std::string& exceptionName,const std::string& complaint,const char* fileName,int lineNumber);
       ^
/usr/include/ncException.h:26:7: note:   candidate expects 4 arguments, 3 provided

That and:

 #define NC_EXCEPTION(errorStr, msg, filename, line) NcException(std::string(msg).c_str(), filename, line)
                                                                                                         ^
NNLayer.cpp:3322:27: note: in expansion of macro 'NC_EXCEPTION'
                     throw NC_EXCEPTION("NcException", "NNLayer::NNLayer: No skip attributes supplied in NetCDF input file " + fname, __FILE__, __LINE__);
                           ^
In file included from NcExcptionWrap.h:2:0

Everything is fine up to step 13/15, and then these errors happen.

By the way, I have also changed the cub version in the Dockerfile, since it didn't work when I installed it manually. I also commented out the lines related to cmake and installed cmake manually:

# Add repositories and install base packages
RUN apt-get update && \
    apt-get install -y build-essential libcppunit-dev libatlas-base-dev pkg-config$
        software-properties-common unzip wget && \
#   add-apt-repository ppa:george-edison55/cmake-3.x && \
    apt-get update && \
#    apt-get install -y cmake && \
    apt-get clean

So, are there any steps that I'm missing, or any changes that should be applied to some files?

I'm not sure if this line should be changed or not:

FROM nvidia/cuda:9.1-cudnn7-devel-ubuntu16.04

I ask because I'm using Ubuntu 18.04. GPU: RTX 2060.

I've been trying to build this engine for a while, so any comments would be helpful. Thanks :)

jeng1220 commented 4 years ago

@spacelover1,

FROM nvidia/cuda:9.1-cudnn7-devel-ubuntu16.04
                                         ^^^^

I guess you built DSSTNE in a container based on Ubuntu 16.04. As @scottlegrand mentioned, the default version of netCDF for Ubuntu 16.04 is v4.2, which is too old for DSSTNE.

Thus, you need to upgrade netCDF to v4.3, or use a Docker image based on Ubuntu 18.04 and then install the default netCDF.

jeng1220 commented 4 years ago

I think setup.md needs to be updated to remind users to use netCDF v4.3.

spacelover1 commented 4 years ago

Thanks @jeng1220

The netCDF version on my system is 4.6. I couldn't downgrade it because every install fetches the latest available version, and I'm trying to build this using the Docker image on my system. Here's the netCDF version:

$ nc-config --version
netCDF 4.6.0

I'm still getting the same error. Here are the last lines:

/usr/include/ncException.h:26:7: note:   candidate expects 4 arguments, 3 provided
/usr/include/ncException.h:24:11: note: candidate: netCDF::exceptions::NcException::NcException(const netCDF::exceptions::NcException&)
     class NcException : public std::exception {
           ^
/usr/include/ncException.h:24:11: note:   candidate expects 1 argument, 3 provided
Makefile:58: recipe for target '/opt/amazon/dsstne/build/tmp/engine/cpp/NNLayer.o' failed
make[1]: Leaving directory '/opt/amazon/dsstne/src/amazon/dsstne/engine'
make[1]: *** [/opt/amazon/dsstne/build/tmp/engine/cpp/NNLayer.o] Error 1
Makefile:13: recipe for target 'engine' failed
make: *** [engine] Error 2
The command '/bin/sh -c cd /opt/amazon/dsstne &&     make install' returned a non-zero code: 2

When I change the first line to FROM nvidia/cuda:10.2-cudnn7-devel-ubuntu18.04, I encounter some other errors. One is mentioned in issue #224, and the other is: mpi.h: No such file or directory #include <mpi.h>, so I changed the env path for openmpi.


UPDATE

Okay so I have changed some lines of the Dockerfile and now I get this error:

************  RELEASE mode ************
mpiCC -O3 -std=c++11 -fPIC -DOMPI_SKIP_MPICXX -MMD -MP -I/usr/local/include -isystem /usr/local/cuda/include -isystem /usr/lib/openmpi/include -isystem /usr/include/jsoncpp -IB40C -IB40C/KernelCommon -I/opt/amazon/dsstne/build/include -I../utils -c NNLayer.cpp -o /opt/amazon/dsstne/build/tmp/engine/cpp/NNLayer.o
In file included from NNLayer.cpp:14:0:
NcExcptionWrap.h:2:25: fatal error: ncException.h: No such file or directory
compilation terminated.
Makefile:58: recipe for target '/opt/amazon/dsstne/build/tmp/engine/cpp/NNLayer.o' failed
make[1]: Leaving directory '/opt/amazon/dsstne/src/amazon/dsstne/engine'
make[1]: *** [/opt/amazon/dsstne/build/tmp/engine/cpp/NNLayer.o] Error 1
Makefile:13: recipe for target 'engine' failed
make: *** [engine] Error 2
The command '/bin/sh -c cd /opt/amazon/dsstne &&     make install' returned a non-zero code: 2

I have no idea how to solve this one; any help would be great.