apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0
20.77k stars 6.79k forks source link

fail to build using docker #16753

Closed alopez1327 closed 4 years ago

alopez1327 commented 4 years ago

Description

Tried building library using docker through the command: python3 ci/build.py -p armv7 but compilation failed because it seems docker is pulling some binaries from and old repository (jessie). Tried running the same code with --force-yes with the same result.

Also tried to compile directly on Raspberry pi, but it fails because of insufficient VM, even after I increased it to 4096M. I'll move to cross-compilation next.

Error Message

WARNING: The following packages cannot be authenticated! autoconf E: There are problems and -y was used without --force-yes The command '/bin/sh -c /work/deb_ubuntu_ccache.sh' returned a non-zero code: 100 Traceback (most recent call last): File "ci/build.py", line 454, in sys.exit(main()) File "ci/build.py", line 364, in main num_retries=args.docker_build_retries, no_cache=args.no_cache) File "ci/build.py", line 116, in build_docker run_cmd() File "/Volumes/External/mxnet/ci/util.py", line 81, in f_retry return f(*args, **kwargs) File "ci/build.py", line 114, in run_cmd check_call(cmd) File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/subprocess.py", line 363, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '['docker', 'build', '-f', 'docker/Dockerfile.build.armv7', '--build-arg', 'USER_ID=501', '--build-arg', 'GROUP_ID=20', '--cache-from', 'mxnetci/build.armv7', '-t', 'mxnetci/build.armv7', 'docker']' returned non-zero exit status 100.

To Reproduce

Just run python3 ci/build.py -p armv7

Steps to reproduce

(Paste the commands you ran that produced the error.)

  1. Just run python3 ci/build.py -p armv7

What have you tried to solve it?

  1. Added --force-yes when installing autoconf
  2. Added https package just in case authentication was due to https site
  3. Realized that autoconf may not be available because jessie distribution may be too old?

Environment

We recommend using our script for collecting the diagnositc information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

# paste outputs here

----------Python Info---------- Version : 3.7.5 Compiler : Clang 11.0.0 (clang-1100.0.33.12) Build : ('default', 'Nov 2 2019 10:02:20') Arch : ('64bit', '') ------------Pip Info----------- Version : 19.3.1 Directory : /Volumes/External/pyenv/tvmenv/lib/python3.7/site-packages/pip ----------MXNet Info----------- Version : 1.4.1 Directory : /Volumes/External/pyenv/tvmenv/lib/python3.7/site-packages/mxnet An error occured trying to import mxnet. This is very likely due to missing missing or incompatible library files. Traceback (most recent call last): File "", line 122, in check_mxnet AttributeError: module 'mxnet.util' has no attribute 'get_gpu_count'

----------System Info---------- Platform : Darwin-19.0.0-x86_64-i386-64bit system : Darwin node : Catalyst-RD.local release : 19.0.0 version : Darwin Kernel Version 19.0.0: Thu Oct 17 16:17:15 PDT 2019; root:xnu-6153.41.3~29/RELEASE_X86_64 ----------Hardware Info---------- machine : x86_64 processor : i386 b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz' b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C' b'machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 SMEP BMI2 ERMS INVPCID FPU_CSDS MDCLEAR IBRS STIBP L1DF SSBD' b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT RDTSCP TSCI' ----------Network Test---------- Setting timeout: 10 Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0310 sec, LOAD: 0.4920 sec. Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0005 sec, LOAD: 0.5698 sec. Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.0007 sec, LOAD: 0.0289 sec. Timing for D2L: http://d2l.ai, DNS: 0.0008 sec, LOAD: 0.3076 sec. Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0006 sec, LOAD: 0.1726 sec. Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0007 sec, LOAD: 0.3055 sec. Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0089 sec, LOAD: 1.1580 sec. Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0005 sec, LOAD: 0.1895 sec.

alopez1327 commented 4 years ago

Tried to fix the error by adding: apt update && apt-key update && apt install -y --no-install-recommends

to deb.ubuntu.ccache.sh before autoconf is installed. That sort of worked but now it seems that asciidoc was not installed because:

/usr/bin/gcc -std=gnu99 -g -O2 -Wall -W -Werror -o ccache src/main.o src/args.o src/ccache.o src/cleanup.o src/compopt.o src/conf.o src/counters.o src/execute.o src/exitfn.o src/hash.o src/hashutil.o src/language.o src/lockfile.o src/manifest.o src/mdfour.o src/stats.o src/unify.o src/util.o src/version.o src/getopt_long.o src/hashtable.o src/hashtable_itr.o src/murmurhashneutral2.o src/snprintf.o -lm -lz

leezu commented 4 years ago

I believe ci/build.py is not suppported outside of the CI environment. So I'm closing the issue. Feel free to reopen if you believe it is / should be supported. Do you have any issues with cross-compile?

alopez1327 commented 4 years ago

Its just that from some of the discussions in the forum it was suggested to use this CI capability to build a library for the raspberry Pi. Cross compilation will take me a bit longer to do and since currently the documentation is missing from the master branch I was not sure which was the sanctioned way of generating the library for the Pi.

I did try to use the precompiled version (mxnet-1.5.0-py2.py3-none-any.whl) of the library but it does not support OpenCV.

Ill try cross-compilation then.

Thanks!

leezu commented 4 years ago

@alopez1327 can you link the relevant discussion? Maybe @marcoabreu can clarify how to use the Docker container build setup

marcoabreu commented 4 years ago

We used to show build.py as preferred build method on the devices-installation-dialog, but now it seems like it's gone.

Generally, the script should also work locally since it's dockerize. Especially considering that we're actively running it in CI, the error seems worth investigating. One thing I'm thinking of is that we're not running into it due to our caching, but once our cache breaks, we'll face the same error. Thus, I'd recommend to look into this error since otherwise it could knock off CI if the cache gets invalidated.

samskalicky commented 4 years ago

@zachgk assign @larroy

zachgk commented 4 years ago

@larroy Please comment if you want to take this issue.

larroy commented 4 years ago

I'm quite short with time but I can try to chime in.

larroy commented 4 years ago

I built the latest from master without any problems, did you update submodules?

time ci/build.py -p armv7
...
2019-11-21 22:04:06,620 - root - INFO - Waiting for status of container 8ff0cb9f431b for 600 s.
2019-11-21 22:04:07,090 - root - INFO - Container exit status: {'Error': None, 'StatusCode': 0}
2019-11-21 22:04:07,091 - root - INFO - Container exited with success 👍
2019-11-21 22:04:07,091 - root - INFO - Stopping container: 8ff0cb9f431b
2019-11-21 22:04:07,092 - root - INFO - Removing container: 8ff0cb9f431b
0.71user 0.22system 3:34.27elapsed 0%CPU (0avgtext+0avgdata 65920maxresident)
nerdoid commented 4 years ago

I checked out master and updated the submodules, and encountered the same error originally reported by alopez1327. My steps:

git clone https://github.com/apache/incubator-mxnet
cd incubator-mxnet/
git submodule update --init --recursive
time ci/build.py -p armv7
...
The following packages will be upgraded:
  autoconf
1 upgraded, 1 newly installed, 0 to remove and 84 not upgraded.
Need to get 1169 kB of archives.
After this operation, 2502 kB of additional disk space will be used.
WARNING: The following packages cannot be authenticated!
  autoconf
E: There are problems and -y was used without --force-yes
The command '/bin/sh -c /work/deb_ubuntu_ccache.sh' returned a non-zero code: 100
Traceback (most recent call last):
  File "ci/build.py", line 454, in <module>
    sys.exit(main())
  File "ci/build.py", line 364, in main
    num_retries=args.docker_build_retries, no_cache=args.no_cache)
  File "ci/build.py", line 116, in build_docker
    run_cmd()
  File "/Users/[NAME]/dev/incubator-mxnet/ci/util.py", line 81, in f_retry
    return f(*args, **kwargs)
  File "ci/build.py", line 114, in run_cmd
    check_call(cmd)
  File "/Users/[NAME]/anaconda3/envs/[ENV]/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['docker', 'build', '-f', 'docker/Dockerfile.build.armv7', '--build-arg', 'USER_ID=501', '--build-arg', 'GROUP_ID=20', '--cache-from', 'mxnetci/build.armv7', '-t', 'mxnetci/build.armv7', 'docker']' returned non-zero exit status 100.
      395.68 real         1.28 user         0.96 sys
leezu commented 4 years ago

This also affects the CI.

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fedge/detail/PR-17031/13/pipeline/
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fedge/detail/PR-17031/14/pipeline/
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fedge/detail/PR-17031/15/pipeline/
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fedge/detail/PR-17031/16/pipeline/

My PR updates ubuntu_arm.sh script (https://github.com/apache/incubator-mxnet/pull/17031/commits/dc4e18a61d6c30b80f4265686f7f740aa8182973), thus the cached Docker container can't be reused and is rebuilt. The build fails due to this issue.

leezu commented 4 years ago

@larroy the reason that ci/build.py -p armv7 works is that it uses the cache. Would there be any problem to switch to the updated dockcross images? They recently moved to debian stretch: https://github.com/dockcross/dockcross/commit/aae501313ef206b769b0204067cebd1604029fcd

Where is mxnetcipinned/dockcross-linux-X maintained? I couldn't find it at https://github.com/apache/incubator-mxnet-ci/search?q=mxnetcipinned&unscoped_q=mxnetcipinned

It may be worth considering to directly track dockcross. Thereby we can avoid running into the case where our CI only continues to work due to a cache..

larroy commented 4 years ago

when we tracked dockcross directly it was breaking CI randomly. That's why it's pinned. Anyone can update this via PR.

larroy commented 4 years ago

This is correct, to reproduce container rebuild:

piotr@44-229-42-241:0: ~/mxnet [upstream_master]> time ci/build.py -p armv7 --no-cache 2>&1 | tee armv7.log

leezu commented 4 years ago

As part of https://github.com/apache/incubator-mxnet/issues/16753, the docker containers have been updated and switched to the upstream version. Thus this issue can be closed.

I opened https://github.com/apache/incubator-mxnet/issues/17151 to track pinning the upstream containers again @larroy