DavHau / mach-nix

Create highly reproducible python environments
MIT License

getting tensorflow to work on aarch64-linux #240

Closed mschwaig closed 3 years ago

mschwaig commented 3 years ago

I'm trying to get TensorFlow to work inside a Nix-based environment on an aarch64-linux system by using mach-nix.

I created a flake for this, which works fine on x86_64-linux but cannot resolve all of its dependencies on aarch64-linux. I'm not sure how to best deal with those missing dependencies and I was hoping someone can point me in the right direction.

The specific system I am trying this on is a Jetson Nano running Nvidia's Ubuntu-based distro.

I have created a repo with a minimal testcase for running TensorFlow here: https://github.com/mschwaig/tensorflow-hello-world-nix-flake.

This works on x86_64-linux and produces a properly runnable build, but on aarch64-linux both nix build and nix develop fail like this:

mschwaig@mschwaig-lamb:~/tensorflow-hello-world-nix-flake$ nix build  
warning: unknown setting 'experimental-features'
builder for '/nix/store/vfrbnxpz9zwzllhrp96zjngz6681wlr9-python3.8-google-auth-1.24.0.drv' failed with exit code 1; last 10 log lines:
  installing
  Executing pipInstallPhase
  /build/google-auth-1.24.0/dist /build/google-auth-1.24.0
  Processing ./google_auth-1.24.0-py2.py3-none-any.whl
  Requirement already satisfied: rsa<5,>=3.1.4 in /nix/store/zfhs95j11lardkrmdnzdga2k804j0d7s-python3.8-rsa-4.6/lib/python3.8/site-packages (from google-auth==1.24.0) (4.6)
  Requirement already satisfied: setuptools>=40.3.0 in /nix/store/a3npn2p0l2xpmsmcc8lnihr3kaipzws9-python3.8-setuptools-50.3.1/lib/python3.8/site-packages (from google-auth==1.24.0) (50.3.1.post0)
  Requirement already satisfied: cachetools<5.0,>=2.0.0 in /nix/store/dridwi27k9r0wk89kxj9jw8ffm06lcpb-python3.8-cachetools-4.2.1/lib/python3.8/site-packages (from google-auth==1.24.0) (4.2.1)
  Requirement already satisfied: pyasn1-modules>=0.2.1 in /nix/store/dp3lp9c61qbnr11v1zwqq24sszxrbisi-python3.8-pyasn1-modules-0.2.8/lib/python3.8/site-packages (from google-auth==1.24.0) (0.2.8)
  ERROR: Could not find a version that satisfies the requirement six>=1.9.0 (from google-auth)
  ERROR: No matching distribution found for six>=1.9.0
cannot build derivation '/nix/store/a8dwz3kbczp5j0rfvkj9rmfc4km159xa-python3-3.8.7-env.drv': 1 dependencies couldn't be built
cannot build derivation '/nix/store/rv7rlh52q4bz8k3p34jbr19ra12fy168-python3.8-google-auth-oauthlib-0.4.2.drv': 1 dependencies couldn't be built
cannot build derivation '/nix/store/c97sk5wlzkxxagdhnvzk9vdrrsb250v5-python3.8-tensorflow-tensorboard-2.4.0.drv': 1 dependencies couldn't be built
cannot build derivation '/nix/store/6x9pd7v6zxpwy2ahx7h1li2arxq5mgz3-tensorflow-2.4.0.drv': 1 dependencies couldn't be built
cannot build derivation '/nix/store/jyk69capr5j8pn74d3jwgvdi598d7g7a-python3.8-tensorflow-2.4.0.drv': 1 dependencies couldn't be built
cannot build derivation '/nix/store/li75irpidlfl9dml8f2j86z0r7gvdbk6-python3.8-tensorflow-hello-world-0.1.0.drv': 1 dependencies couldn't be built
error: --- Error ------------------------------------------------------------------------------------- nix
build of '/nix/store/li75irpidlfl9dml8f2j86z0r7gvdbk6-python3.8-tensorflow-hello-world-0.1.0.drv' failed
mschwaig@mschwaig-lamb:~/tensorflow-hello-world-nix-flake$
DavHau commented 3 years ago

Hey, I'm not quite sure yet why the six dependency gets lost (I will have a deeper look), but there is a simple workaround. Just add six >= 1.9.0 to your requirements and add it as a dependency of google-auth via an override:

mach-nix.mkPython {
  ...
  _.google-auth.propagatedBuildInputs.mod = pySelf: _: oldVal: oldVal ++ [ pySelf.six ];
}
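
Put together, the workaround might look like the sketch below. It assumes `mach-nix` is passed in as an argument and that `tensorflow` is the only other requirement; both are illustrative assumptions, not taken from the reporter's flake.

```nix
# Sketch only: `mach-nix` is assumed to be provided by the caller,
# and the requirements list is an illustrative placeholder.
{ mach-nix }:
mach-nix.mkPython {
  requirements = ''
    tensorflow
    six >= 1.9.0
  '';
  # Append six to the dependencies that nixpkgs declares for google-auth.
  _.google-auth.propagatedBuildInputs.mod = pySelf: _: oldVal: oldVal ++ [ pySelf.six ];
}
```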
mschwaig commented 3 years ago

Oh, wow. That's a great snippet and it did solve that problem, thanks.

I'm now stuck at two illegal instruction errors, but I have not had the chance to look at the core dump yet and I'm not sure what's going on. It could be this issue or something similar, where the python binary would need some environment variable to be set to select the appropriate codepath.

Please feel free to close this issue if you feel like mach-nix's issue tracker is not an appropriate place to dig deeper into this, I would totally understand. Thanks!

mschwaig@mschwaig-lamb:~/tensorflow-hello-world-nix-flake$ nix build --keep-going
warning: Git tree '/home/mschwaig/tensorflow-hello-world-nix-flake' is dirty
warning: unknown setting 'experimental-features'
builder for '/nix/store/3ndjr70zf64m26f4aa21v20sgpfbk0jd-python3.8-h5py-3.1.0.drv' failed with exit code 132; last 10 log lines:
  creating build/lib.linux-aarch64-3.8/h5py/tests/test_vds
  copying h5py/tests/test_vds/test_highlevel_vds.py -> build/lib.linux-aarch64-3.8/h5py/tests/test_vds
  copying h5py/tests/test_vds/test_virtual_source.py -> build/lib.linux-aarch64-3.8/h5py/tests/test_vds
  copying h5py/tests/test_vds/__init__.py -> build/lib.linux-aarch64-3.8/h5py/tests/test_vds
  copying h5py/tests/test_vds/test_lowlevel_vds.py -> build/lib.linux-aarch64-3.8/h5py/tests/test_vds
  copying h5py/tests/data_files/vlen_string_s390x.h5 -> build/lib.linux-aarch64-3.8/h5py/tests/data_files
  copying h5py/tests/data_files/vlen_string_dset.h5 -> build/lib.linux-aarch64-3.8/h5py/tests/data_files
  copying h5py/tests/data_files/vlen_string_dset_utc.h5 -> build/lib.linux-aarch64-3.8/h5py/tests/data_files
  running build_ext
  /nix/store/ic98d5gbvxhjklifqg9gqnan3h1hkw2r-setuptools-setup-hook/nix-support/setup-hook: line 17:    22 Illegal instruction     (core dumped) /nix/store/a82rn0d51xyr47zad9abp0dihblzb9gk-python3-3.8.7/bin/python3.8 nix_run_setup bdist_wheel
builder for '/nix/store/gnv8z1q5zbnv758q0bckzq6myxq174kk-python3.8-scipy-1.6.0.drv' failed with exit code 132; last 10 log lines:
  unpacking source archive /nix/store/7zg9gv304kbvga11dm3a66m1ik40iclk-scipy-1.6.0.tar.gz
  source root is scipy-1.6.0
  setting SOURCE_DATE_EPOCH to timestamp 1609369918 of file scipy-1.6.0/PKG-INFO
  patching sources
  updateAutotoolsGnuConfigScriptsPhase
  configuring
  no configure script, doing nothing
  building
  Executing setuptoolsBuildPhase
  /nix/store/ic98d5gbvxhjklifqg9gqnan3h1hkw2r-setuptools-setup-hook/nix-support/setup-hook: line 17:    24 Illegal instruction     (core dumped) /nix/store/a82rn0d51xyr47zad9abp0dihblzb9gk-python3-3.8.7/bin/python3.8 nix_run_setup build_ext --fcompiler='gnu95' bdist_wheel
cannot build derivation '/nix/store/rzrrqm7yybhhv437543d6vhf1b60yafh-python3.8-Keras_Preprocessing-1.1.2.drv': 1 dependencies couldn't be built
builder for '/nix/store/hz5y61x87d23b41jrs6i74jrws50pdyc-python3.8-tensorflow-tensorboard-2.4.0.drv' failed with exit code 132; last 10 log lines:
  Rewriting #!/nix/store/a82rn0d51xyr47zad9abp0dihblzb9gk-python3-3.8.7/bin/python3.8 to #!/nix/store/a82rn0d51xyr47zad9abp0dihblzb9gk-python3-3.8.7
  wrapping `/nix/store/6f3knyk4zax74bwwf1kwzm2lfr384knw-python3.8-tensorflow-tensorboard-2.4.0/bin/tensorboard'...
  Executing pythonRemoveTestsDir
  Finished executing pythonRemoveTestsDir
  pythonCatchConflictsPhase
  pythonRemoveBinBytecodePhase
  pythonImportsCheckPhase
  Executing pythonImportsCheckPhase
  Check whether the following modules can be imported: tensorboard tensorboard.backend tensorboard.compat tensorboard.data tensorboard.plugins tensorboard.summary tensorboard.util
  /nix/store/4w6dxpvsgip2djk7b6s28xjqkhpk14s1-python-imports-check-hook.sh/nix-support/setup-hook: line 9:   710 Illegal instruction     (core dumped) /nix/store/a82rn0d51xyr47zad9abp0dihblzb9gk-python3-3.8.7/bin/python3.8 -c 'import os; import importlib; list(map(lambda mod: importlib.import_module(mod), os.environ["pythonImportsCheck"].split()))'
cannot build derivation '/nix/store/13q1dd1ikahphsfg6dyz69zajv802ryj-python3-3.8.7-env.drv': 4 dependencies couldn't be built
cannot build derivation '/nix/store/l8gvlk3pk3d3qf972wczwli868nrrgr7-tensorflow-2.4.0.drv': 1 dependencies couldn't be built
cannot build derivation '/nix/store/jjf6ajgzi6vbapmjiljpqr8yhpl8nvd7-python3.8-tensorflow-2.4.0.drv': 4 dependencies couldn't be built
cannot build derivation '/nix/store/8kk2yxam2c8xl6agc4xjyzaxwp23m6d0-python3.8-tensorflow-hello-world-0.1.0.drv': 1 dependencies couldn't be built
error: --- Error ------------------------------------------------------------------------------------- nix
build of '/nix/store/8kk2yxam2c8xl6agc4xjyzaxwp23m6d0-python3.8-tensorflow-hello-world-0.1.0.drv' failed
DavHau commented 3 years ago

I'd like mach-nix to have good support for aarch64, therefore I'm definitely interested in fixing this. But I'm not sure how much time I can allocate for this. If you could dig a bit deeper, I would appreciate it. Does the tensorflow package from nixpkgs (without using mach-nix) work on aarch64?

DavHau commented 3 years ago

I just found the original problem. google-auth depends on six and the google-auth package in nixpkgs doesn't declare a dependency on six. On hydra, it builds, because tests are enabled and another sub-dependency declares six as a checkInput. But mach-nix disables tests by default and therefore six doesn't end up in the build environment.

This should be fixed in nixpkgs, but we could also consider improving mach-nix to fix such mistakes automatically.

mschwaig commented 3 years ago

> I'd like mach-nix to have good support for aarch64, therefore I'm definitely interested in fixing this. But I'm not sure how much time I can allocate for this. If you could dig a bit deeper, I would appreciate it. Does the tensorflow package from nixpkgs (without using mach-nix) work on aarch64?

I don't know. I'll have to try it or ask on IRC.

EDIT: I want to get this working on aarch64, so I can look into it, but I cannot do it right away.

mschwaig commented 3 years ago

> I just found the original problem. google-auth depends on six and the google-auth package in nixpkgs doesn't declare a dependency on six. On hydra, it builds, because tests are enabled and another sub-dependency declares six as a checkInput. But mach-nix disables tests by default and therefore six doesn't end up in the build environment.
>
> This should be fixed in nixpkgs, but we could also consider improving mach-nix to fix such mistakes automatically.

I think it's great to automatically find these kinds of discrepancies so that they can be fixed, but an automated fix that just makes the problem go away entirely sounds like it could keep those issues buried and create a discrepancy between the written requirements and what actually happens.

mschwaig commented 3 years ago

I really like mach-nix and the examples.md it provides, and that got me very far without being an expert in either Nix or Python. I will go a bit into how I looked into the first problem here in case it helps as a user's perspective.

This is the first set of issues that I ran into where I did not know how to approach the problem on my own. Maybe I'm missing knowledge about how to inspect the dependency tree that mach-nix constructs. For example, I looked through the relevant derivations and files in the store, and I also searched pypi-deps-db manually for the relevant dependencies, but that did not help me figure out how things should fit together. From reading your comments on other issues, I have the feeling I maybe should have used nix repl.

DavHau commented 3 years ago

Thanks for your PR. Why is it a draft?

Debugging dependency trees in nix isn't an easy thing. The nix cmdline tool has why-depends, but it only works for packages that build successfully, and it also doesn't show you why a package ended up in the closure.

While debugging this problem yesterday, I decided to create my own helper library. Check out https://github.com/DavHau/nix-toolbox if you like. There is not a lot inside yet, but it has a function whyDepends. With this you can see why six ends up in the closure of google-auth.

I'm also planning on releasing a new mach-nix version, which makes it a bit simpler to get ahold of the underlying generated nix expression and dependency tree.

Other than that, whenever I need to debug the python code of mach-nix I use ./debug/debug.py, which executes mach-nix outside of nix-build, so you can hook in a debugger.

DavHau commented 3 years ago

> > This should be fixed in nixpkgs, but we could also consider improving mach-nix to fix such mistakes automatically.
>
> I think it's great to automatically find these kinds of discrepancies so that they can be fixed, but an automated fix that just makes the problem go away entirely sounds like it could keep those issues buried and create a discrepancy between the written requirements and what actually happens.

BTW, I just noticed that mach-nix does in fact attempt to fix missing dependencies automatically (see this function).

The reason it didn't work in your case is that the pypiData used is too old to contain the recent tensorflow 2.4.0 found in nixpkgs. If you update the flake input pypi-deps-db of mach-nix to a newer version, it should work without having to fix it manually.

But I also included a permanent fix for google-auth now.

mschwaig commented 3 years ago

> Thanks for your PR. Why is it a draft?

I only verified that this indeed fixed my specific issue and that google-auth still builds afterwards, but I could not get it to fail building without the fix yet. I can remove the draft flag if you think it's fine like that.

> Debugging dependency trees in nix isn't an easy thing. The nix cmdline tool has why-depends, but it only works for packages that build successfully, and it also doesn't show you why a package ended up in the closure.
>
> While debugging this problem yesterday, I decided to create my own helper library. Check out https://github.com/DavHau/nix-toolbox if you like. There is not a lot inside yet, but it has a function whyDepends. With this you can see why six ends up in the closure of google-auth.

I just tried it and this looks really interesting. Thanks for publishing it! I will test it some more on my future problems.

> I'm also planning on releasing a new mach-nix version, which makes it a bit simpler to get ahold of the underlying generated nix expression and dependency tree.
>
> Other than that, whenever I need to debug the python code of mach-nix I use ./debug/debug.py, which executes mach-nix outside of nix-build, so you can hook in a debugger.

I had not seen that.

mschwaig commented 3 years ago

> > > This should be fixed in nixpkgs, but we could also consider improving mach-nix to fix such mistakes automatically.
> >
> > I think it's great to automatically find these kinds of discrepancies so that they can be fixed, but an automated fix that just makes the problem go away entirely sounds like it could keep those issues buried and create a discrepancy between the written requirements and what actually happens.
>
> BTW, I just noticed that mach-nix does in fact attempt to fix missing dependencies automatically (see this function).
>
> The reason it didn't work in your case is that the pypiData used is too old to contain the recent tensorflow 2.4.0 found in nixpkgs. If you update the flake input pypi-deps-db of mach-nix to a newer version, it should work without having to fix it manually.
>
> But I also included a permanent fix for google-auth now.

Oh, yeah. It adds dependencies that it got from other providers.

Does it only do that for nixpkgs? I wonder if that logic adding something is always a bug in nixpkgs or if there are intentional discrepancies sometimes.

It's also interesting to think that this logic is part of the interface.

In this case something this logic does should land in nixpkgs.

DavHau commented 3 years ago

> Does it only do that for nixpkgs? I wonder if that logic adding something is always a bug in nixpkgs or if there are intentional discrepancies sometimes.

It only does that for nixpkgs. For sdist or wheel providers there is no need to fix anything, since the dependencies are taken from the database directly. It's the best information we have. If a package doesn't declare its dependencies on pypi, there is not much we can do, other than including a custom patch for it.

In nixpkgs there are also intentional discrepancies. The focus in nixpkgs lies to some extent on reducing the number of different package versions. Therefore, for some packages, wrong library versions plus patches are used. But since mach-nix modifies the package set significantly, we cannot tell if all the hacks inside nixpkgs still work correctly. I think it's better to replace all deps with correct versions.

> In this case something this logic does should land in nixpkgs.

I think the thing that could be improved in nixpkgs is to clearly separate building from testing into two separate derivations, so that test-time deps are not part of the build. In that case, disabling tests would not change anything about the build.

DavHau commented 3 years ago

I'm currently thinking about what we could do to make it clearer to the user when mach-nix cannot find the dependencies of a nixpkgs package in the pypi data. I have the following in mind:

  1. Just print a warning to stdout during build. (But that might be easy to overlook)
  2. Make the build fail with an error message but allow the user to ignore the error by setting some flag
  3. Fail immediately whenever nixpkgs is newer than pypiData and request the user to update the pypiData revision. Allow the user to ignore this via some flag

I think option 3 would probably be the best approach, since an outdated pypiData is usually the reason for this problem.
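
As a rough illustration of option 3, the check could look something like the Nix sketch below. Every name in it (`nixpkgsDate`, `pypiDataDate`, `ignoreOutdatedPypiData`) is hypothetical and does not exist in mach-nix; the sketch also assumes both dates are available as ISO `YYYY-MM-DD` strings, which compare correctly as plain strings.

```nix
# Hypothetical sketch of option 3; none of these names exist in mach-nix.
{ nixpkgsDate                     # e.g. "2021-02-20": date of the nixpkgs revision
, pypiDataDate                    # e.g. "2021-01-05": date of the pypi-deps-db revision
, ignoreOutdatedPypiData ? false  # escape hatch for users who accept the risk
}:
if pypiDataDate < nixpkgsDate && !ignoreOutdatedPypiData
then throw ''
  pypiData (${pypiDataDate}) is older than nixpkgs (${nixpkgsDate}).
  Update the pypiData revision, or set ignoreOutdatedPypiData = true.
''
else true
```

Failing early like this would surface the stale data before any build starts, while the flag keeps the check from blocking users who pin old data on purpose.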

mschwaig commented 3 years ago

For people using mach-nix from a flake it probably helps a lot to use the following pattern to avoid having an outdated version of pypi-deps-db in the first place.

Since pypi-deps-db is an input to the mach-nix flake, you can add pypi-deps-db as an explicit input to your own flake and keep it up to date, so that dependency resolution relies on the latest available version information.

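{
  ...
  inputs = {
    nixpkgs.url = "github:nixos/nixpkgs/nixpkgs-unstable";
    # Track pypi-deps-db as a top-level input so that `nix flake update` bumps it.
    pypi-deps-db = {
      url = "github:DavHau/pypi-deps-db";
      flake = false;
    };
    mach-nix = {
      url = "github:DavHau/mach-nix/3.1.1";
      # Make mach-nix use the pypi-deps-db input declared above.
      inputs.pypi-deps-db.follows = "pypi-deps-db";
    };
  };
  ...
}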

Of course this only helps users that are using flakes, but it does the right thing for nix flake update, which is quite nice. And I think it's a somewhat natural way to express that you want to keep a transitive dependency up to date. I also use it to keep nixpkgs and flake-utils up to date in my mach-nix projects.

PS: I have not had time to look into the remaining aarch64-linux issues yet.
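
For setups without flakes, a rough equivalent is to pin the pypi-deps-db revision when importing mach-nix. The sketch below assumes the `pypiDataRev` and `pypiDataSha256` import arguments described in the mach-nix README; the revision and hash are placeholders that have to be filled in, and the requirements list is only illustrative.

```nix
let
  mach-nix = import (builtins.fetchGit {
    url = "https://github.com/DavHau/mach-nix";
    ref = "refs/tags/3.1.1";
  }) {
    # Placeholders: use a recent pypi-deps-db commit and its matching hash.
    pypiDataRev = "<pypi-deps-db commit>";
    pypiDataSha256 = "<sha256 of that revision>";
  };
in
mach-nix.mkPython {
  # Illustrative requirements only.
  requirements = ''
    tensorflow
  '';
}
```

Unlike the flake input, this pin has to be bumped by hand.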

mschwaig commented 3 years ago

The cause of the illegal instruction errors I described above https://github.com/DavHau/mach-nix/issues/240#issuecomment-785874276 is indeed the same as described in the linked numpy issue.

Effectively, things that depend on nixpkgs's openblas, like nixpkgs's numpy, probably fail with an illegal instruction error on aarch64 right now.

This should be fixed when https://github.com/NixOS/nixpkgs/pull/117004 lands in nixpkgs as it fixes the bad machine code that causes the issue.

An alternative workaround: setting `OPENBLAS_CORETYPE=ARMV8` explicitly prevents the broken code from being run, which bypasses the issue. I have not found a good way to apply that workaround such that the appropriate process sees that environment variable, it is only applied on the appropriate platform, and it does not require rebuilding pretty much everything downstream of openblas anyway.

I'm building tensorflow on my Jetson Nano right now. I will update this issue when I know if things are working now.
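
For interactive use only, one way to set the variable per platform is a development shell, as in the sketch below. It relies on `pkgs.mkShell` exporting extra attributes as environment variables; `pkgs.python3` is just a placeholder for the real environment. This does not reach the sandboxed builds that failed above, so it is not the missing general solution.

```nix
# Sketch: export OPENBLAS_CORETYPE in a dev shell, on aarch64 only.
{ pkgs ? import <nixpkgs> { } }:

pkgs.mkShell ({
  # Placeholder; substitute the Python environment actually in use.
  buildInputs = [ pkgs.python3 ];
} // pkgs.lib.optionalAttrs pkgs.stdenv.hostPlatform.isAarch64 {
  # Extra mkShell attributes become environment variables inside the shell.
  OPENBLAS_CORETYPE = "ARMV8";
})
```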

mschwaig commented 3 years ago

Since https://github.com/NixOS/nixpkgs/pull/117004 was merged to staging and then eventually to master, the test project in my repo now builds. It takes quite some time to build tensorflow for aarch64-linux though. I have not checked again whether I can get a binary artifact from some provider.

I think this can be closed now so I'm doing that. Thanks for your help @DavHau.