AccelerateHS / accelerate-tensorflow

BSD 3-Clause "New" or "Revised" License

Build/dependency docs #30

Status: Open. mikesperber opened this issue 1 year ago.

mikesperber commented 1 year ago

We need more detailed docs on what dependencies need to be installed, what the version constraints are, and what library-path issues exist.

Maybe Bianca can add what she knows from her setup here.

tomsmeding commented 1 year ago

For what it's worth, my current build instructions are encoded in this Makefile and the code block in the readme here.

sowilo commented 1 year ago

In my setup, I simply followed the README, installed all the dependencies mentioned there, and passed extra-lib-dirs and extra-include-dirs to stack. With that, I was able to build the package.
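For illustration, such an invocation can look like the sketch below. It assumes a hypothetical install prefix TF_PREFIX for the TensorFlow C library (headers under include/, libraries under lib/) and a hypothetical wrapper name build_with_tf; the two options themselves are stack's own, and the same settings can alternatively be written into stack.yaml under the extra-lib-dirs and extra-include-dirs keys.

```shell
#!/bin/sh
# Hypothetical install prefix of the TensorFlow C library; adjust to your system.
TF_PREFIX="${TF_PREFIX:-/usr/local}"

# Hypothetical wrapper: build with stack, pointing the C toolchain at the
# TensorFlow headers and libraries. Extra arguments are passed through.
build_with_tf() {
  stack build \
    --extra-lib-dirs="$TF_PREFIX/lib" \
    --extra-include-dirs="$TF_PREFIX/include" \
    "$@"
}

# Example:
#   TF_PREFIX=/opt/tensorflow build_with_tf
```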

The tests were trickier. The nofib-tensorflow test suite depends on the accelerate nofib test suite, which is disabled by default. I couldn't figure out how to tell stack to reinstall accelerate with the appropriate flag, so I gave up on that one.

I managed to run the second test suite (nofib-tensorflow-lite) but had to patch the Tensorflow.Lite.Compile module, namely the function tflite_model. The TensorFlow C bindings in version 2.3.0 didn't agree with the TensorFlow version installed on my system, so I had to mask the environment variable LD_LIBRARY_PATH when running the Python process; in my case, passing env = Just [("LD_LIBRARY_PATH", "")] when starting the process did the trick.

I didn't manage to run the whole test suite, though. The tests run for a while but at some point fail with "Failed to allocate TPU tensors". This might be due to limitations of the TPU hardware I used in my experiments (Coral USB Accelerator).
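A side note on that patch: in System.Process, setting env = Just [...] replaces the child's entire environment, so if tflite_model passes exactly that list, Python runs with LD_LIBRARY_PATH as its only variable. For trying converter.py by hand, a narrower shell sketch clears just that one variable and keeps the rest. It assumes an env that supports -u (e.g. GNU coreutils); the helper name without_ld_path is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run a command with LD_LIBRARY_PATH unset, so libraries
# found via that path (e.g. a different TensorFlow build) are not loaded.
# Unlike System.Process's `env = Just [...]`, the rest of the environment is kept.
# Requires an env that supports -u (e.g. GNU coreutils).
without_ld_path() {
  env -u LD_LIBRARY_PATH "$@"
}

# Example (converter.py's real arguments are not shown here):
#   without_ld_path python3 converter.py <args>
```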

tomsmeding commented 1 year ago

Thank you, good to hear that at least you got something working without too many code changes!

> The tests were trickier. The nofib-tensorflow test suite depends on the accelerate nofib test suite, which is disabled by default.

For completeness, because the project was using stack, this would end up in stack.yaml; you'd have needed to add the following block:

flags:
  accelerate:
    nofib: True

I'm not sure why that wasn't already there. Perhaps it's because most of the Accelerate test suite (which nofib-accelerate would run) actually fails, since many primitives are not yet implemented in this TF backend (and perhaps never will be).
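The same flag can also be passed on the stack command line instead of being written into stack.yaml. A minimal sketch; --flag PACKAGE:FLAG is stack's own option, while the wrapper name test_with_nofib is hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper: run the test suites with accelerate's "nofib" flag
# enabled from the command line, leaving stack.yaml untouched.
# Extra arguments (e.g. a specific test target) are passed through.
test_with_nofib() {
  stack test --flag accelerate:nofib "$@"
}
```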

On my fork, the project now uses cabal instead of stack; there, this setting goes in cabal.project (i.e. here).
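For a cabal build, the corresponding setting is a package stanza with a flags field. The sketch below appends it to cabal.project.local, which cabal reads in addition to cabal.project, so a local experiment does not touch the tracked file; the function name enable_accelerate_nofib is hypothetical, and the one-off --constraint form is shown as a comment.

```shell
#!/bin/sh
# Hypothetical helper: enable accelerate's "nofib" flag for the cabal build in
# the current directory by appending a package stanza to cabal.project.local.
enable_accelerate_nofib() {
  cat >> cabal.project.local <<'EOF'
package accelerate
  flags: +nofib
EOF
}

# One-off alternative without editing any file:
#   cabal test --constraint='accelerate +nofib'
```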

> I managed to run the second test suite (nofib-tensorflow-lite) but had to patch the Tensorflow.Lite.Compile module, namely the function tflite_model. The TensorFlow C bindings in version 2.3.0 didn't agree with the TensorFlow version installed on my system, so I had to mask the environment variable LD_LIBRARY_PATH when running the Python process; in my case, passing env = Just [("LD_LIBRARY_PATH", "")] when starting the process did the trick.

On my fork, we now build TF ourselves from a submodule and use that build not only for the Haskell code but also to create the Python virtualenv in which converter.py runs. Hence your LD_LIBRARY_PATH dance should no longer be necessary: everything now uses the same TF version.

> I didn't manage to run the whole test suite, though. The tests run for a while but at some point fail with "Failed to allocate TPU tensors". This might be due to limitations of the TPU hardware I used in my experiments (Coral USB Accelerator).

Does this happen already at the first test, or only halfway through the test suite? It looks exactly like the error we got before we realised we needed to add udev rules to give ourselves access to the USB device: running the test suite under strace showed that it was getting "Permission denied" errors when accessing the USB device.
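A sketch of that strace check. It assumes the Edge TPU is reached through the usual Linux USB device nodes under /dev/bus/usb (as libusb-based code does) and that cabal test stands in for whatever command starts the suite; usb_permission_errors is a hypothetical helper name.

```shell
#!/bin/sh
# First record the file-open syscalls of the test run, following child
# processes (the test command is a placeholder):
#   strace -f -e trace=open,openat -o strace.log cabal test
#
# Hypothetical helper: print the lines of strace log $1 where opening a USB
# device node failed with EACCES ("Permission denied"). Exits non-zero if none.
usb_permission_errors() {
  grep 'EACCES' "$1" | grep '/dev/bus/usb'
}

# Example:
#   usb_permission_errors strace.log
```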

In our case, we have the following:

$ cat /etc/udev/rules.d/99-edgetpu-accelerator.rules
SUBSYSTEM=="usb",ATTRS{idVendor}=="1a6e",GROUP="edgetpu"
SUBSYSTEM=="usb",ATTRS{idVendor}=="18d1",GROUP="edgetpu"

Then make sure your Linux user is in the edgetpu group. The udev rules were already put in place by someone who worked on the project before me, so I'm not sure whether the 1a6e rule is also necessary; our TPU seems to expose vendor ID 18d1.
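The remaining setup can be sketched as follows. The root-only commands are left as comments so the snippet is safe to source; the group name edgetpu matches the rules above, and both helper names are hypothetical. As an unverified assumption, the two vendor IDs may correspond to the device before and after the Edge TPU runtime loads its firmware, which would explain why both rules exist; list_tpu_devices lets you check which ID your TPU currently shows.

```shell
#!/bin/sh
# One-time setup as root, left as comments so this file is safe to source:
#   sudo groupadd -f edgetpu                 # create the group if it is missing
#   sudo usermod -aG edgetpu "$USER"         # add yourself to it
#   sudo udevadm control --reload-rules && sudo udevadm trigger
# Log out and back in afterwards so the new group membership takes effect.

# Hypothetical helper: succeed if user $1 is a member of group $2.
user_in_group() {
  id -nG "$1" | tr ' ' '\n' | grep -qx "$2"
}

# Hypothetical helper: list attached USB devices with either vendor ID from
# the udev rules (lsusb -d takes VENDOR:[PRODUCT]).
list_tpu_devices() {
  lsusb -d 18d1:
  lsusb -d 1a6e:
}

# Example:
#   user_in_group "$(id -un)" edgetpu && echo "group OK"
#   list_tpu_devices
```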

sowilo commented 1 year ago

> Does this happen already at the first test, or only halfway through the test suite? It looks exactly like the error we got before we realised we needed to add udev rules to give ourselves access to the USB device: running the test suite under strace showed that it was getting "Permission denied" errors when accessing the USB device.

The failure usually occurs halfway through the test suite. I'll try adapting the udev rules and give it another go.

tomsmeding commented 1 year ago

@sowilo I've successfully built the repository in a "minimized" Ubuntu Server virtual machine. The required commands can be found in ubuntu-build-instructions.txt.

Note, though, that the commands in that file only cover the dependencies that the Makefile does not build locally itself. I'm not sure that's the granularity you want.

I'm happy to refine these instructions further into a list of dependencies that you can try to put into Nix, but perhaps you're faster at that anyway.