esa / torchquad

Numerical integration in arbitrary dimensions on the GPU using PyTorch / TF / JAX
https://www.esa.int/gsp/ACT/open_source/torchquad/
GNU General Public License v3.0
189 stars 40 forks

Tests failing on GPU #178

Closed gomezzz closed 1 year ago

gomezzz commented 1 year ago

Issue

Problem Description

Related to Release 0.4.0, can be fixed directly on release branch.

Currently the tests fail on GPUs: for torch because of a missing transfer from GPU memory to host memory, and for TF because of what seems to be a breaking API change (`ImportError: cannot import name 'np_config' from 'tensorflow.python.ops.numpy_ops'`, see https://stackoverflow.com/questions/75727569/cannot-import-name-np-config-from-tensorflow-python-ops-numpy-ops )
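For illustration, the torch failure is the classic pattern of calling `.numpy()` on a CUDA tensor without moving it to the host first. Below is a hedged sketch of both fixes; the helper name `to_host_numpy` is hypothetical (not torchquad's actual code), and the TF part shows the guarded-import workaround from the linked StackOverflow question:

```python
import numpy as np

def to_host_numpy(tensor):
    """Move a backend tensor to host memory and convert it to NumPy.

    PyTorch raises an error when .numpy() is called on a CUDA tensor,
    so the tensor must first be detached and copied to the host with
    .cpu(). Objects that are already NumPy-compatible pass through.
    """
    if hasattr(tensor, "detach"):
        tensor = tensor.detach()
    if hasattr(tensor, "cpu"):
        tensor = tensor.cpu()
    if hasattr(tensor, "numpy"):
        return tensor.numpy()
    return np.asarray(tensor)

# For the TensorFlow side, newer releases removed the private np_config
# import path; the public replacement is
#   tf.experimental.numpy.experimental_enable_numpy_behavior()
# so a guarded import can keep both old and new TF versions working:
try:
    from tensorflow.python.ops.numpy_ops import np_config  # older TF
    np_config.enable_numpy_behavior()
except ImportError:
    pass  # fall back to the tf.experimental.numpy call on newer TF
```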

Logs:

pytorch_gpu: pytest_gpu0.log

TF_gpu, torch_cpu (failed to get JAX working since this was on a Windows machine): pytest_all_gpu0.log

Setting up an env to check all frameworks on GPU also proved time-consuming (and failed for JAX).

Expected Behavior

What Needs to be Done

How Can It Be Tested or Reproduced

Run pytest; the following env was used:

name: torchquad_all
channels:
  - anaconda
  - conda-forge
  - pytorch
  - nvidia
dependencies:
  - autoray>=0.2.5
  - loguru>=0.5.3
  - matplotlib>=3.3.3
  - pytest>=6.2.1
  - python>=3.8
  - scipy>=1.6.0
  - sphinx>=3.4.3
  - tqdm>=4.56.0
  # Numerical backend installations with CUDA support where possible:
  - numpy>=1.19.5
  - pytorch>=1.9 # CPU version
  - tensorflow-gpu
  # jaxlib with CUDA support is not available for conda
  - pip:
      - --find-links https://storage.googleapis.com/jax-releases/jax_releases.html
      - jax[cpu]>=0.2.22 # this will only work on linux. for win see e.g. https://github.com/cloudhan/jax-windows-builder
        # CPU version
ilan-gold commented 1 year ago

@gomezzz Can we just release a version with dependencies pinned against a <= as well?

ilan-gold commented 1 year ago

I will at least look into the torch problem now.

gomezzz commented 1 year ago

@gomezzz Can we just release a version with dependencies pinned against a <= as well?

@ilan-gold We can, but I don't think it would be ideal, because if somebody runs

conda install tensorflow they will get the newest version;

thus, the prospective following `conda install torchquad` would lead either to a forced downgrade or to no compatible version being found. Worst case we can do it, but I don't think the APIs are likely to have changed much. Also, I think we only directly import from the frameworks in the tests (and maybe one or two other places?) to avoid problems like this.
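For context, upper-bound pinning in the conda environment file would look roughly like this (the bounds below are illustrative assumptions, not versions anyone verified in this thread):

```yaml
dependencies:
  # Illustrative upper bounds only; the actual breaking versions
  # would need to be verified before pinning in a release.
  - pytorch>=1.9,<2.0
  - tensorflow-gpu>=2.4,<2.12
```

The downside described above still applies: a user who already has a newer framework installed would face either a forced downgrade or an unsolvable environment.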

I will at least look into the torch problem now.

Thank you! :pray: If lack of a GPU is the problem, running in Google Colab may be an option (haven't tried it though); otherwise I might be able to take a look some time next week :)

ilan-gold commented 1 year ago

That's a good point. I can't actually run the latest torch on the hardware I have access to because the CUDA version is too low (if I remember correctly).

ilan-gold commented 1 year ago

I cannot reproduce this on Colab, although I get a different, somewhat stranger error that might not actually be an error, since the reported numerical error is still very good (presumably hardware/torch-version specific, but not the error you're seeing in the logs): https://colab.research.google.com/drive/1lFpdtY5zV7VpW88aazedA3n4khedHDQP?usp=sharing :( very bad

ilan-gold commented 1 year ago

Also @gomezzz can you point to where the tests are failing? I don't see anything in the actions

ilan-gold commented 1 year ago

If this was your computer, it could be specific to something there.

gomezzz commented 1 year ago

@ilan-gold Thanks for the efforts! I will try to run precisely your code on my machine next week to see if that can help pin it down. (and to confirm I didn't mess up the setup etc. :D)

ilan-gold commented 1 year ago

There's some Colab-specific stuff in there, but I'm not sure it's any different from what you would do. I just clone, check out the release branch, install deps (after deleting the named env from the .yml file, because you can only use base on Colab), and then run pytest.

ilan-gold commented 1 year ago

Looking at this again, I used environment_backends_all.yml, not what you posted here. I'll try it again.

ilan-gold commented 1 year ago

Ok, that didn't change the outcome. Sorry @gomezzz :(

gomezzz commented 1 year ago

@ilan-gold running your notebook in colab, I get

FAILED gauss_test.py::test_integrate_torch - assert (3 > 3 or 7.105427357601002e-15 < 2e-16)
FAILED gauss_test.py::test_integrate_tensorflow - assert (3 > 3 or 7.105427357601002e-15 < 2e-16)

Was this what you got?

On one of our GPU servers I get

FAILED gauss_test.py::test_integrate_torch - assert (3 > 3 or 7.105427357601002e-15 < 2e-16)
FAILED gauss_test.py::test_integrate_tensorflow - assert (3 > 3 or 7.105427357601002e-15 < 2e-16)
FAILED gradient_test.py::test_gradients_torch - AssertionError: assert 0.11964358930767993 < 0.1

so it seems the test bounds are a bit too harsh for the Gauss and the gradient tests?

I also get 115 warnings, oof. We might wanna look at those at some point :D :see_no_evil:
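For illustration, the failing Gauss assertion compares the measured error against a bound near float64 machine epsilon; relaxing it by roughly an order of magnitude would let the GPU runs above pass. A minimal sketch, where the 1e-14 bound is an assumption and not the tolerance the maintainers actually settled on:

```python
# Values taken from the failing log above; the relaxed tolerance is an
# assumption for illustration, not torchquad's actual chosen bound.
measured_error = 7.105427357601002e-15
strict_tol = 2e-16    # roughly float64 machine epsilon; fails on GPU
relaxed_tol = 1e-14   # hypothetical looser bound

assert not measured_error < strict_tol   # reproduces the GPU failure
assert measured_error < relaxed_tol      # passes with the relaxed bound
```

GPU kernels may reduce sums in a different order than CPU code, so a small increase in accumulated rounding error across devices is expected.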

gomezzz commented 1 year ago

(I don't get the errors I faced previously on either, so I think probably something went wrong setting up the env before.)

ilan-gold commented 1 year ago

@gomezzz Yes, these are the sorts of errors I saw. Should we bump the tolerance? If it's passing here, is it a problem? I guess so, since tests are made to be run locally.

gomezzz commented 1 year ago

@gomezzz Yes, these are the sorts of errors I saw. Should we bump the tolerance? If it's passing here, is it a problem? I guess so, since tests are made to be run locally.

Yeah, let's increase it. I'd always aim to have passing tests on GPUs too :)