Closed elliottslaughter closed 2 years ago
@janden @garrettwrong Dear Joakim & Garrett, do either of you have a few minutes to look into what's happening here? Since we're starting to get a bunch of users, it would be great to have CI? (of course, various devs like Melody, Johannes, and occasionally me, are regularly checking it locally..) It fails at the Py import stage, ie it's an env issue not a code issue:
from pycuda._driver import * # noqa
E SystemError: initialization of _driver raised unreported exception
I know nothing about the jenkins GPU setup... Thanks, Alex
So it looks like there are two issues here: – CI is failing currently due to NumPy version mismatches. This should be handled by #129. – We are currently only running the Python tests, not the C++ tests.
Regarding the issue of the C++ tests, I think that our idea was that the Python tests could cover most of the code, so any bugs would trip them. That being said, I don't know that that is achieved, so it might make sense to include them. @ahbarnett, what do you think?
If we do want to include them, we have to see how that might best be achieved. It could be here in the Jenkinsfile or it could be in the Dockerfile where the library is being built (and the other calls to make
are performed).
Actually I have a question about this for @janden: does the Jenkins test (now using python3 -m pip install -e
)
actually compile libcufinufft.so
locally, or are this precompiled and downloaded from somewhere else?
It the .so is locally compiled, then I think the py tests are good enough for CI. The tests in make check
are more extensive, less supportable, and let's keep them for devs only.
But @elliottslaughter, do you suspect particular make check
items of not working on the Jenkins system?
Maybe now since commit 46a8e8a fixes the environment, you can upstream fetch that to your branch anyway.
I think I was confused about the make check
because I expected an issue like https://github.com/flatironinstitute/cufinufft/pull/116#issuecomment-1030468068 to break the build, but it doesn't. (I suspect it may be a GCC vs Clang issue, so the default CUDA setup with GCC doesn't catch it.)
Anyway, this is your project so feel free to do as you see fit, but my own two cents would be that the more tests, the better. Even if the Python and C++ tests overlap, make check
is at least testing a different language interface. The CI is the perfect place to put this since you don't need to remember to run it, and it helps sanity check that everything stays working.
I rebased and the run passes, but maybe I messed the Jenkins syntax because I can't actually see the make check
running anywhere in the job output.
Thanks Elliott for finding the MAX_NF bug, and we will aim to add make check to CI as you suggest. Best, Alex
On Sat, Feb 5, 2022 at 1:51 AM Elliott Slaughter @.***> wrote:
I think I was confused about the make check because I expected an issue like #116 (comment) https://github.com/flatironinstitute/cufinufft/pull/116#issuecomment-1030468068 to break the build, but it doesn't. (I suspect it may be a GCC vs Clang issue, so the default CUDA setup with GCC doesn't catch it.)
Anyway, this is your project so feel free to do as you see fit, but my own two cents would be that the more tests, the better. Even if the Python and C++ tests overlap, make check is at least testing a different language interface. The CI is the perfect place to put this since you don't need to remember to run it, and it helps sanity check that everything stays working.
I rebased and the run passes, but maybe I messed the Jenkins syntax because I can't actually see the make check running anywhere in the job output.
— Reply to this email directly, view it on GitHub https://github.com/flatironinstitute/cufinufft/pull/125#issuecomment-1030563064, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACNZRSUCHPHRQ4ADXSNKHQLUZTCILANCNFSM5NB3KEXA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
You are receiving this because you were mentioned.Message ID: @.***>
-- *---------------------------------------------------------------------~^`^~._.~' |\ Alex H. Barnett Center for Computational Mathematics, Flatiron Institute | \ http://users.flatironinstitute.org/~ahb 646-876-5942
Is it possible that you're running a centrally-controlled Jenkinsfile
, rather than the one in the repo? I tried another version of my change, and it literally isn't taking the script I'm putting in. I'm starting to suspect it's not actually running the Jenkinsfile
out of the repository at all but using a centrally managed version, or something along those lines.
I'm pinging @janden on this, since he set it up. Best, Alex
On Wed, Feb 9, 2022 at 1:41 AM Elliott Slaughter @.***> wrote:
Is it possible that you're running a centrally-controlled Jenkinsfile, rather than the one in the repo? I tried another version of my change, and it literally isn't taking the script I'm putting in. I'm starting to suspect it's not actually running the Jenkinsfile out of the repository at all but using a centrally managed version, or something along those lines.
— Reply to this email directly, view it on GitHub https://github.com/flatironinstitute/cufinufft/pull/125#issuecomment-1033403171, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACNZRSQXZHZ2SSQN7QQRRP3U2IEANANCNFSM5NB3KEXA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
You are receiving this because you were mentioned.Message ID: @.***>
-- *---------------------------------------------------------------------~^`^~._.~' |\ Alex H. Barnett Center for Computational Mathematics, Flatiron Institute | \ http://users.flatironinstitute.org/~ahb 646-876-5942
Actually I have a question about this for @janden: does the Jenkins test (now using
python3 -m pip install -e
) actually compilelibcufinufft.so
locally, or are this precompiled and downloaded from somewhere else?
It's compiled as part of the Dockerfile each time the CI is run.
Anyway, this is your project so feel free to do as you see fit, but my own two cents would be that the more tests, the better. Even if the Python and C++ tests overlap,
make check
is at least testing a different language interface. The CI is the perfect place to put this since you don't need to remember to run it, and it helps sanity check that everything stays working.
I agree.
Is it possible that you're running a centrally-controlled Jenkinsfile, rather than the one in the repo?
I'm not sure. It could be that it runs the one in the master branch (for safety reasons). That being said, I'm fairly certain that the Dockerfile is being updated from the branch, so you may have better luck putting in the make check
line there. I also think it makes more sense to put it there since that's where the other calls to make
are.
Good news: running make check
in the Dockerfile does indeed cause it to run.
Bad news: it is now hitting:
../bin/spread1d_test: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory
I'm not really familiar with how these containers are set up, so I'm not sure where this library would be located?
I'm not really familiar with how these containers are set up, so I'm not sure where this library would be located?
This line https://github.com/flatironinstitute/cufinufft/blob/bb1453dfe9dc12159e8e346eae79ad4d71fd566f/ci/docker/cuda10.1/Dockerfile-x86_64#L72 sets the necessary environment variable. Try invoking make check
after it.
@janden I believe the most recent change I made matches what you suggested, but the Jenkins build still fails.
Looks like the LD_LIBRARY_PATH
is still incorrect. Let me try something.
So I think I know what's happening. The way the docker image is set up, it doesn't actually have access to the GPU during the build phase. At this point, all that's happening is that the library is being compiled, so it doesn't actually need the GPU. Later, when we're testing, we do use the GPU, which is why Jenkins runs the docker image with the --gpus 1
flag set (otherwise we'd see the same crash during pytest
).
It may be possible to give the docker image access to the GPU during the build phase, but this seems complicated. I therefore suggest you move the make check
invocation back into the Jenkinsfile. Then we get back to the original problem that Jenkins won't run it until it's in master, but we'll just have to merge it in and see what happens.
Ok, I have reset this PR to its original form, adding the make check
line to Jenkinsfile.
However, I really think you should check the Jenkins configuration first. As you can see at the link below, one of the options to configure the contents of a Jenkinsfile is through the admin UI. If this is how the instance is currently set up, it literally won't matter what is in the master
branch (or anything else, aside from the UI).
https://www.jenkins.io/doc/book/pipeline/getting-started/#through-the-classic-ui
You'll want to double check that you've followed these instructions to make sure you're pulling the Jenkinsfile from the repository itself:
https://www.jenkins.io/doc/book/pipeline/getting-started/#defining-a-pipeline-in-scm
I don't see anything in there that would suggest that it only pays attention to certain branches (which is why I'm suggesting double checking that it's not configured via the UI).
I don't have admin access to Jenkins, so there is no good way for me to verify or modify these settings.
It looks like Jenkins doesn't trust the Jenkinsfile in this PR (see https://jenkins.flatironinstitute.org/job/cufinufft/indexing/events), but it does trust the one from master. It might trust it if the PR comes from this repository. Will give it a try. Otherwise it won't hurt to just merge it.
Ok. Looks like my hypothesis was correct.
I suggest we close this PR, undraft #134, and continue there. Since it's a branch on the main repository, Jenkins trusts the Jenkinsfile, so we can make any changes we want there and it should show up in the CI.
Looks like make check
ran in master. Thanks all!
I noticed that
make check
is not being run in Jenkins. I'm adding it because I suspect there are failures in the test suite that aren't being caught at the moment.