Closed jocado closed 6 months ago
@lucaskanashiro
Please take a look at this and review. Currently, it's using a release candidate for the container toolkit parts, which I would plan to update once once the proper release tag is available upstream, and everything should in in place to at least review.
Thank you :+1:
Sorry for the delay @jocado , I was on vacation. I am planning to take a look at this next week.
@jocado unfortunately, some urgent issues came up on my side and I did not manage to review this yet, I might need a couple of weeks to get to it. Sorry again.
Hi @lucaskanashiro
Thanks for coming back to let me know. I do appreciate that. I'm keen to get this moving, so as soon as you are able to review it I would be grateful.
Just a note that the nvidia toolkit revision v1.15.0
still hasn't actually been released, so we can't quite merge this yet anyway it seems. Hopefully it will appear over the next week or two. In the meantime I have bumped to the latest RC, v1.15.0-rc.2
@jocado thanks for your patience, I am starting to take a look at this PR right now. Did they release version 1.15.0
of nvidia toolkit? Apparently not yet.
Hi @lucaskanashiro
Thanks for coming back to me.
Unfortuanly it's still not released, which is why I haven't followed up again in the meantime. It's currently rc4
:
Would you be apposed to going over a review in the meantime, on the basis that once the release is made I can just update the MR with that single change and ask you to merge ?
Would you be apposed to going over a review in the meantime, on the basis that once the release is made I can just update the MR with that single change and ask you to merge ?
No, let's do that. I'll go through your changes now. We fix whatever is needed and get it ready to merge.
I made an initial pass mostly taking a look at the commits and I think we need to fix your commit history before being ready to merge this branch. There are some commits applying changes and later on undoing them, or moving them to somewhere else. We need to consolidate all those changes in a clean and concise git commit history, so we do not pollute the main branch.
I took some notes that could help you improve the commits here:
Commits 945f4e22bcf374ea2a0e96f94f5ccca0388294e3 and fc01b6b1972a898e3ae8cc196f7b71a2abfa450e are not needed, we can remove them from the commit history.
Changes in bin/nvidia-container-toolkit should be reviewed and maybe squashed into a single commit, many of them apply changes and later they are moved to other files. Such as commits cbc98bfed82fedfdb787d5c232e431667fca7ea5 and fc01b6b1972a898e3ae8cc196f7b71a2abfa450e.
Let's try to avoid commits like "Typo" from the git commit history. Please fix up the needed commit.
The remaining changes look sane to me. I do not have a nvidia graphic card here to properly test everything, but I guess you already confirmed that everything is working as expected, right?
Hi @lucaskanashiro
I took some notes that could help you improve the commits here:
Thanks for those, and apologies if the history is not as you would expect. I'm not so use dot having having to clean things up so much, and appreciate the pointers.
I've had a go at rebasing. PTAL and let me know if you'm close or not :smile:
The remaining changes look sane to me. I do not have a nvidia graphic card here to properly test everything, but I guess you already confirmed that everything is working as expected, right?
Yes, I have validated on the hardware we have for testing several different models of nvidia card we have for testing, with some cuda test containers we have for different cuda versions.
Apologies @lucaskanashiro - I also just added one final change to make sure nvidia-smi
is added to the CDI config, but that should be it. I squashed it against the central lib change.
@jocado thanks, it looks way better now :)
I think we can merge this once you remove the draft label.
Great - thank you @lucaskanashiro
It could well be another week or so, as the most recent RC was released this week, and I think they like to soak them for 2 weeks. I will ping you when it's released and I've updated the PR.
Thanks for taking a look thus far :+1:
@lucaskanashiro The nvidia toolkit version was finally released yesterday.
I've updated the version. Do you want me to try and squash that commit, and change anything else, or are we good to merge ?
Thanks @jocado !
Could you squash 663c5e2478c36eb6dd436d3ba9bdc5de9992f15b and 0f4e7c2a2c8664c8a782da2389a10bb41d6e022f ?
In 4230a794976df6368d141da6181982f8fe68ff57 , we also have some typos in the commit message. realted funtion
should be replaced by related functions
, am I right?
@lucaskanashiro Sure, that's done. PTAL.
@jocado the typo in the commit message of c7a15f635037aa16500bf19c1e6bc4a2848f4aef is still there. Could you fix it?
Ahh, sorry I really thought I had. Not sure what happened there. Done.
PTAL @lucaskanashiro
LGTM! Can we merge this now? I removed the Draft status and marked it ready for review, but I'd like your +1 before merging the changes.
:+1: from - it's ready when you are :)
There is revision 2925
in the latest/edge
channel with those changes, please test it!
This enhances the nvidia runtime configuration and support, and enforces use of CDI by default.
After the versions of nvidia related parts were increased to
13.5
, the CDI generation was broken. This led to a deeper review of the nvidia runtime configuration options, and some resulting proposed enhancements. This also included raising a bug in the nvidia-container-toolkit, which resulted in the introduction of specified support forlibrary-search-paths
, which fixes the CDI generation to properly support Ubuntu Core, which will land in thev1.15.0
release.Summary of changes:
v1.15.0
The following snap config options currently supported:
nvidia-support.disabled
nvidia-support.runtime.config-override
nvidia-support.cdi.device-name-strategy