canonical / docker-snap

https://snapcraft.io/docker
MIT License
54 stars 27 forks source link

NVIDIA support - improvements relating to runtime config and CDI #157

Closed jocado closed 6 months ago

jocado commented 10 months ago

This enhances the nvidia runtime configuration and support, and enforces use of CDI by default.

After the versions of nvidia related parts were increased to 13.5 , the CDI generation was broken. This led to a deeper review of the nvidia runtime configuration options, and some resulting proposed enhancements. This also included raising a bug in the nvidia-container-toolkit, which resulted in the introduction of specified support for library-search-paths, which fixes the CDI generation to properly support Ubuntu Core, which will land in the v1.15.0 release.

Summary of changes:

The following snap config options currently supported:

jocado commented 10 months ago

@lucaskanashiro

Please take a look at this and review. Currently, it's using a release candidate for the container toolkit parts, which I would plan to update once once the proper release tag is available upstream, and everything should in in place to at least review.

Thank you :+1:

lucaskanashiro commented 9 months ago

Sorry for the delay @jocado , I was on vacation. I am planning to take a look at this next week.

lucaskanashiro commented 9 months ago

@jocado unfortunately, some urgent issues came up on my side and I did not manage to review this yet, I might need a couple of weeks to get to it. Sorry again.

jocado commented 9 months ago

Hi @lucaskanashiro

Thanks for coming back to let me know. I do appreciate that. I'm keen to get this moving, so as soon as you are able to review it I would be grateful.

jocado commented 8 months ago

Just a note that the nvidia toolkit revision v1.15.0 still hasn't actually been released, so we can't quite merge this yet anyway it seems. Hopefully it will appear over the next week or two. In the meantime I have bumped to the latest RC, v1.15.0-rc.2

lucaskanashiro commented 7 months ago

@jocado thanks for your patience, I am starting to take a look at this PR right now. Did they release version 1.15.0 of nvidia toolkit? Apparently not yet.

jocado commented 7 months ago

Hi @lucaskanashiro

Thanks for coming back to me.

Unfortuanly it's still not released, which is why I haven't followed up again in the meantime. It's currently rc4:

jocado commented 7 months ago

Would you be apposed to going over a review in the meantime, on the basis that once the release is made I can just update the MR with that single change and ask you to merge ?

lucaskanashiro commented 7 months ago

Would you be apposed to going over a review in the meantime, on the basis that once the release is made I can just update the MR with that single change and ask you to merge ?

No, let's do that. I'll go through your changes now. We fix whatever is needed and get it ready to merge.

lucaskanashiro commented 7 months ago

I made an initial pass mostly taking a look at the commits and I think we need to fix your commit history before being ready to merge this branch. There are some commits applying changes and later on undoing them, or moving them to somewhere else. We need to consolidate all those changes in a clean and concise git commit history, so we do not pollute the main branch.

I took some notes that could help you improve the commits here:

lucaskanashiro commented 7 months ago

The remaining changes look sane to me. I do not have a nvidia graphic card here to properly test everything, but I guess you already confirmed that everything is working as expected, right?

jocado commented 7 months ago

Hi @lucaskanashiro

I took some notes that could help you improve the commits here:

Thanks for those, and apologies if the history is not as you would expect. I'm not so use dot having having to clean things up so much, and appreciate the pointers.

I've had a go at rebasing. PTAL and let me know if you'm close or not :smile:

jocado commented 7 months ago

The remaining changes look sane to me. I do not have a nvidia graphic card here to properly test everything, but I guess you already confirmed that everything is working as expected, right?

Yes, I have validated on the hardware we have for testing several different models of nvidia card we have for testing, with some cuda test containers we have for different cuda versions.

jocado commented 7 months ago

Apologies @lucaskanashiro - I also just added one final change to make sure nvidia-smi is added to the CDI config, but that should be it. I squashed it against the central lib change.

lucaskanashiro commented 7 months ago

@jocado thanks, it looks way better now :)

I think we can merge this once you remove the draft label.

jocado commented 7 months ago

Great - thank you @lucaskanashiro

It could well be another week or so, as the most recent RC was released this week, and I think they like to soak them for 2 weeks. I will ping you when it's released and I've updated the PR.

Thanks for taking a look thus far :+1:

jocado commented 6 months ago

@lucaskanashiro The nvidia toolkit version was finally released yesterday.

I've updated the version. Do you want me to try and squash that commit, and change anything else, or are we good to merge ?

lucaskanashiro commented 6 months ago

Thanks @jocado !

Could you squash 663c5e2478c36eb6dd436d3ba9bdc5de9992f15b and 0f4e7c2a2c8664c8a782da2389a10bb41d6e022f ?

In 4230a794976df6368d141da6181982f8fe68ff57 , we also have some typos in the commit message. realted funtion should be replaced by related functions, am I right?

jocado commented 6 months ago

@lucaskanashiro Sure, that's done. PTAL.

lucaskanashiro commented 6 months ago

@jocado the typo in the commit message of c7a15f635037aa16500bf19c1e6bc4a2848f4aef is still there. Could you fix it?

jocado commented 6 months ago

Ahh, sorry I really thought I had. Not sure what happened there. Done.

PTAL @lucaskanashiro

lucaskanashiro commented 6 months ago

LGTM! Can we merge this now? I removed the Draft status and marked it ready for review, but I'd like your +1 before merging the changes.

jocado commented 6 months ago

:+1: from - it's ready when you are :)

lucaskanashiro commented 6 months ago

There is revision 2925 in the latest/edge channel with those changes, please test it!