FAIRiCUBE / FAIRiCUBE-Hub-issue-tracker

FAIRiCUBE HUB issue tracker

GPU python kernel for UC3 / UC4 #75

Open jetschny opened 2 months ago

jetschny commented 2 months ago

We would like to request a small GPU Python kernel for UC3 / UC4 under the EOX Lab environment, which will further test the headless execution functionality and the provisioning of GPU resources to the UCs. One multi-GPU machine with e.g. 8x A100 would be needed for testing and semi-production.

BachirNILU commented 2 months ago

As we understand from @Schpidi 's comment:

"The `node_purpose` can either have the value `user` to use a CPU node with 2 CPUs and 8 GB memory or `userg1` to use a GPU node with 4 CPUs, 16 GB memory, and a Tesla T4 GPU with 16 GB memory."

So `node_purpose` takes only two values, `user` or `userg1` (if we need GPUs). We are wondering if there is another value/option with a higher GPU configuration?

Schpidi commented 2 months ago

Correct, `node_purpose` currently supports two values: `user` for a CPU instance and `userg1` for a GPU one. For the GPU one, the AWS EC2 type g4dn.xlarge is used, providing one NVIDIA T4 GPU. This type costs less than 1€ per hour.
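As a minimal sketch of how a headless run could select one of these profiles (the payload field names and notebook path here are placeholders, not the actual API; see the syntax reference linked at the end of this thread):

```python
import json

# Node profiles as described in this thread; keys are the accepted
# node_purpose values, values are human-readable summaries.
NODE_PROFILES = {
    "user": "CPU node: 2 CPUs, 8 GB memory",
    "userg1": "GPU node (g4dn.xlarge): 4 CPUs, 16 GB memory, 1x NVIDIA T4",
}

def build_headless_payload(notebook, node_purpose="user"):
    """Build a hypothetical request payload for a headless run,
    validating node_purpose against the profiles described above."""
    if node_purpose not in NODE_PROFILES:
        raise ValueError(f"unknown node_purpose: {node_purpose!r}")
    return {"notebook": notebook, "node_purpose": node_purpose}

payload = build_headless_payload("UC3/demo.ipynb", node_purpose="userg1")
print(json.dumps(payload))
```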

I believe we can configure an additional instance type, for example a p4d.24xlarge providing 8 NVIDIA A100 GPUs, but the cost is more than 40€ per hour. Please also note that we have no particular experience with multi-GPU machines.
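To make the cost gap concrete, a back-of-the-envelope comparison using the hourly rates quoted above (approximate; actual AWS pricing varies by region and date):

```python
# Rough hourly rates from this thread, in EUR.
rates_eur_per_hour = {
    "g4dn.xlarge (1x T4)": 1.0,     # "less than 1 EUR per hour"
    "p4d.24xlarge (8x A100)": 40.0, # "more than 40 EUR per hour"
}

def run_cost(instance, hours):
    """Estimated cost in EUR of keeping an instance up for `hours`."""
    return rates_eur_per_hour[instance] * hours

for instance in rates_eur_per_hour:
    print(f"{instance}: ~{run_cost(instance, 24):.0f} EUR for a 24 h test run")
```

A single 24-hour test on the 8x A100 machine already costs roughly 40x as much as on the current single-T4 node, which is why the cheaper multi-GPU options are worth considering first.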

Anyway, I'll request the configuration from our DevOps team and keep you informed.

eox-cs1 commented 1 month ago

All the headless endpoints can be started either with:

In addition, the smallest multi-GPU VM available in eu-central-1, g4dn.12xlarge (4x NVIDIA T4, 16 GiB each) at $4.89 per hour, has been configured as "userg2". Only UC3 (eurodatacube17) and UC4 (eurodatacube18) are whitelisted for using "userg2".
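Once a kernel is running on the "userg2" node, a quick framework-independent sanity check of how many GPUs are actually visible (a sketch using `nvidia-smi`, which ships with the NVIDIA driver; it falls back to 0 on a CPU `user` node):

```python
import shutil
import subprocess

def visible_gpu_count():
    """Count GPUs listed by `nvidia-smi -L`.

    Returns 0 when nvidia-smi is not installed, e.g. on a CPU node.
    Each detected GPU is printed by nvidia-smi as a line starting
    with "GPU <index>:".
    """
    if shutil.which("nvidia-smi") is None:
        return 0
    out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return sum(1 for line in out.stdout.splitlines() if line.startswith("GPU "))

print(f"visible GPUs: {visible_gpu_count()}")
```

On the g4dn.12xlarge node this should report 4 GPUs; on a `user` node it reports 0.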

For further usage (syntax) information see: https://github.com/FAIRiCUBE/FAIRiCUBE-Hub-issue-tracker/issues/70