exo-explore / exo

Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚
GNU General Public License v3.0
16.33k stars 866 forks

Unhealthy nodes may force host switch to CLANG. CUDA=1 will not enforce behavior #395

Open FFAMax opened 1 month ago

FFAMax commented 1 month ago

Immediately after a restart, a host may switch from the GPU/CUDA device to CLANG, and nothing can be done except shutting down the other peers to find which one is causing the issue. On the unhealthy host, the issue is detected by basic health checks:

nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 535.183

This condition broke the entire network. Additional checks are probably needed to filter peers by some criteria and reject incompatible ones.
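A minimal sketch of the kind of local check meant here, in Python. It is not part of exo: the function name `nvidia_smi_healthy` and the `runner` parameter are hypothetical, and it assumes that `nvidia-smi` either exits with a non-zero status or prints the "Failed to initialize NVML" message when the driver is broken.

```python
import subprocess

# Message taken from the nvidia-smi output shown above.
NVML_FAILURE = "Failed to initialize NVML"


def nvidia_smi_healthy(timeout_s: float = 5.0, runner=subprocess.run) -> bool:
    """Return True only if `nvidia-smi` runs cleanly on this host.

    `runner` is a hypothetical hook (defaults to subprocess.run) so the
    check can be exercised without a real GPU.
    """
    try:
        result = runner(
            ["nvidia-smi"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # Binary missing or the driver hangs: treat the GPU as unusable.
        return False

    output = (result.stdout or "") + (result.stderr or "")
    if NVML_FAILURE in output:
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("GPU healthy:", nvidia_smi_healthy())
```

A node could run such a check before joining and refuse to advertise itself as a CUDA peer when it fails. How that result would be communicated to other peers is left open here.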

AlexCheema commented 1 month ago

CLANG and CUDA peers that use tinygrad are compatible.

FFAMax commented 4 weeks ago

> CLANG and CUDA peers that use tinygrad are compatible.

The case: on one node, due to some glitch, the GPU stopped responding (e.g. a driver issue, a GPU hardware issue, etc.). This makes the entire network unable to proceed until the node is detected and manually turned off.

In other words: one can spoof an incompatible peer and make the network inoperable.
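One way to keep a single stuck node from stalling everyone is to probe peers with a timeout and drop the ones that do not answer. The sketch below only illustrates the idea and is not exo's API: `responsive_peers` and the per-peer `check` coroutines are hypothetical stand-ins for whatever health probe the network would actually use.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

# Hypothetical probe type: an async callable returning True if the peer is usable.
HealthCheck = Callable[[], Awaitable[bool]]


async def responsive_peers(
    peers: Dict[str, HealthCheck],
    timeout_s: float = 2.0,
) -> List[str]:
    """Return the ids of peers whose probe returns True within `timeout_s`."""

    async def probe(peer_id: str, check: HealthCheck) -> Tuple[str, bool]:
        try:
            ok = await asyncio.wait_for(check(), timeout=timeout_s)
        except (asyncio.TimeoutError, OSError):
            # A hung or unreachable peer is treated as unhealthy.
            ok = False
        return peer_id, ok

    # Probe all peers concurrently; gather keeps the input order.
    results = await asyncio.gather(
        *(probe(peer_id, check) for peer_id, check in peers.items())
    )
    return [peer_id for peer_id, ok in results if ok]
```

With this shape, a peer whose GPU has stopped responding is excluded after `timeout_s` seconds instead of blocking the whole network until someone turns it off manually.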