Open FFAMax opened 1 month ago
CLANG and CUDA peers that use tinygrad are compatible.
The case: on one node, some glitch caused the GPU to stop responding (e.g. a driver issue or a GPU hardware issue). This leaves the entire network unable to proceed until the node is identified and manually turned off.
In other words: one can spoof an incompatible peer and make the network inoperable.
Immediately after a restart, the host may switch from the GPU/CUDA device to CLANG, and nothing can be done except shutting down the other peers one by one to find which is causing the issue. On the unhealthy host, the issue is detected by basic health checks:
This condition broke the entire network. We probably need additional checks that filter peers on some criteria, so that incompatible peers are rejected.
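As a rough illustration of such a check, here is a minimal sketch that rejects peers whose advertised backend differs from the local one. Everything in it is hypothetical: `PeerInfo`, its `backend` field, and `split_peers` are names made up for this example and are not part of any existing API. It also assumes each peer advertises its backend name (for instance "CUDA" or "CLANG") during discovery, which may not be the case today.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerInfo:
    # Hypothetical peer descriptor, invented for this sketch.
    peer_id: str
    backend: str  # backend name as advertised by the peer, e.g. "CUDA" or "CLANG"


def split_peers(local_backend, peers):
    """Split peers into (accepted, rejected) by comparing advertised backends."""
    accepted = [p for p in peers if p.backend == local_backend]
    rejected = [p for p in peers if p.backend != local_backend]
    return accepted, rejected


peers = [
    PeerInfo("node-a", "CUDA"),
    PeerInfo("node-b", "CLANG"),  # e.g. a host that fell back to CLANG after a restart
    PeerInfo("node-c", "CUDA"),
]
accepted, rejected = split_peers("CUDA", peers)
for p in rejected:
    print(f"rejecting peer {p.peer_id}: backend {p.backend} != CUDA")
```

Strict equality on the backend name is only one possible criterion. The point is that a mismatching peer gets rejected and logged by name, instead of stalling the network until someone finds it by hand.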