Open Kreyren opened 3 years ago
Help-wanted: Suggestions to handle the VRAM overheating appreciated.
DISCLAIMER: Don't do this, not to be used for training.
SOLVED: Conductive dust was causing a signal to hang up
I put it in an ultrasonic cleaner, which didn't fix the issue but seemed to reduce the amount of fragmentation on the screen. Did two more passes without any major improvement.
So I did this
then put it back in the ultrasonic cleaner filled with 99.6% isopropyl alcohol, dried it with hot air, and now it works.
I believe the dust was trapped under the VRAM or some other component where the ultrasonic cleaner couldn't easily reach, but pressurized water from a shower head set to jet could.
FWIW, the water is reverse-osmosis water that goes through a filter and de-gasser. I wouldn't do this with regular tap water due to the risk of minerals sticking to the components.
Also did a VRAM stress test:
kreyren@dreamon:~/Downloads/memtestG80$ sudo ./memtestG80 3500 5
-------------------------------------------------------------
| MemtestG80 v1.00 |
| |
| Usage: memtestG80 [flags] [MB GPU RAM to test] [# iters] |
| |
| Defaults: GPU 0, 128MB RAM, 50 test iterations |
| Amount of tested RAM will be rounded up to nearest 2MB |
-------------------------------------------------------------
Available flags:
--gpu N ,-g N : run test on the Nth (from 0) CUDA GPU
--license ,-l : show license terms for this build
Running 5 iterations of tests over 3500 MB of GPU memory on card 0: GeForce GTX 970
Running memory bandwidth test over 20 iterations of 1750 MB transfers...
Estimated bandwidth 89171.97 MB/s
Test iteration 1 (GPU 0, 3500 MiB): 0 errors so far
Moving Inversions (ones and zeros): 0 errors (161 ms)
Memtest86 Walking 8-bit: 0 errors (1275 ms)
True Walking zeros (8-bit): 0 errors (640 ms)
True Walking ones (8-bit): 0 errors (638 ms)
Moving Inversions (random): 0 errors (160 ms)
Memtest86 Walking zeros (32-bit): 0 errors (2550 ms)
Memtest86 Walking ones (32-bit): 0 errors (2554 ms)
Random blocks: 0 errors (288 ms)
Memtest86 Modulo-20: 0 errors (5342 ms)
Logic (one iteration): 0 errors (82 ms)
Logic (4 iterations): 0 errors (87 ms)
Logic (shared memory, one iteration): 0 errors (82 ms)
Logic (shared-memory, 4 iterations): 0 errors (87 ms)
Test iteration 2 (GPU 0, 3500 MiB): 0 errors so far
Moving Inversions (ones and zeros): 0 errors (161 ms)
Memtest86 Walking 8-bit: 0 errors (1273 ms)
True Walking zeros (8-bit): 0 errors (622 ms)
True Walking ones (8-bit): 0 errors (611 ms)
Moving Inversions (random): 0 errors (156 ms)
Memtest86 Walking zeros (32-bit): 0 errors (2456 ms)
Memtest86 Walking ones (32-bit): 0 errors (2450 ms)
Random blocks: 0 errors (285 ms)
Memtest86 Modulo-20: 0 errors (5061 ms)
Logic (one iteration): 0 errors (79 ms)
Logic (4 iterations): 0 errors (81 ms)
Logic (shared memory, one iteration): 0 errors (79 ms)
Logic (shared-memory, 4 iterations): 0 errors (81 ms)
Test iteration 3 (GPU 0, 3500 MiB): 0 errors so far
Moving Inversions (ones and zeros): 0 errors (155 ms)
Memtest86 Walking 8-bit: 0 errors (1228 ms)
True Walking zeros (8-bit): 0 errors (610 ms)
True Walking ones (8-bit): 0 errors (611 ms)
Moving Inversions (random): 0 errors (154 ms)
Memtest86 Walking zeros (32-bit): 0 errors (2440 ms)
Memtest86 Walking ones (32-bit): 0 errors (2438 ms)
Random blocks: 0 errors (285 ms)
Memtest86 Modulo-20: 0 errors (5030 ms)
Logic (one iteration): 0 errors (80 ms)
Logic (4 iterations): 0 errors (81 ms)
Logic (shared memory, one iteration): 0 errors (78 ms)
Logic (shared-memory, 4 iterations): 0 errors (82 ms)
Test iteration 4 (GPU 0, 3500 MiB): 0 errors so far
Moving Inversions (ones and zeros): 0 errors (154 ms)
Memtest86 Walking 8-bit: 0 errors (1228 ms)
True Walking zeros (8-bit): 0 errors (610 ms)
True Walking ones (8-bit): 0 errors (609 ms)
Moving Inversions (random): 0 errors (155 ms)
Memtest86 Walking zeros (32-bit): 0 errors (2436 ms)
Memtest86 Walking ones (32-bit): 0 errors (2442 ms)
Random blocks: 0 errors (281 ms)
Memtest86 Modulo-20: 0 errors (5040 ms)
Logic (one iteration): 0 errors (79 ms)
Logic (4 iterations): 0 errors (81 ms)
Logic (shared memory, one iteration): 0 errors (78 ms)
Logic (shared-memory, 4 iterations): 0 errors (81 ms)
Test iteration 5 (GPU 0, 3500 MiB): 0 errors so far
Moving Inversions (ones and zeros): 0 errors (155 ms)
Memtest86 Walking 8-bit: 0 errors (1231 ms)
True Walking zeros (8-bit): 0 errors (611 ms)
True Walking ones (8-bit): 0 errors (612 ms)
Moving Inversions (random): 0 errors (155 ms)
Memtest86 Walking zeros (32-bit): 0 errors (2439 ms)
Memtest86 Walking ones (32-bit): 0 errors (2435 ms)
Random blocks: 0 errors (282 ms)
Memtest86 Modulo-20: 0 errors (5045 ms)
Logic (one iteration): 0 errors (79 ms)
Logic (4 iterations): 0 errors (81 ms)
Logic (shared memory, one iteration): 0 errors (78 ms)
Logic (shared-memory, 4 iterations): 0 errors (81 ms)
Final error count after 5 iterations over 3500 MiB of GPU memory: 0 errors
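For comparing a clean run like the one above against a later failing run, a small script can sum the errors per test. This is only a sketch: it assumes the log has exactly the line format shown above (`<test name>: <N> errors (<T> ms)`), and the names `summarize` and `failing_tests` are made up for illustration. Since this output contains no addresses, it can show which tests fail but not which VRAM chip is at fault.

```python
import re
from collections import defaultdict

# Per-test result lines look like: "Random blocks: 0 errors (288 ms)".
# The "Test iteration N ... so far" and "Final error count" lines lack the
# trailing "(NNN ms)" part, so this pattern skips them.
RESULT_LINE = re.compile(
    r"^\s*(?P<name>.+?):\s+(?P<errors>\d+) errors \((?P<ms>\d+) ms\)\s*$"
)


def summarize(log_text):
    """Return {test name: total error count} summed over all iterations."""
    totals = defaultdict(int)
    for line in log_text.splitlines():
        match = RESULT_LINE.match(line)
        if match:
            totals[match.group("name")] += int(match.group("errors"))
    return dict(totals)


def failing_tests(log_text):
    """Return only the tests that reported at least one error."""
    return {
        name: count
        for name, count in summarize(log_text).items()
        if count > 0
    }
```

Saving the MemtestG80 output to a file (say `memtest.log`, a hypothetical name) and calling `failing_tests(open("memtest.log").read())` would return an empty dict for the run above, since every test reported 0 errors.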
Aaaaannnddd... it's fucked again, this time it seems to be a new issue.
Was playing Assassin's Creed Black Flag and other games for like 8 hours without any problem when it suddenly bricked the system, and since then it's refusing to give display output after the driver loads.
Running the same MemtestG80 command as above this time gave me over 1,000,000 errors, so I assume it's a VRAM failure, which is supported by the blue stripes on nouveau.
Help-wanted: How do you figure out which VRAM chip is faulty?
https://manualzz.com/doc/8550916/samsung-k4g41325fc-hc28
Docs for the VRAM
Blocked by https://github.com/Kreyren/kreyren/issues/92
Was using this in dreamon when I suddenly got kicked out of X11. It stopped working on nvidia and on nouveau, and on llvmpipe I get these green dots all over the screen:
At the time of failure the GPU's fans were powered from an external 12 VDC PSU keeping them at 100%. These fans are not connected to the PCB.
Hypothesis
Sudden death of a VRAM chip
Known issues
VRAM overheating
This GPU had issues with VRAM overheating before, which caused it to fail-safe, as the VRAM is only passively cooled by a 1 mm aluminium plate.
Diagnostics
[CONFIRMED] Conductive dust?
The GPU had more dust than I would like, so it's possible that it was causing a short.
TODO