Digital-Defiance / nlp-metaformer

An ablation study on the transformer network for Natural Language Processing

docker pull rocm/pytorch:latest taking too long #7

Closed RuiFilipeCampos closed 8 months ago

RuiFilipeCampos commented 8 months ago

https://hub.docker.com/layers/rocm/pytorch/latest-base/images/sha256-13742ca73a5e679a096e098dbfab86a4e720ba6d13f9e7c696aa7b2bef885664?context=explore

RuiFilipeCampos commented 8 months ago

latest pull took 11 minutes

RuiFilipeCampos commented 8 months ago

I'm building the image and publishing it to the GitHub registry, unsure if it will help

the actual solution is to have an AMI with the image already downloaded, but the AMI copy keeps getting corrupted

maybe it's possible to have a caching strategy by not deleting the root volumes

RuiFilipeCampos commented 8 months ago

publishing the image to the AWS registry will likely help

issue would be solved if I can somehow get the image to be in the same subnet as the instances

RuiFilipeCampos commented 8 months ago

setting up my own registry on a server in the same subnet looks easy

gonna check if the GH registry improves the download speed, if that fails I think I'll go for the subnet solution

RuiFilipeCampos commented 8 months ago
```
 #4 ERROR: failed to register layer: write /var/lib/jenkins/pytorch/dist/torch-2.2.0a0+gitd925d94-cp39-cp39-linux_x86_64.whl: no space left on device
------
 > [1/2] FROM docker.io/rocm/pytorch:latest@sha256:cfc5bfe46ad5d487ef9a928f50d1f2ff0941b724a6978f6d6350d13ce2c6ca88:
------
Dockerfile:1
--------------------
   1 | >>> FROM rocm/pytorch:latest
   2 |     
   3 |     
--------------------
ERROR: failed to solve: failed to register layer: write /var/lib/jenkins/pytorch/dist/torch-2.2.0a0+gitd925d94-cp39-cp39-linux_x86_64.whl: no space left on device
Error: buildx failed with: ERROR: failed to solve: failed to register layer: write /var/lib/jenkins/pytorch/dist/torch-2.2.0a0+gitd925d94-cp39-cp39-linux_x86_64.whl: no space left on device
```

this can't be right, the root volume has 70 GB
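
One thing worth ruling out is that the layers are written to a filesystem other than the 70 GB root volume, or that the volume is already partly used; layers are also larger once extracted than when downloaded. A minimal sketch for checking the free space that Docker actually has, assuming the default Linux data root `/var/lib/docker` (the runner may be configured differently, and `free_gib` / `has_room` are illustrative names):

```python
import os
import shutil

# Assumption: Docker's default data root on Linux; a runner may use another path.
DOCKER_DATA_ROOT = "/var/lib/docker"


def free_gib(path: str) -> float:
    """Return the free space in GiB on the filesystem that holds `path`."""
    # Walk up to the nearest existing directory so the check also works
    # on machines where the data root has not been created yet.
    probe = os.path.abspath(path)
    while not os.path.exists(probe):
        probe = os.path.dirname(probe)
    return shutil.disk_usage(probe).free / 2**30


def has_room(path: str, needed_gib: float) -> bool:
    """True if the filesystem holding `path` has at least `needed_gib` GiB free."""
    return free_gib(path) >= needed_gib


if __name__ == "__main__":
    print(f"{free_gib(DOCKER_DATA_ROOT):.1f} GiB free under {DOCKER_DATA_ROOT}")
```

Running this on the runner right before the pull would show whether the 70 GB is really available to the layer store.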

RuiFilipeCampos commented 8 months ago

download speed might not be the issue here

RuiFilipeCampos commented 8 months ago

I don't think it's reasonable that an image takes up so much space

will check for alternatives if issue persists

RuiFilipeCampos commented 8 months ago

download speed might not be the issue here

#4 extracting sha256:245963fcf05d7435a26d4d69c54b8cd5ec476c8bef7cc880e23a30a9870f9d06 67.3s done
#4 extracting sha256:f29924877b521710495b2d2600ab76dd4235eabdba8734731bb2a7214df49074
#4 extracting sha256:f29924877b521710495b2d2600ab76dd4235eabdba8734731bb2a7214df49074 done
#4 extracting sha256:f75f96015efd355e3028bcf4d00d7c20166cfdddb32997578c34a97b400239b6 done
#4 extracting sha256:4678caf5c6d0b04a9fa38c9b0bbc5b1d0951e2062f12f65b157d9fe6d31ec8b0 done
#4 extracting sha256:7b6837d5cec5dff110d9faa9231d204c9e3f7486f6ccb24a5374540877b199b2 0.1s
#4 extracting sha256:7b6837d5cec5dff110d9faa9231d204c9e3f7486f6ccb24a5374540877b199b2 3.5s done
#4 extracting sha256:93aeb7eb3ca73a8e9aad1130c4819fd7ac2324094806897b77e4859aca2f094c
#4 extracting sha256:93aeb7eb3ca73a8e9aad1130c4819fd7ac2324094806897b77e4859aca2f094c done
#4 extracting sha256:c2f5bdf8a66ccbbd3cbbd0f50577e404c1d70bcc987bc3a1cc450881d73ff4ee
#4 extracting sha256:c2f5bdf8a66ccbbd3cbbd0f50577e404c1d70bcc987bc3a1cc450881d73ff4ee done
#4 extracting sha256:15ea953907ba31b19327b0499373a71b03daba39f5c282d094eb25a425f4e360 0.0s done
#4 extracting sha256:98437ba9df8dec71e9aa154e976923e7d2564f891fb42382612300e99602fe7a
#4 extracting sha256:98437ba9df8dec71e9aa154e976923e7d2564f891fb42382612300e99602fe7a done
#4 extracting sha256:cfd6d8d012a2c7bd47cd799bd6458e2185a6304b27b9c5a4ed360314f666f5ba done
#4 extracting sha256:8245303e4b3e99c8275209ec96b3677404ae15b5d410d3d77b78a377363e53b9
#4 extracting sha256:8245303e4b3e99c8275209ec96b3677404ae15b5d410d3d77b78a377363e53b9 5.0s
#4 extracting sha256:8245303e4b3e99c8275209ec96b3677404ae15b5d410d3d77b78a377363e53b9 10.1s
#4 extracting sha256:8245303e4b3e99c8275209ec96b3677404ae15b5d410d3d77b78a377363e53b9 12.1s done
#4 extracting sha256:a200d53aa8e523fc6d57423a5b5f69b579335aca1ec2c42e954b6e6006b3b467 done
#4 extracting sha256:ae88aa8af19087516538b2cda8b7fbfb5d5a008af88fb0e23de576f9344439cd done
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 5.1s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 10.1s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 15.2s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 20.2s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 25.3s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 30.4s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 35.5s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 40.6s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 45.6s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 50.6s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 55.7s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 60.8s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 65.9s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 71.0s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 76.1s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 81.2s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 86.2s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 91.3s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 96.4s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 101.4s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 106.5s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 111.5s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 116.6s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 121.6s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 126.7s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 131.7s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 136.8s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 141.8s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 146.8s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 151.9s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 156.9s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 162.0s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 167.2s
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 172.2s

RuiFilipeCampos commented 8 months ago

there's no other way: since extraction is the problem, the image must be pre-pulled on the machine
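
To see which layers dominate, the extraction lines from the build log can be summarized per layer. A rough sketch, assuming the progress lines keep the `extracting sha256:<digest> <seconds>s` shape seen in the log in this thread (`slowest_layers` and `SAMPLE` are illustrative names; the sample lines are copied from that log):

```python
import re
from typing import Dict, List, Tuple

# Matches progress lines such as
# "#4 extracting sha256:<digest> 12.1s done"; the time is optional.
LINE = re.compile(r"extracting sha256:([0-9a-f]+)(?: (\d+(?:\.\d+)?)s)?")


def slowest_layers(log: str) -> List[Tuple[str, float]]:
    """Return (short digest, last reported seconds) per layer, slowest first."""
    seen: Dict[str, float] = {}
    for line in log.splitlines():
        match = LINE.search(line)
        if match and match.group(2):
            digest = match.group(1)[:12]
            # Progress lines repeat with a growing time; keep the largest.
            seen[digest] = max(seen.get(digest, 0.0), float(match.group(2)))
    return sorted(seen.items(), key=lambda item: item[1], reverse=True)


SAMPLE = """\
#4 extracting sha256:245963fcf05d7435a26d4d69c54b8cd5ec476c8bef7cc880e23a30a9870f9d06 67.3s done
#4 extracting sha256:8245303e4b3e99c8275209ec96b3677404ae15b5d410d3d77b78a377363e53b9 12.1s done
#4 extracting sha256:ed1acbc46f62d320329118b7b36448b86685407cbc5c1da971929fc3012edb81 172.2s
"""

if __name__ == "__main__":
    for digest, seconds in slowest_layers(SAMPLE):
        print(f"{digest}  {seconds:6.1f}s")
```

In the log above, a single layer (`ed1acbc46f62`) was still extracting after 172 s, which is consistent with extraction rather than download being the bottleneck.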

RuiFilipeCampos commented 8 months ago

found my solution https://hub.docker.com/r/rocm/rocm-terminal

will likely need to install Python etc.

RuiFilipeCampos commented 8 months ago

this comes with Python pre-installed

gonna give it a go directly with the runner; it has to be tested this way because I don't have a local AMD GPU

RuiFilipeCampos commented 8 months ago

part of the issue is that PyTorch is large

I got the pull time down to 8 min

AMI with image pre-pulled will be the final solution

currently downgrading the project to Python 3.8

RuiFilipeCampos commented 8 months ago

Run python3 -m train.worker_sentiment_analysis
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/__w/llm-voice-chat/llm-voice-chat/train/worker_sentiment_analysis.py", line 1, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'
Error: Process completed with exit code 1.

unsure how this is possible: torch was clearly installed during the build, and the image size matches that

RuiFilipeCampos commented 8 months ago

most likely cause is a user switch: the build used the image's default user while the run used root
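
If pip installed the packages into the build user's per-user site directory, root would not see them, because each user has its own per-user directory. A small diagnostic sketch to run once as each user and compare (`import_report` is an illustrative name, not something from the project):

```python
import importlib.util
import os
import site
import sys


def import_report(module: str) -> dict:
    """Collect the facts that decide whether `module` is importable here."""
    spec = importlib.util.find_spec(module)
    return {
        "home": os.path.expanduser("~"),
        "python": sys.executable,
        # Per-user directory that `pip install --user` writes to.
        "user_site": site.getusersitepackages(),
        "user_site_enabled": bool(site.ENABLE_USER_SITE),
        # None means the module is not importable for this user.
        "found_at": spec.origin if spec else None,
    }


if __name__ == "__main__":
    for key, value in import_report("torch").items():
        print(f"{key:18} {value}")
```

If `user_site` differs between the build user and root and `found_at` is only set for the build user, that would support the user-switch explanation.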

RuiFilipeCampos commented 8 months ago

there's also this

#5 175.1   WARNING: The script sqlformat is installed in '/home/rocm-user/.local/bin' which is not on PATH.
#5 175.1   Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
#5 175.2   WARNING: The script wsdump is installed in '/home/rocm-user/.local/bin' which is not on PATH.
#5 175.2   Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
#5 178.3   WARNING: The script mlflow is installed in '/home/rocm-user/.local/bin' which is not on PATH.
#5 178.3   Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
#5 178.6   WARNING: The script dotenv is installed in '/home/rocm-user/.local/bin' which is not on PATH.

RuiFilipeCampos commented 8 months ago

2024-01-31 14:20:13,632 [INFO] Using device cpu
2024-01-31 14:20:13,632 [INFO] Using torch version 2.1.2+cu121
2024-01-31 14:20:13,632 [INFO] Using mlflow version 2.9.2
2024/01/31 14:20:14 WARNING mlflow.system_metrics.system_metrics_monitor: Skip logging GPU metrics because creating `GPUMonitor` failed with error: Failed to initialize NVML, skip logging GPU metrics: NVML Shared Library Not Found.
2024/01/31 14:20:14 INFO mlflow.system_metrics.system_metrics_monitor: Started monitoring system metrics.
2024-01-31 14:20:14,005 [INFO] Connected to MLFlow and started run.
2024-01-31 14:20:14,029 [INFO] Created model
2024-01-31 14:20:15,946 [INFO] Saved model info to MLFLow
2024-01-31 14:20:19,049 [INFO] Saved training info to MLFLow
2024-01-31 14:20:19,600 [INFO] Starting training loop
(1):   0%|          | 0/127 [00:00<?, ?it/s]
(1):   0%|          | 0/127 [00:05<?, ?it/s, train=11, lr=1.49e-03]
(1):   1%|          | 1/127 [00:05<11:05,  5.28s/it, train=11, lr=1.49e-03]
(1):   1%|          | 1/127 [00:10<11:05,  5.28s/it, train=11, lr=1.49e-03]
(1):   2%|▏         | 2/127 [00:10<10:56,  5.25s/it, train=11, lr=1.49e-03]

first ML training loop is running

now comes the issue of ensuring that PyTorch sees and uses the AMD GPU
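
Note that the log above reports `torch 2.1.2+cu121` and `device cpu`; the `+cu121` suffix marks a CUDA 12.1 wheel, so this particular install is probably not a ROCm build. A minimal check, kept runnable without torch by putting the logic in a plain function (`describe_build` is an illustrative name):

```python
from typing import Optional


def describe_build(torch_version: str, hip_version: Optional[str], gpu_visible: bool) -> str:
    """Summarize whether a torch install can use an AMD GPU."""
    if hip_version is None:
        return f"torch {torch_version} is not a ROCm build, so it cannot use an AMD GPU"
    if not gpu_visible:
        return f"torch {torch_version} (HIP {hip_version}) is a ROCm build but sees no GPU"
    return f"torch {torch_version} (HIP {hip_version}) sees a GPU"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed for this interpreter")
    else:
        # torch.version.hip is None unless the wheel was built for ROCm;
        # ROCm builds expose the GPU through the torch.cuda API.
        print(describe_build(torch.__version__, torch.version.hip, torch.cuda.is_available()))
```

Run inside the container, the first message would mean the wheel has to be replaced with a ROCm one; the second would point at how the GPU is exposed to the container instead.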

RuiFilipeCampos commented 8 months ago

I'm giving up on making my own image, what's the point if I'm going to create an AMI anyway

RuiFilipeCampos commented 8 months ago

ami-0fdb18af51d48e8f3

image pre-pulled, 100 GB, everything was tested, including in-container PyTorch access to the GPU

RuiFilipeCampos commented 8 months ago

AMI copy came out corrupted

I'm out of options

RuiFilipeCampos commented 8 months ago

I'm gonna have to park this for now

RuiFilipeCampos commented 8 months ago

this issue is just blocked by https://github.com/Digital-Defiance/llm-voice-chat/issues/6

everything else is resolved so I'm closing this one