ValueError: invalid literal for int() with base 10: 'CUDA'

elodiepaupe commented 2 years ago

Hello,

I have a problem when I launch these commands:

[paupeel1@gpu007.yggdrasil corpus_prevot_farine_fr]$ source ~/kraken-env/bin/activate
(kraken-env) [paupeel1@gpu007.yggdrasil corpus_prevot_farine_fr]$ salloc --partition=shared-gpu --time=01:00:00 --gpus=1
salloc: Granted job allocation 10231643
salloc: Waiting for resource configuration
salloc: Nodes gpu007 are ready for job
[paupeel1@gpu007.yggdrasil corpus_prevot_farine_fr]$ source ~/kraken-env/bin/activate
(kraken-env) [paupeel1@gpu007.yggdrasil corpus_prevot_farine_fr]$ ketos train -t train.txt -e eval.txt -f alto -d cuda -r 0.0001 --normalization NFD B168/*.xml

I've got this message:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│                                                                                                  │
│ /home/users/p/paupeel1/kraken-env/bin/ketos:8 in <module>                                        │
│                                                                                                  │
│   5 from kraken.ketos import cli                                                                 │
│   6 if __name__ == '__main__':                                                                   │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])                         │
│ ❱ 8 │   sys.exit(cli())                                                                          │
│   9                                                                                              │
│ /home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/click/core.py:1128 in __call__     │
│                                                                                                  │
│ /home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/click/core.py:1053 in main         │
│                                                                                                  │
│ /home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/click/core.py:1659 in invoke       │
│                                                                                                  │
│ /home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/click/core.py:1395 in invoke       │
│                                                                                                  │
│ /home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/click/core.py:754 in invoke        │
│                                                                                                  │
│ /home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/click/decorators.py:26 in new_func │
│                                                                                                  │
│ /home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/kraken/ketos.py:598 in train       │
│                                                                                                  │
│    595 │   │   │   │   │   │   │    codec=codec,                                                 │
│    596 │   │   │   │   │   │   │    resize=resize)                                               │
│    597 │                                                                                         │
│ ❱  598 │   trainer = KrakenTrainer(gpus=device,                                                  │
│    599 │   │   │   │   │   │   │   max_epochs=hyper_params['epochs'] if hyper_params['quit'] ==  │
│    600 │   │   │   │   │   │   │   min_epochs=hyper_params['min_epochs'],                        │
│    601 │   │   │   │   │   │   │   enable_progress_bar=True if not ctx.meta['verbose'] else Fal  │
│                                                                                                  │
│ /home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/kraken/lib/train.py:89 in __init__ │
│                                                                                                  │
│    86 │   │   else:                                                                              │
│    87 │   │   │   kwargs['enable_model_summary'] = False                                         │
│    88 │   │                                                                                      │
│ ❱  89 │   │   super().__init__(*args, **kwargs)                                                  │
│    90 │                                                                                          │
│    91 │   def on_validation_end(self):                                                           │
│    92 │   │   if not self.sanity_checking:                                                       │
│                                                                                                  │
│ /home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/pytorch_lightning/trainer/connecto │
│ rs/env_vars_connector.py:38 in insert_env_defaults                                               │
│                                                                                                  │
│   35 │   │   kwargs = dict(list(env_variables.items()) + list(kwargs.items()))                   │
│   36 │   │                                                                                       │
│   37 │   │   # all args were already moved to kwargs                                             │
│ ❱ 38 │   │   return fn(self, **kwargs)                                                           │
│   39 │                                                                                           │
│   40 │   return insert_env_defaults                                                              │
│   41                                                                                             │
│                                                                                                  │
│ /home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer. │
│ py:426 in __init__                                                                               │
│                                                                                                  │
│    423 │   │   Trainer._log_api_event("init")                                                    │
│    424 │   │   self.state = TrainerState()                                                       │
│    425 │   │                                                                                     │
│ ❱  426 │   │   gpu_ids, tpu_cores = self._parse_devices(gpus, auto_select_gpus, tpu_cores)       │
│    427 │   │                                                                                     │
│    428 │   │   # init connectors                                                                 │
│    429 │   │   self._data_connector = DataConnector(self, multiple_trainloader_mode)             │
│                                                                                                  │
│ /home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer. │
│ py:1543 in _parse_devices                                                                        │
│                                                                                                  │
│   1540 │   │   │   gpus = pick_multiple_gpus(gpus)                                               │
│   1541 │   │                                                                                     │
│   1542 │   │   # TODO (@seannaren, @kaushikb11): Include IPU parsing logic here                  │
│ ❱ 1543 │   │   gpu_ids = device_parser.parse_gpu_ids(gpus)                                       │
│   1544 │   │   tpu_cores = device_parser.parse_tpu_cores(tpu_cores)                              │
│   1545 │   │   return gpu_ids, tpu_cores                                                         │
│   1546                                                                                           │
│                                                                                                  │
│ /home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/pytorch_lightning/utilities/device │
│ _parser.py:78 in parse_gpu_ids                                                                   │
│                                                                                                  │
│    75 │                                                                                          │
│    76 │   # We know user requested GPUs therefore if some of the                                 │
│    77 │   # requested GPUs are not available an exception is thrown.                             │
│ ❱  78 │   gpus = _normalize_parse_gpu_string_input(gpus)                                         │
│    79 │   gpus = _normalize_parse_gpu_input_to_list(gpus)                                        │
│    80 │   if not gpus:                                                                           │
│    81 │   │   raise MisconfigurationException("GPUs requested but none are available.")          │
│                                                                                                  │
│ /home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/pytorch_lightning/utilities/device │
│ _parser.py:131 in _normalize_parse_gpu_string_input                                              │
│                                                                                                  │
│   128 │   │   return -1                                                                          │
│   129 │   if "," in s:                                                                           │
│   130 │   │   return [int(x.strip()) for x in s.split(",") if len(x) > 0]                        │
│ ❱ 131 │   return int(s.strip())                                                                  │
│   132                                                                                            │
│   133                                                                                            │
│   134 def _sanitize_gpu_ids(gpus: List[int]) -> List[int]:                                       │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: invalid literal for int() with base 10: 'CUDA'

I also tried ketos train -t train.txt -e eval.txt -f alto -r 0.0001 --normalization NFD B168/*.xml (without -d cuda), but then I get a different error: RuntimeError: DataLoader worker (pid 106408) is killed by signal: Killed.

Is there a solution?

pkzli commented 2 years ago

Hi, I'm not able to reproduce the problem. Here is my output:

(kraken-env) [kuenzlip@gpu002.yggdrasil corpus_prevot_farine_fr]$ ketos train -t train.txt -e eval.txt -f alto -d cuda -r 0.0001 --normalization NFD B168/*.xml
Building training set  [####################################]  2224/2224          
Building validation set  [####################################]  626/626          [1587.9705] alphabet mismatch: chars in training set only: {'̂', '#', '̈', '̀', '?', '☨'} (not included in accuracy test during training) 
Initializing model ✓
stage 1/∞  [####################################]  2224/2224          Accuracy report (1) 0.0000 24173 24173
stage 2/∞  [####################################]  2224/2224          Accuracy report (2) 0.0000 24173 24173
stage 3/∞  [####################################]  2224/2224          Accuracy report (3) 0.3281 24173 16241

Could you run

pip list

with your Python environment loaded, to confirm we are using the same module versions?
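
To compare only the packages that matter here rather than the whole list, something like the following could be run in each environment. It is a minimal sketch using the standard-library importlib.metadata (Python 3.8+); the shortlist in KEY_PACKAGES is my assumption about which packages are relevant, and installed_versions is a name made up for this example.

```python
from importlib import metadata

# Assumed shortlist of packages relevant to this issue; adjust as needed.
KEY_PACKAGES = ["kraken", "torch", "torchvision", "pytorch-lightning", "click"]


def installed_versions(names):
    """Return {name: version}, with None for packages that are not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(KEY_PACKAGES).items():
        print(f"{name:20} {version or 'not installed'}")
```

Pasting the few lines it prints from both environments makes a version mismatch easier to spot than scanning two full pip list outputs.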

elodiepaupe commented 2 years ago

Package                       Version
----------------------------- -----------
absl-py                       1.0.0
aiohttp                       3.8.1
aiosignal                     1.2.0
alabaster                     0.7.12
appdirs                       1.4.4
asn1crypto                    1.4.0
async-timeout                 4.0.2
atomicwrites                  1.4.0
attrs                         20.2.0
Babel                         2.8.0
bcrypt                        3.2.0
bitstring                     3.1.7
blist                         1.3.6
CacheControl                  0.12.6
cachetools                    5.0.0
cachy                         0.3.0
certifi                       2020.6.20
cffi                          1.14.3
chardet                       3.0.4
charset-normalizer            2.0.12
cleo                          0.8.1
click                         8.1.2
clikit                        0.6.2
colorama                      0.4.3
commonmark                    0.9.1
coremltools                   5.2.0
crashtest                     0.3.1
cryptography                  3.1.1
Cython                        0.29.21
decorator                     4.4.2
distlib                       0.3.1
docopt                        0.6.2
docutils                      0.16
ecdsa                         0.16.0
filelock                      3.0.12
flit                          3.0.0
flit-core                     3.0.0
frozenlist                    1.3.0
fsspec                        2022.3.0
future                        0.18.2
google-auth                   2.6.6
google-auth-oauthlib          0.4.6
grpcio                        1.44.0
html5lib                      1.1
idna                          2.10
imageio                       2.18.0
imagesize                     1.2.0
importlib-metadata            4.11.3
iniconfig                     1.0.1
intervaltree                  3.1.0
intreehooks                   1.0
ipaddress                     1.0.23
jeepney                       0.4.3
Jinja2                        2.11.2
joblib                        0.17.0
jsonschema                    3.2.0
keyring                       21.4.0
keyrings.alt                  4.0.0
kraken                        4.1.2
liac-arff                     2.5.0
lockfile                      0.12.2
lxml                          4.8.0
Markdown                      3.3.6
MarkupSafe                    1.1.1
mock                          4.0.2
more-itertools                8.5.0
mpmath                        1.2.1
msgpack                       1.0.0
multidict                     6.0.2
netaddr                       0.8.0
netifaces                     0.10.9
networkx                      2.8
nose                          1.3.7
numpy                         1.22.3
oauthlib                      3.2.0
packaging                     20.4
paramiko                      2.7.2
pastel                        0.2.1
pathlib2                      2.3.5
paycheck                      1.0.2
pbr                           5.5.0
pexpect                       4.8.0
Pillow                        9.1.0
pip                           20.2.3
pkginfo                       1.5.0.1
pluggy                        0.13.1
poetry                        1.1.3
poetry-core                   1.0.0
protobuf                      3.20.1
psutil                        5.7.2
ptyprocess                    0.6.0
py                            1.9.0
py-expression-eval            0.3.10
pyarrow                       7.0.0
pyasn1                        0.4.8
pyasn1-modules                0.2.8
pycparser                     2.20
pycrypto                      2.6.1
pyDeprecate                   0.3.2
Pygments                      2.7.1
pylev                         1.3.0
PyNaCl                        1.4.0
pyparsing                     2.4.7
pyrsistent                    0.17.3
pytest                        6.1.1
python-bidi                   0.4.2
python-dateutil               2.8.1
pytoml                        0.1.21
pytorch-lightning             1.6.1
pytz                          2020.1
PyWavelets                    1.3.0
PyYAML                        6.0
regex                         2020.10.11
requests                      2.24.0
requests-oauthlib             1.3.1
requests-toolbelt             0.9.1
rich                          12.2.0
rsa                           4.8
scandir                       1.10.0
scikit-image                  0.19.2
scipy                         1.8.0
SecretStorage                 3.1.2
setuptools                    50.3.0
setuptools-scm                4.1.2
Shapely                       1.8.1.post1
shellingham                   1.3.2
simplegeneric                 0.8.1
simplejson                    3.17.2
six                           1.15.0
snowballstemmer               2.0.0
sortedcontainers              2.2.2
Sphinx                        3.2.1
sphinx-bootstrap-theme        0.7.1
sphinxcontrib-applehelp       1.0.2
sphinxcontrib-devhelp         1.0.2
sphinxcontrib-htmlhelp        1.0.3
sphinxcontrib-jsmath          1.0.1
sphinxcontrib-qthelp          1.0.3
sphinxcontrib-serializinghtml 1.1.4
sphinxcontrib-websupport      1.2.4
sympy                         1.10.1
tabulate                      0.8.7
tensorboard                   2.8.0
tensorboard-data-server       0.6.1
tensorboard-plugin-wit        1.8.1
threadpoolctl                 2.1.0
tifffile                      2022.4.22
toml                          0.10.1
tomlkit                       0.7.0
torch                         1.11.0
torchmetrics                  0.8.0
torchvision                   0.12.0
tqdm                          4.64.0
typing-extensions             4.2.0
ujson                         4.0.1
urllib3                       1.25.10
virtualenv                    20.0.34
wcwidth                       0.2.5
webencodings                  0.5.1
Werkzeug                      2.1.1
wheel                         0.35.1
xlrd                          1.2.0
yarl                          1.7.2
zipp                          3.3.0

pkzli commented 2 years ago

This is actually due to an update in the parameters of ketos.

If you run

ketos train -t train.txt -e eval.txt -f alto -d cuda:0 -r 0.0001 --normalization NFD B168/*.xml

(notice the :0 after cuda) it should start... but then ends in the "dataloader killed" error ...
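
For context, the traceback shows that the device string ends up in Lightning's GPU parser, which only accepts an integer or a comma-separated list of integers. The sketch below re-creates the three lines visible at device_parser.py:129-131 to show why a bare device name fails. parse_gpu_string is a simplified copy for illustration, and split_device is a hypothetical helper, not kraken's actual parsing code.

```python
def parse_gpu_string(s):
    """Simplified copy of the logic shown in the traceback (lines 129-131)."""
    if "," in s:
        return [int(x.strip()) for x in s.split(",") if len(x) > 0]
    return int(s.strip())  # int("CUDA") raises ValueError here


def split_device(device):
    """Hypothetical helper: split 'cuda:0' into a name and an integer index."""
    name, _, index = device.partition(":")
    return name.lower(), (int(index) if index else None)


print(parse_gpu_string("0"))    # a single GPU index parses fine
print(parse_gpu_string("0,1"))  # so does a comma-separated list
print(split_device("cuda:0"))   # name and index separated before any int() call

try:
    parse_gpu_string("CUDA")
except ValueError as err:
    print(err)  # same message as at the end of the traceback
```

The point is only that the index has to be separated from the device name before anything calls int() on it, which is what writing cuda:0 makes possible.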

Could you try downgrading kraken to 3.0.4 (the version I managed to get working) with

pip install kraken==3.0.4

and try again (with your original command)?

I think we should update the tutorial to pin module versions explicitly, to avoid this kind of situation (unidentified bugs or feature changes in new versions of Python modules).

elodiepaupe commented 2 years ago

So I installed kraken 3.0.4 and tried again with my original command:

(kraken-env) [paupeel1@login1.yggdrasil corpus_prevot_farine_fr]$ kraken --version
kraken, version 3.0.4
(kraken-env) [paupeel1@login1.yggdrasil corpus_prevot_farine_fr]$ sbatch submission_scripts.sh
Submitted batch job 10263649
(kraken-env) [paupeel1@login1.yggdrasil corpus_prevot_farine_fr]$ nano kraken-10263649.out

gpu008
KETOS training
Traceback (most recent call last):
  File "/home/users/p/paupeel1/kraken-env/bin/ketos", line 8, in <module>
    sys.exit(cli())
  File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/kraken/ketos.py", line 388, in train
    from kraken.lib.train import KrakenTrainer
  File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/kraken/lib/train.py", line 36, in <module>
    from kraken.lib.dataset import BaselineSet, GroundTruthDataset, PolygonGTDataset, generate_input_transforms, preparse_xml_data, Infi$
  File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/kraken/lib/dataset.py", line 29, in <module>
    import torchvision.transforms.functional as tf
  File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/torchvision/__init__.py", line 6, in <module>
    from torchvision import models
  File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/torchvision/models/__init__.py", line 8, in <module>
    from .mobilenet import *
  File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/torchvision/models/mobilenet.py", line 1, in <module>
    from .mobilenetv2 import MobileNetV2, mobilenet_v2, __all__ as mv2_all
  File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/torchvision/models/mobilenetv2.py", line 8, in <module>
    from ..ops.misc import ConvNormActivation
  File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/torchvision/ops/__init__.py", line 12, in <module>
    from .stochastic_depth import stochastic_depth, StochasticDepth
  File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/torchvision/ops/stochastic_depth.py", line 2, in <module>
    import torch.fx
ModuleNotFoundError: No module named 'torch.fx'
srun: error: gpu008: task 0: Exited with exit code 1

During the installation of kraken, torch 1.7.0 was installed and torch 1.10.2 was uninstalled. Is that the (new) problem?

pkzli commented 2 years ago

Yes, Python module versions can be tricky to manage, and it is easy to end up with incompatible versions. I would now advise using a tool to pin package versions (that is, to explicitly set the version of a package and of all its dependencies).
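
On the torch.fx error specifically: torch.fx was only added in PyTorch 1.8, so a torchvision that imports it cannot work with torch 1.7.0. A quick way to check for a submodule before submitting a job is sketched below, using only the standard library. has_module is a name made up for this example.

```python
import importlib.util


def has_module(name):
    """Return True if `name` (e.g. 'torch.fx') can be found in this environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. 'torch' itself) is missing.
        return False


if __name__ == "__main__":
    for name in ("torch", "torch.fx", "torchvision"):
        print(f"{name:12} {'found' if has_module(name) else 'MISSING'}")
```

Note that looking up a dotted name imports the parent package, so checking torch.fx will import torch as a side effect.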

Can you try the following:

With your python environment activated, run

pip install pip-tools==6.6.2 pip==22.1.2

pip-tools is the tool we will be using to pin package versions.

Then, download and save

requirements.txt

to the cluster. requirements.txt lists all the modules, with the specific versions needed to make kraken 4.1.2 work.

Then, on the cluster, from the same directory as requirements.txt and with the Python environment loaded, run:

pip-sync

It should uninstall any installed module that is not listed in requirements.txt (so if you manually installed other modules, they will be removed) and install the required modules.
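
If you want to see beforehand which of your installed packages are not listed in requirements.txt, a rough preview could look like the sketch below. It assumes the file uses plain name==version lines (comments and option lines such as --hash are skipped), and the function names are made up for this example. As far as I know pip-sync itself keeps a few packages such as pip and pip-tools, which this sketch does not model.

```python
import re
from importlib import metadata


def normalize(name):
    """Normalize a project name the way PEP 503 describes."""
    return re.sub(r"[-_.]+", "-", name).lower()


def pinned_names(requirements_text):
    """Collect package names from 'name==version' lines; skip comments and options."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.add(normalize(match.group(0)))
    return names


def not_pinned(installed, pinned):
    """Return the installed names that are absent from the pinned set, sorted."""
    return sorted(normalize(n) for n in installed if normalize(n) not in pinned)


def preview(path):
    """List installed distributions that do not appear in the file at `path`."""
    with open(path) as fh:
        pinned = pinned_names(fh.read())
    installed = [d.metadata["Name"] for d in metadata.distributions()]
    return not_pinned([n for n in installed if n], pinned)
```

Calling preview("requirements.txt") from a Python prompt in the activated environment returns the names that are installed but not pinned.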

Then try to run a training task, but add --ntasks=4 to your salloc command, as indicated in #7.
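
To confirm that the allocation actually picked up the extra tasks, the job's environment can be inspected from Python. This is a minimal sketch: SLURM_JOB_ID, SLURM_NTASKS and SLURM_CPUS_PER_TASK are standard Slurm variables, but which of them are set depends on the options passed, and the function names are made up for this example.

```python
import os


def allocation_summary(env=os.environ):
    """Pick out a few Slurm variables; missing ones are reported as None."""
    keys = ("SLURM_JOB_ID", "SLURM_NTASKS", "SLURM_CPUS_PER_TASK")
    return {key: env.get(key) for key in keys}


def usable_cpus():
    """CPUs this process may run on (Linux), else the machine's CPU count."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count()


if __name__ == "__main__":
    print(allocation_summary())
    print("usable CPUs:", usable_cpus())
```

Run inside the salloc session, this prints what the process sees rather than what was requested.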

FoNDUE-HTR / Documentation
