Closed elodiepaupe closed 2 years ago
Hi, I'm not able to reproduce the problem. Here is my output:
(kraken-env) [kuenzlip@gpu002.yggdrasil corpus_prevot_farine_fr]$ ketos train -t train.txt -e eval.txt -f alto -d cuda -r 0.0001 --normalization NFD B168/*.xml
Building training set [####################################] 2224/2224
Building validation set [####################################] 626/626
[1587.9705] alphabet mismatch: chars in training set only: {'̂', '#', '̈', '̀', '?', '☨'} (not included in accuracy test during training)
Initializing model ✓
stage 1/∞ [####################################] 2224/2224
Accuracy report (1) 0.0000 24173 24173
stage 2/∞ [####################################] 2224/2224
Accuracy report (2) 0.0000 24173 24173
stage 3/∞ [####################################] 2224/2224
Accuracy report (3) 0.3281 24173 16241
Could you run
pip list
with your Python environment loaded, to confirm we are using the same module versions?
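To compare only the packages that matter here, a short script can print their versions instead of the full list. This is a minimal sketch: the package list (kraken, torch, torchvision) is an assumption based on the modules that appear in this thread, and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata  # standard library since Python 3.8


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed list: the packages discussed in this thread.
    report = installed_versions(["kraken", "torch", "torchvision"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```

Running it in both environments and comparing the two outputs should show quickly whether the versions differ.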
This is actually due to an update in the parameters of ketos.
If you run
ketos train -t train.txt -e eval.txt -f alto -d cuda:0 -r 0.0001 --normalization NFD B168/*.xml
(notice the :0 after cuda), it should start, but it then ends with the "DataLoader killed" error.
Could you try downgrading kraken to 3.0.4 (the version I managed to get working) with
pip install kraken==3.0.4
and try again with your original command?
I think we should update the tutorial to pin module versions explicitly, to avoid this kind of situation (unidentified bugs or feature changes in new versions of Python modules).
So I installed kraken 3.0.4 and tried it with my original command:
(kraken-env) [paupeel1@login1.yggdrasil corpus_prevot_farine_fr]$ kraken --version
kraken, version 3.0.4
(kraken-env) [paupeel1@login1.yggdrasil corpus_prevot_farine_fr]$ sbatch
Submitted batch job 10263649
(kraken-env) [paupeel1@login1.yggdrasil corpus_prevot_farine_fr]$ nano kraken-10263649.out
KETOS training
Traceback (most recent call last):
File "/home/users/p/paupeel1/kraken-env/bin/ketos", line 8, in <module>
File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/click/", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/click/", line 1053, in main
rv = self.invoke(ctx)
File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/click/", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/click/", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/click/", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/click/", line 26, in new_func
return f(get_current_context(), *args, **kwargs)
File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/kraken/", line 388, in train
from kraken.lib.train import KrakenTrainer
File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/kraken/lib/", line 36, in <module>
from kraken.lib.dataset import BaselineSet, GroundTruthDataset, PolygonGTDataset, generate_input_transforms, preparse_xml_data, Infi$
File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/kraken/lib/", line 29, in <module>
import torchvision.transforms.functional as tf
File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/torchvision/", line 6, in <module>
from torchvision import models
File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/torchvision/models/", line 8, in <module>
from .mobilenet import *
File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/torchvision/models/", line 1, in <module>
from .mobilenetv2 import MobileNetV2, mobilenet_v2, __all__ as mv2_all
File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/torchvision/models/", line 8, in <module>
from ..ops.misc import ConvNormActivation
File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/torchvision/ops/", line 12, in <module>
from .stochastic_depth import stochastic_depth, StochasticDepth
File "/home/users/p/paupeel1/kraken-env/lib/python3.8/site-packages/torchvision/ops/", line 2, in <module>
import torch.fx
ModuleNotFoundError: No module named 'torch.fx'
srun: error: gpu008: task 0: Exited with exit code 1
During the installation of kraken, torch 1.7.0 was installed and torch 1.10.2 was uninstalled. Is that the (new) problem?
Yes. The traceback shows torchvision importing torch.fx, which the torch 1.7.0 pulled in by kraken 3.0.4 does not provide. Python module versions can be tricky to manage, and it is easy to end up with incompatible versions. I would now advise using a tool to pin the versions of Python packages (that is, to explicitly set the version of a package and of all its dependencies).
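One way to confirm this kind of mismatch without launching a training job is to ask Python whether the submodule from the traceback can be found at all. The following is only a sketch; `has_module` is a hypothetical helper name, and it only checks that the module can be located, not that the installed versions are otherwise compatible.

```python
import importlib.util


def has_module(dotted_name):
    """Return True if `dotted_name` can be located in the current environment."""
    try:
        # find_spec imports the parent package first, so a missing parent
        # raises ModuleNotFoundError (a subclass of ImportError).
        return importlib.util.find_spec(dotted_name) is not None
    except ImportError:
        return False


if __name__ == "__main__":
    # torch.fx is the module named in the ModuleNotFoundError above.
    print("torch.fx available:", has_module("torch.fx"))
```

If this prints False while torchvision is installed, the installed torch and torchvision are most likely from incompatible releases.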
Can you try the following:
With your python environment activated, run
pip install pip-tools==6.6.2 pip==22.1.2
pip-tools is the tool we will use to pin package versions.
Then, download requirements.txt and save it to the cluster. It lists all the modules, with the specific versions needed to make kraken 4.1.2 work.
Then, on the cluster, from the same directory as requirements.txt and with the Python environment loaded, run:
It should uninstall the installed modules (so if you manually installed other modules, they will be removed) and install the required ones.
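After installing from requirements.txt, it can be worth checking that the environment really matches the pins before submitting a job. The sketch below assumes the file contains plain `name==version` lines, as pip-tools normally writes them; `parse_pins` and `find_mismatches` are hypothetical helper names, not part of pip-tools.

```python
from importlib import metadata  # standard library since Python 3.8


def parse_pins(text):
    """Return {package: version} for every `name==version` line in `text`."""
    pins = {}
    for raw in text.splitlines():
        # Drop trailing comments and environment markers.
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        # Skip blanks, pip options (e.g. --index-url) and unpinned lines.
        if not line or line.startswith("-") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        name = name.split("[", 1)[0].strip().lower()  # ignore extras
        pins[name] = version.strip().rstrip("\\").strip()
    return pins


def find_mismatches(pins):
    """Return {package: (pinned, installed_or_None)} for unsatisfied pins."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

For example, `find_mismatches(parse_pins(open("requirements.txt").read()))` returns an empty dictionary when every pinned package is installed at the pinned version. The comparison is a plain string match, so it does not handle version specifiers other than `==`.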
Then, try to run a training task, but add --ntasks=4 to your salloc command, as indicated in #7.
I have a problem when I launch these commands:
I've got this message:
I tried
ketos train -t train.txt -e eval.txt -f alto -r 0.0001 --normalization NFD B168/*.xml
but I have another issue: RuntimeError: DataLoader worker (pid 106408) is killed by signal: Killed.
Is there a solution?