Substra / substrafl

A high-level federated learning Python library used to run complex federated learning experiments at scale on a Substra network
https://docs.substra.org
Apache License 2.0

BUG: Encountered Operation not supported OSError when running MNIST Torch example #233

Open hwpang opened 1 month ago

hwpang commented 1 month ago

What are you trying to do?

I am new to SubstraFL and am currently working through the example at https://docs.substra.org/en/stable/examples/substrafl/get_started/run_mnist_torch.html.

Issue Description (what is happening?)

The notebook failed at the following cell with an OSError.

from substrafl.experiment import execute_experiment
import logging
import substrafl

substrafl.set_logging_level(loglevel=logging.ERROR)
# A round is defined by a local training step followed by an aggregation operation
NUM_ROUNDS = 3

compute_plan = execute_experiment(
    client=clients[ALGO_ORG_ID],
    strategy=strategy,
    train_data_nodes=train_data_nodes,
    evaluation_strategy=my_eval_strategy,
    aggregation_node=aggregation_node,
    num_rounds=NUM_ROUNDS,
    experiment_folder=str(pathlib.Path.cwd() / "tmp" / "experiment_summaries"),
    dependencies=dependencies,
    clean_models=False,
    name="MNIST documentation example",
)

Expected Behavior (what should happen?)

The tutorial should run to completion without raising this error.

Reproducible Example

No response

Operating system

Ubuntu 20.04

Python version

3.11.9

Installed Substra versions

substra==0.53.0
substrafl==0.46.0
substratools==0.21.4

Installed versions of dependencies

# packages in environment at /mnt/batch/tasks/shared/LS_root/mounts/clusters/hpang8/code/Users/hpang/conda_envs/substrafl_env:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
annotated-types           0.7.0                    pypi_0    pypi
anyio                     4.2.0           py311h06a4308_0  
argon2-cffi               21.3.0             pyhd3eb1b0_0  
argon2-cffi-bindings      21.2.0          py311h5eee18b_0  
asttokens                 2.0.5              pyhd3eb1b0_0  
async-lru                 2.0.4           py311h06a4308_0  
attrs                     23.1.0          py311h06a4308_0  
babel                     2.11.0          py311h06a4308_0  
beautifulsoup4            4.12.3          py311h06a4308_0  
bleach                    4.1.0              pyhd3eb1b0_0  
brotli-python             1.0.9           py311h6a678d5_8  
build                     1.2.1                    pypi_0    pypi
bzip2                     1.0.8                h5eee18b_6  
ca-certificates           2024.7.2             h06a4308_0  
certifi                   2024.7.4        py311h06a4308_0  
cffi                      1.16.0          py311h5eee18b_1  
charset-normalizer        3.3.2              pyhd3eb1b0_0  
click                     8.1.7                    pypi_0    pypi
cloudpickle               3.0.0                    pypi_0    pypi
cmake                     3.30.1                   pypi_0    pypi
comm                      0.2.1           py311h06a4308_0  
contourpy                 1.2.1                    pypi_0    pypi
cycler                    0.12.1                   pypi_0    pypi
debugpy                   1.6.7           py311h6a678d5_0  
decorator                 5.1.1              pyhd3eb1b0_0  
defusedxml                0.7.1              pyhd3eb1b0_0  
docker                    7.1.0                    pypi_0    pypi
executing                 0.8.3              pyhd3eb1b0_0  
expat                     2.6.2                h6a678d5_0  
filelock                  3.15.4                   pypi_0    pypi
fonttools                 4.53.1                   pypi_0    pypi
idna                      3.7             py311h06a4308_0  
ipykernel                 6.28.0          py311h06a4308_0  
ipython                   8.25.0          py311h06a4308_0  
jedi                      0.19.1          py311h06a4308_0  
jinja2                    3.1.4           py311h06a4308_0  
joblib                    1.4.2                    pypi_0    pypi
json5                     0.9.6              pyhd3eb1b0_0  
jsonschema                4.19.2          py311h06a4308_0  
jsonschema-specifications 2023.7.1        py311h06a4308_0  
jupyter-lsp               2.2.0           py311h06a4308_0  
jupyter_client            8.6.0           py311h06a4308_0  
jupyter_core              5.7.2           py311h06a4308_0  
jupyter_events            0.10.0          py311h06a4308_0  
jupyter_server            2.14.1          py311h06a4308_0  
jupyter_server_terminals  0.4.4           py311h06a4308_1  
jupyterlab                4.0.11          py311h06a4308_0  
jupyterlab_pygments       0.1.2                      py_0  
jupyterlab_server         2.25.1          py311h06a4308_0  
kiwisolver                1.4.5                    pypi_0    pypi
ld_impl_linux-64          2.38                 h1181459_1  
libffi                    3.4.4                h6a678d5_1  
libgcc-ng                 11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libsodium                 1.0.18               h7b6447c_0  
libstdcxx-ng              11.2.0               h1234567_1  
libuuid                   1.41.5               h5eee18b_0  
lit                       18.1.8                   pypi_0    pypi
markupsafe                2.1.5                    pypi_0    pypi
matplotlib                3.6.3                    pypi_0    pypi
matplotlib-inline         0.1.6           py311h06a4308_0  
mistune                   2.0.4           py311h06a4308_0  
mpmath                    1.3.0                    pypi_0    pypi
nbclient                  0.8.0           py311h06a4308_0  
nbconvert                 7.10.0          py311h06a4308_0  
nbformat                  5.9.2           py311h06a4308_0  
ncurses                   6.4                  h6a678d5_0  
nest-asyncio              1.6.0           py311h06a4308_0  
networkx                  3.3                      pypi_0    pypi
notebook                  7.0.8           py311h06a4308_2  
notebook-shim             0.2.3           py311h06a4308_0  
numpy                     1.24.3                   pypi_0    pypi
nvidia-cublas-cu11        11.10.3.66               pypi_0    pypi
nvidia-cuda-cupti-cu11    11.7.101                 pypi_0    pypi
nvidia-cuda-nvrtc-cu11    11.7.99                  pypi_0    pypi
nvidia-cuda-runtime-cu11  11.7.99                  pypi_0    pypi
nvidia-cudnn-cu11         8.5.0.96                 pypi_0    pypi
nvidia-cufft-cu11         10.9.0.58                pypi_0    pypi
nvidia-curand-cu11        10.2.10.91               pypi_0    pypi
nvidia-cusolver-cu11      11.4.0.1                 pypi_0    pypi
nvidia-cusparse-cu11      11.7.4.91                pypi_0    pypi
nvidia-nccl-cu11          2.14.3                   pypi_0    pypi
nvidia-nvtx-cu11          11.7.91                  pypi_0    pypi
openssl                   3.0.14               h5eee18b_0  
overrides                 7.4.0           py311h06a4308_0  
packaging                 24.1            py311h06a4308_0  
pandas                    1.5.3                    pypi_0    pypi
pandocfilters             1.5.0              pyhd3eb1b0_0  
parso                     0.8.3              pyhd3eb1b0_0  
pexpect                   4.8.0              pyhd3eb1b0_3  
pillow                    10.4.0                   pypi_0    pypi
pip                       24.0            py311h06a4308_0  
pip-tools                 7.4.1                    pypi_0    pypi
platformdirs              3.10.0          py311h06a4308_0  
prometheus_client         0.14.1          py311h06a4308_0  
prompt-toolkit            3.0.43          py311h06a4308_0  
prompt_toolkit            3.0.43               hd3eb1b0_0  
psutil                    5.9.0           py311h5eee18b_0  
ptyprocess                0.7.0              pyhd3eb1b0_2  
pure_eval                 0.2.2              pyhd3eb1b0_0  
pycparser                 2.21               pyhd3eb1b0_0  
pydantic                  2.8.2                    pypi_0    pypi
pydantic-core             2.20.1                   pypi_0    pypi
pygments                  2.15.1          py311h06a4308_1  
pyparsing                 3.1.2                    pypi_0    pypi
pyproject-hooks           1.1.0                    pypi_0    pypi
pysocks                   1.7.1           py311h06a4308_0  
python                    3.11.9               h955ad1f_0  
python-dateutil           2.9.0post0      py311h06a4308_2  
python-fastjsonschema     2.16.2          py311h06a4308_0  
python-json-logger        2.0.7           py311h06a4308_0  
python-slugify            8.0.4                    pypi_0    pypi
pytz                      2024.1          py311h06a4308_0  
pyyaml                    6.0.1           py311h5eee18b_0  
pyzmq                     25.1.2          py311h6a678d5_0  
readline                  8.2                  h5eee18b_0  
referencing               0.30.2          py311h06a4308_0  
requests                  2.31.0                   pypi_0    pypi
rfc3339-validator         0.1.4           py311h06a4308_0  
rfc3986-validator         0.1.1           py311h06a4308_0  
rpds-py                   0.10.6          py311hb02cf49_0  
scikit-learn              1.3.1                    pypi_0    pypi
scipy                     1.14.0                   pypi_0    pypi
send2trash                1.8.2           py311h06a4308_0  
setuptools                69.5.1          py311h06a4308_0  
six                       1.16.0             pyhd3eb1b0_1  
sniffio                   1.3.0           py311h06a4308_0  
soupsieve                 2.5             py311h06a4308_0  
sqlite                    3.45.3               h5eee18b_0  
stack_data                0.2.0              pyhd3eb1b0_0  
substra                   0.53.0                   pypi_0    pypi
substrafl                 0.46.0                   pypi_0    pypi
substratools              0.21.4                   pypi_0    pypi
sympy                     1.13.1                   pypi_0    pypi
terminado                 0.17.1          py311h06a4308_0  
text-unidecode            1.3                      pypi_0    pypi
threadpoolctl             3.5.0                    pypi_0    pypi
tinycss2                  1.2.1           py311h06a4308_0  
tk                        8.6.14               h39e8969_0  
torch                     2.0.1                    pypi_0    pypi
torchvision               0.15.2                   pypi_0    pypi
tornado                   6.4.1           py311h5eee18b_0  
tqdm                      4.66.4                   pypi_0    pypi
traitlets                 5.14.3          py311h06a4308_0  
triton                    2.0.0                    pypi_0    pypi
typing-extensions         4.12.2                   pypi_0    pypi
typing_extensions         4.11.0          py311h06a4308_0  
tzdata                    2024a                h04d1e81_0  
urllib3                   2.2.2           py311h06a4308_0  
wcwidth                   0.2.5              pyhd3eb1b0_0  
webencodings              0.5.1           py311h06a4308_1  
websocket-client          1.8.0           py311h06a4308_0  
wheel                     0.43.0          py311h06a4308_0  
xz                        5.4.6                h5eee18b_1  
yaml                      0.2.5                h7b6447c_0  
zeromq                    4.3.5                h6a678d5_0  
zlib                      1.2.13               h5eee18b_1  

Logs / Stacktrace

Rounds progress: 100%|██████████| 3/3 [00:00<00:00, 1050.24it/s]
Compute plan progress:  10%|▉         | 2/21 [02:35<24:34, 77.61s/it]
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[14], line 9
      6 # A round is defined by a local training step followed by an aggregation operation
      7 NUM_ROUNDS = 3
----> 9 compute_plan = execute_experiment(
     10     client=clients[ALGO_ORG_ID],
     11     strategy=strategy,
     12     train_data_nodes=train_data_nodes,
     13     evaluation_strategy=my_eval_strategy,
     14     aggregation_node=aggregation_node,
     15     num_rounds=NUM_ROUNDS,
     16     experiment_folder=str(pathlib.Path.cwd() / "tmp" / "experiment_summaries"),
     17     dependencies=dependencies,
     18     clean_models=False,
     19     name="MNIST documentation example",
     20 )

File ~/cloudfiles/code/Users/hpang/conda_envs/substrafl_env/lib/python3.11/site-packages/substrafl/experiment.py:498, in execute_experiment(client, strategy, train_data_nodes, experiment_folder, num_rounds, aggregation_node, evaluation_strategy, dependencies, clean_models, name, additional_metadata, task_submission_batch_size)
    485 # save the experiment summary in experiment_folder
    486 _save_experiment_summary(
    487     experiment_folder=experiment_folder,
    488     compute_plan_key=compute_plan_key,
   (...)
    496     additional_metadata=additional_metadata,
    497 )
--> 498 compute_plan = client.add_compute_plan(
    499     substra.sdk.schemas.ComputePlanSpec(
    500         key=compute_plan_key,
    501         tasks=tasks,
    502         name=name or timestamp,
    503         metadata=cp_metadata,
    504     ),
    505     auto_batching=True,
    506     batch_size=task_submission_batch_size,
    507 )
    508 logger.info(("The compute plan has been registered to Substra, its key is {0}.").format(compute_plan.key))
    509 return compute_plan

File ~/cloudfiles/code/Users/hpang/conda_envs/substrafl_env/lib/python3.11/site-packages/substra/sdk/client.py:48, in logit.<locals>.wrapper(*args, **kwargs)
     46 error = None
     47 try:
---> 48     return f(*args, **kwargs)
     49 except Exception as e:
     50     error = e.__class__.__name__

File ~/cloudfiles/code/Users/hpang/conda_envs/substrafl_env/lib/python3.11/site-packages/substra/sdk/client.py:548, in Client.add_compute_plan(self, data, auto_batching, batch_size)
    542 if not is_valid_uuid(spec.key):
    543     raise exceptions.ComputePlanKeyFormatError(
    544         "The compute plan key has to respect the UUID format. You can use the uuid library to generate it. \
    545     Example: compute_plan_key=str(uuid.uuid4())"
    546     )
--> 548 return self._backend.add(spec, spec_options=spec_options)

File ~/cloudfiles/code/Users/hpang/conda_envs/substrafl_env/lib/python3.11/site-packages/substra/sdk/backends/local/backend.py:487, in Local.add(self, spec, spec_options, key)
    485 else:
    486     if spec.__class__.type_ == schemas.Type.ComputePlan:
--> 487         compute_plan = add_asset(spec, spec_options)
    488         return compute_plan
    489     else:

File ~/cloudfiles/code/Users/hpang/conda_envs/substrafl_env/lib/python3.11/site-packages/substra/sdk/backends/local/backend.py:406, in Local._add_compute_plan(self, spec, spec_options)
    403 compute_plan = self._db.add(compute_plan)
    405 # go through the tasks sorted by rank
--> 406 compute_plan = self.__execute_compute_plan(spec, compute_plan, visited, tasks, spec_options)
    407 return compute_plan

File ~/cloudfiles/code/Users/hpang/conda_envs/substrafl_env/lib/python3.11/site-packages/substra/sdk/backends/local/backend.py:269, in Local.__execute_compute_plan(self, spec, compute_plan, visited, tasks, spec_options)
    266         if not task_spec:
    267             continue
--> 269         self.add(
    270             key=task_spec.key,
    271             spec=task_spec,
    272             spec_options=spec_options,
    273         )
    275         progress_bar.update()
    277 return compute_plan

File ~/cloudfiles/code/Users/hpang/conda_envs/substrafl_env/lib/python3.11/site-packages/substra/sdk/backends/local/backend.py:491, in Local.add(self, spec, spec_options, key)
    489 else:
    490     key = key or spec.compute_key()
--> 491     add_asset(key, spec, spec_options)
    492     return key

File ~/cloudfiles/code/Users/hpang/conda_envs/substrafl_env/lib/python3.11/site-packages/substra/sdk/backends/local/backend.py:437, in Local._add_task(self, key, spec, spec_options)
    420 task = models.Task(
    421     key=key,
    422     creation_date=self.__now(),
   (...)
    433     metadata=spec.metadata if spec.metadata else dict(),
    434 )
    436 task = self._db.add(task)
--> 437 self._worker.schedule_task(task)
    438 return task

File ~/cloudfiles/code/Users/hpang/conda_envs/substrafl_env/lib/python3.11/site-packages/substra/sdk/backends/local/compute/worker.py:313, in Worker.schedule_task(self, task)
    310 elif asset_type == schemas.Type.Dataset:
    311     dataset = self._db.get_with_files(schemas.Type.Dataset, task_input.asset_key)
    312     cmd_line_inputs.append(
--> 313         self._prepare_dataset_input(
    314             dataset=dataset,
    315             task_input=task_input,
    316             input_volume=volumes[VOLUME_INPUTS],
    317             multiple=multiple,
    318         )
    319     )
    320     addable_asset = dataset
    322 if addable_asset:

File ~/cloudfiles/code/Users/hpang/conda_envs/substrafl_env/lib/python3.11/site-packages/substra/sdk/backends/local/compute/worker.py:161, in Worker._prepare_dataset_input(self, dataset, task_input, input_volume, multiple)
    157 def _prepare_dataset_input(
    158     self, dataset: models.Dataset, task_input: models.InputRef, input_volume: str, multiple: bool
    159 ):
    160     path_to_opener = input_volume / Filenames.OPENER.value
--> 161     Path(dataset.opener.storage_address).link_to(path_to_opener)
    162     return TaskResource(
    163         id=task_input.identifier,
    164         value=f"{TPL_VOLUME_INPUTS}/{Filenames.OPENER.value}",
    165         multiple=multiple,
    166     )

File ~/cloudfiles/code/Users/hpang/conda_envs/substrafl_env/lib/python3.11/pathlib.py:1226, in Path.link_to(self, target)
   1211 """
   1212 Make the target path a hard link pointing to this path.
   1213 
   (...)
   1220 Use `hardlink_to()` instead.
   1221 """
   1222 warnings.warn("pathlib.Path.link_to() is deprecated and is scheduled "
   1223               "for removal in Python 3.12. "
   1224               "Use pathlib.Path.hardlink_to() instead.",
   1225               DeprecationWarning, stacklevel=2)
-> 1226 self.__class__(target).hardlink_to(self)

File ~/cloudfiles/code/Users/hpang/conda_envs/substrafl_env/lib/python3.11/pathlib.py:1208, in Path.hardlink_to(self, target)
   1206 if not hasattr(os, "link"):
   1207     raise NotImplementedError("os.link() not available on this system")
-> 1208 os.link(target, self)

OSError: [Errno 95] Operation not supported: '/mnt/batch/tasks/shared/LS_root/mounts/clusters/hpang8/code/Users/hpang/Projects/Federated_learning/substrafl/local-worker/yumnknd_/61c0f7fa-5228-4804-9d24-8beac24bfbc2/mnist_opener.py' -> '/mnt/batch/tasks/shared/LS_root/mounts/clusters/hpang8/code/Users/hpang/Projects/Federated_learning/substrafl/local-worker/d18aa0b7-4aaf-4a4d-9e87-ebead4d168f9/inputs/opener.py'

SdgJlbl commented 1 month ago

Thanks a lot for raising this issue. We were aware that the way of handling paths had changed in 3.12, but I didn't know that it could affect Python versions before that. We will look into it.
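
For readers hitting the same traceback: it ends in `os.link()` failing with Errno 95, which usually means the filesystem holding the local worker directory cannot create hard links. The `/mnt/batch/.../mounts/...` path in the report looks like a network mount, but that is an assumption. Below is a minimal sketch of a general workaround pattern that tries a hard link and falls back to a copy. It is not Substra's actual fix, and `link_or_copy` is a hypothetical helper name that exists in neither Substra nor the standard library.

```python
import errno
import os
import shutil
import tempfile
from pathlib import Path


def link_or_copy(source: Path, destination: Path) -> str:
    """Create `destination` as a hard link to `source`, or copy if linking fails.

    Hypothetical helper for illustration only. Returns "link" or "copy" so the
    caller can tell which strategy was used.
    """
    try:
        # Same direction as Path(source).link_to(destination) in the traceback,
        # and as Path(destination).hardlink_to(source) in newer Python versions.
        os.link(source, destination)
        return "link"
    except OSError as exc:
        # EOPNOTSUPP is errno 95 on Linux, the value shown in the report.
        # EPERM and EXDEV are other common reasons a hard link cannot be made.
        if exc.errno not in (errno.EOPNOTSUPP, errno.EPERM, errno.EXDEV):
            raise
        shutil.copy2(source, destination)
        return "copy"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        opener = Path(tmp) / "mnist_opener.py"
        opener.write_text("# placeholder opener\n")
        target = Path(tmp) / "opener.py"
        print(link_or_copy(opener, target))
```

Running the script from the directory where the experiment is launched gives a quick indication of whether hard links work there. If it prints `copy`, moving the working directory to a local disk may be enough to get the tutorial running, although this has not been verified against the setup in the report.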

KindEmily commented 1 month ago

Hey @hwpang

I'm currently also facing an issue with this tutorial.

I would appreciate any help if you've managed to finish it.

Please contact me 👋

P.S. I'm also active on the Substra Slack channel. You're very welcome to come say hi and share your progress. I'd be happy to be in touch with anyone I can discuss the problems and potential solutions with.

You can find the Slack channel invite on the Substra community page: https://docs.substra.org/en/stable/additional/community.html

Please help 🆘

If you would like to look at my issue, please see the Run-experiment-console-error-help-request branch: https://github.com/KindEmily/Using-Torch-FedAvg-on-MNIST-dataset/tree/Run-experiment-console-error-help-request

KindEmily commented 1 month ago

@SdgJlbl Kindly asking whether you have had a chance to look into this? 🥺

KindEmily commented 3 weeks ago

@hwpang I was able to finish the tutorial by using a flat structure instead of modules (putting all the code in a single file, e.g. main.py).