Make torch dependency more flexible

jluethi commented 1 year ago

Currently, we hardcode torch version 1.12 in the fractal-tasks-core dependencies to make it work well on older UZH GPUs. The tasks themselves don't depend on that torch version though and run fine in other torch versions (e.g. 1.13 or even the new 2.0.0).

The 1.12 dependency made some issues on @gusqgm Windows Subsystem Linux test. On the FMI cluster, it's fine on some GPU nodes, but actually runs in the error below on other GPU nodes. I tested with torch 2.0.0 now and then everything works.

Thus, we should make the torch version more flexible. The correct torch version to install depends on the infrastructure, not the task package.

A workaround until we have it is to manually install torch of a given version into the task venv:

source /path/to/task-envs/.fractal/fractal-tasks-core0.9.0/venv/bin/activate
pip uninstall torch
pip install torch==2.0.0

If someone is searching for it, I'm hitting this error message when the torch version doesn't match:

Traceback (most recent call last):
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/fractal_tasks_core/cellpose_segmentation.py", line 693, in <module>
    run_fractal_task(
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/fractal_tasks_core/_utils.py", line 91, in run_fractal_task
    metadata_update = task_function(**task_args.dict(exclude_unset=True))
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/fractal_tasks_core/cellpose_segmentation.py", line 542, in cellpose_segmentation
    new_label_img = masked_loading_wrapper(
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/fractal_tasks_core/lib_masked_loading.py", line 240, in masked_loading_wrapper
    new_label_img = function(image_array, **kwargs)
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/fractal_tasks_core/cellpose_segmentation.py", line 110, in segment_ROI
    mask, _, _ = model.eval(
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/cellpose/models.py", line 552, in eval
    masks, styles, dP, cellprob, p = self._run_cp(x,
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/cellpose/models.py", line 616, in _run_cp
    yf, style = self._run_nets(img, net_avg=net_avg,
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/cellpose/core.py", line 363, in _run_nets
    y, style = self._run_net(img, augment=augment, tile=tile, tile_overlap=tile_overlap,
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/cellpose/core.py", line 442, in _run_net
    y, style = self._run_tiled(imgs, augment=augment, bsize=bsize,
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/cellpose/core.py", line 543, in _run_tiled
    y0, style = self.network(IMG[irange], return_conv=return_conv)
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/cellpose/core.py", line 315, in network
    y, style = self.net(X)
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/cellpose/resnet_torch.py", line 202, in forward
    T0    = self.downsample(data)
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/cellpose/resnet_torch.py", line 84, in forward
    xd.append(self.down[n](y))
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/cellpose/resnet_torch.py", line 47, in forward
    x = self.proj(x) + self.conv[1](self.conv[0](x))
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

tcompa commented 1 year ago

I guess this was meant to be a fractal-tasks-core issue (unless its goal is to provide a way to install different packages on different clusters).

Relevant refs on CUDA/pytorch versions and compatibility:

tcompa commented 1 year ago

(also: ref https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/220)

jluethi commented 1 year ago

My bad. Yes, it should be a tasks issue :)

And the goal would be to allow an admin setting things up or a user installing the core tasks to get more control about which torch version is used. The effect would be that different torch versions are installed on different clusters. Not sure what the best way to make this happen will be, but it shouldn't be a server concern if at all possible :)

tcompa commented 1 year ago

A possible way out would be to add package extras, so that one could install the package as

pip install fractal-tasks-core[pytorch112]
pip install fractal-tasks-core[pytorch113]

Let's rediscuss it.

jluethi commented 1 year ago

Optional extras specify the pytorch version

If nothing is specified, pip install cellpose will install something (likely the newest pytorch version)

jluethi commented 1 year ago

What is our plan regarding torch versions for the fractal-tasks extra? Not the biggest fan of multiple different extra editions tbh, but would be great to allow the torch installation to work better (i.e. also work "out of the box" on more modern system than the UZH GPUs)

tcompa commented 1 year ago

Refs (to explore further):

https://peps.python.org/pep-0508/#environment-markers This would be a decent solution, if we can provide a bunch of those markers that identify the UZH system - and if we can make it work in poetry. Big assumption: this also applies to versions, and not only to the actual presence of a dependency.
https://github.com/python-poetry/poetry/issues/5222
https://github.com/python-poetry/poetry/issues/4231
https://github.com/python-poetry/poetry/issues/6409

tcompa commented 1 year ago

https://peps.python.org/pep-0508/#environment-markers This would be a decent solution, if we can provide a bunch of those markers that identify the UZH system - and if we can make it work in poetry. Big assumption: this also applies to versions, and not only to the actual presence of a dependency.

See

Maybe doable by combining

[tool.poetry.dependencies]
pathlib2 = { version = "^2.2", markers = "python_version <= '3.4' or sys_platform == 'win32'" }

with

[tool.poetry.dependencies]
foo = [
    {version = "<=1.9", python = ">=3.6,<3.8"},
    {version = "^2.0", python = ">=3.8"}
]

tcompa commented 1 year ago

We explored multiple options with @mfranzon, and we don't see any which makes sense to us via conditional dependencies or something similar. We then propose that:

fractal-tasks-core depends on a more flexible torch version (e.g. <=2.0.0)
The sysadmin keeps installing the "correct" version (of torch, for instance) after the task collection is complete.

Since this is very tedious, we also propose the following workaround for doing it automatically (to be included in fractal-server - we can then open issue over there). The /api/v1/task/collect/pip/ currently takes this request body:

{
  "package": "string",
  "package_version": "string",
  "package_extras": "string",
  "python_version": "string"
}

We could add an additional attribute, like custom_package_versions. This would be empty by default, and only at UZH we would set it to custom_package_versions={"torch": "1.12.1"}. The behavior of the task collection would then be:

Perform the whole installation in the standard way (NOTE: this must not fail!)
After the installation is complete, run pip install torch==1.12.1 (where pip is replaced by the actual venv pip that is being used).

CAVEAT: this is messing with the package, and thus creating a not-so-clean log of the installation (although we would still include also the additional torch-installation logs). Such an operation is meant to be restricted to very specific cases, where there is an important dependency on hardware or system libraries - things that a regular user should not be using.

IMPORTANT NOTE 1 This workaround cannot bring us out of the versions supported by fractal-tasks-core (for instance). Say that we now require torch>=1.13.0, and then we set custom_package_versions={"torch": "1.12.1"}. This task-collection operation will fail, because the installation of the custom package goes conflicts with fractal-tasks-core.

IMPORTANT NOTE 2 We should never use this feature to install an additional package. For instance if the fractal-tasks-core does not depend on polars, and we specify custom_package_versions={"polars": "1.0"}, then task collection will fail.

MINOR NOTE: This also fits perfectly with https://github.com/fractal-analytics-platform/fractal-server/issues/686, where we would only need to add the same pip install line in the script.

jluethi commented 1 year ago

Thanks for digging into this! Sounds good to me.

I already tested it with torch 2.0.0 on the FMI side and that also works, so I don't see a strong reason for limiting the torch version at all for the time being.

Having the custom_package_versions sounds convenient for the Pelkmans lab setup. If that's not a major effort, I'd be in favor of having this.

tcompa commented 1 year ago

Server work is deferred to https://github.com/fractal-analytics-platform/fractal-server/issues/740. This issue remains for

[ ] Make torch dependency more flexible (no constraints at all? up to 2.something?)

jluethi commented 1 year ago

I have seen no reason for constraints so far, given that 2.0.0 still worked well. We just need torch for cellpose, right? Do we still add it as an explicit dependency for the extras (to make the custom_package_versions workaround work) or is that not necessary?

jluethi commented 1 year ago

Basically, our torch constraint is: 1) Whatever cellpose needs => they define that 2) Whatever local hardware requires (=> custom_package_versions)

tcompa commented 1 year ago

We just need torch for cellpose, right?

Anndata also uses it, but they are not very strict in the dependency version: torch is not listed as a direct dependency in https://github.com/scverse/anndata/blob/main/pyproject.toml, and pip install anndata in a fresh environment does not install it. I think they just try to import it, and have a fall-back options if the import fails.

To do:

[ ] Find out whether this information is available in some docs, otherwise open an anndata issue about torch supported versions (or at least version that are known to fail)
[ ] Find out whether the issues that we sometimes find in the CI are only happening when torch is an implicit dependency or also when we explicitly include it in pyproject.toml.

Note: the list below is a bunch of not-very-systematic tests. This is all preliminary, but it'd be nice to understand things clearly - since we are already at it.

Here are some raw CI tests

torch=1.12 works fine (as expected, since we always used it), see https://github.com/fractal-analytics-platform/fractal-tasks-core/actions/runs/5201320932
torch=1.13 works fine - see results at https://github.com/fractal-analytics-platform/fractal-tasks-core/actions/runs/5201458940?pr=402
torch=2.0.1 breaks something, see https://github.com/fractal-analytics-platform/fractal-tasks-core/actions/runs/5201406726?pr=402
a different torch=2 test works fine - see https://github.com/fractal-analytics-platform/fractal-tasks-core/actions/runs/5201483327

tcompa commented 1 year ago

Finally found the issue (it's a torch 2.0.1 issue, which is exposed by anndata imports but unrelated to anndata)

Current fix: we have to include torch dependency explicitly, and make it <=2.0.0.

tcompa commented 1 year ago

For the record, the new size of the installed package is quite larger - and I think this is due to the torch 2.0 requirement of nvidia libraries:

$ pwd
/home/tommaso/Fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal

$ du -hs fractal-tasks-core0.10.0a6/
5.4G    fractal-tasks-core0.10.0a6/

$ du -hs fractal-tasks-core0.10.0a6/venv/lib/python3.9/site-packages/* | sort -h | tail -n 5
86M fractal-tasks-core0.10.0a6/venv/lib/python3.9/site-packages/scipy
99M fractal-tasks-core0.10.0a6/venv/lib/python3.9/site-packages/llvmlite
185M    fractal-tasks-core0.10.0a6/venv/lib/python3.9/site-packages/triton
1.3G    fractal-tasks-core0.10.0a6/venv/lib/python3.9/site-packages/torch
2.6G    fractal-tasks-core0.10.0a6/venv/lib/python3.9/site-packages/nvidia

fractal-analytics-platform / fractal-tasks-core

Make torch dependency more flexible #355