I guess this was meant to be a fractal-tasks-core issue (unless its goal is to provide a way to install different packages on different clusters).
Relevant refs on CUDA/pytorch versions and compatibility:
My bad. Yes, it should be a tasks issue :)
And the goal would be to allow an admin setting things up, or a user installing the core tasks, to get more control over which torch version is used. The effect would be that different torch versions are installed on different clusters. Not sure what the best way to make this happen will be, but it shouldn't be a server concern if at all possible :)
A possible way out would be to add package extras, so that one could install the package as
pip install fractal-tasks-core[pytorch112]
pip install fractal-tasks-core[pytorch113]
Let's rediscuss it.
1) Optional extras specify the pytorch version.
2) If nothing is specified, pip install cellpose will install something (likely the newest pytorch version).
What is our plan regarding torch versions for the fractal-tasks extra? Not the biggest fan of multiple different extra editions tbh, but it would be great to allow the torch installation to work better (i.e. also work "out of the box" on more modern systems than the UZH GPUs).
Refs (to explore further):
https://peps.python.org/pep-0508/#environment-markers This would be a decent solution, if we can provide a bunch of those markers that identify the UZH system - and if we can make it work in poetry. Big assumption: this also applies to versions, and not only to the actual presence of a dependency.
See
Maybe doable by combining
[tool.poetry.dependencies]
pathlib2 = { version = "^2.2", markers = "python_version <= '3.4' or sys_platform == 'win32'" }
with
[tool.poetry.dependencies]
foo = [
{version = "<=1.9", python = ">=3.6,<3.8"},
{version = "^2.0", python = ">=3.8"}
]
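A sketch of what that combination might look like, assuming suitable environment markers could actually single out the UZH GPU nodes (the markers below are only placeholders for illustration):
[tool.poetry.dependencies]
# Sketch only: the markers are placeholders; whether any standard environment
# marker can really identify the UZH GPU nodes is the open question.
torch = [
    {version = "1.12.1", markers = "sys_platform == 'linux' and platform_machine == 'x86_64'"},
    {version = ">=1.13", markers = "sys_platform != 'linux'"}
]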
We explored multiple options with @mfranzon, and we don't see any that makes sense to us via conditional dependencies or something similar. We then propose that:
Since this is very tedious, we also propose the following workaround for doing it automatically (to be included in fractal-server - we can then open an issue over there).
The /api/v1/task/collect/pip/ endpoint currently takes this request body:
{
"package": "string",
"package_version": "string",
"package_extras": "string",
"python_version": "string"
}
We could add an additional attribute, like custom_package_versions. This would be empty by default, and only at UZH we would set it to custom_package_versions={"torch": "1.12.1"} (see the example request body below). The behavior of the task collection would then be:
pip install torch==1.12.1
(where pip is replaced by the actual venv pip that is being used).
CAVEAT: this is messing with the package, and thus creating a not-so-clean log of the installation (although we would still include the additional torch-installation logs). Such an operation is meant to be restricted to very specific cases, where there is an important dependency on hardware or system libraries - things that a regular user should not be dealing with.
IMPORTANT NOTE 1
This workaround cannot take us outside the set of versions supported by fractal-tasks-core (for instance). Say that we now require torch>=1.13.0, and then we set custom_package_versions={"torch": "1.12.1"}. This task-collection operation will fail, because the installation of the custom package conflicts with fractal-tasks-core.
IMPORTANT NOTE 2
We should never use this feature to install an additional package. For instance, if fractal-tasks-core does not depend on polars and we specify custom_package_versions={"polars": "1.0"}, then task collection will fail.
MINOR NOTE:
This also fits perfectly with https://github.com/fractal-analytics-platform/fractal-server/issues/686, where we would only need to add the same pip install line in the script.
Thanks for digging into this! Sounds good to me.
I already tested it with torch 2.0.0 on the FMI side and that also works, so I don't see a strong reason for limiting the torch version at all for the time being.
Having the custom_package_versions attribute sounds convenient for the Pelkmans lab setup. If that's not a major effort, I'd be in favor of having this.
Server work is deferred to https://github.com/fractal-analytics-platform/fractal-server/issues/740. This issue remains for
I have seen no reason for constraints so far, given that 2.0.0 still worked well. We just need torch for cellpose, right? Do we still add it as an explicit dependency for the extras (to make the custom_package_versions workaround work) or is that not necessary?
Basically, our torch constraint is:
1) Whatever cellpose needs => they define that
2) Whatever local hardware requires (=> custom_package_versions)
We just need torch for cellpose, right?
Anndata also uses it, but they are not very strict about the dependency version: torch is not listed as a direct dependency in https://github.com/scverse/anndata/blob/main/pyproject.toml, and pip install anndata in a fresh environment does not install it. I think they just try to import it, and have a fall-back option if the import fails.
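For reference, a minimal sketch of that optional-import pattern (purely illustrative, not anndata's actual code):
# Purely illustrative sketch of an optional torch dependency
try:
    import torch
    HAS_TORCH = True
except ImportError:
    torch = None
    HAS_TORCH = False

def as_tensor_if_possible(array):
    # Use torch when it is available, otherwise return the input unchanged
    if HAS_TORCH:
        return torch.as_tensor(array)
    return array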
To do:
Note: the list below is a bunch of not-very-systematic tests. This is all preliminary, but it'd be nice to understand things clearly - since we are already at it.
Here are some raw CI tests
Finally found the issue (it's a torch 2.0.1 issue, which is exposed by anndata imports but unrelated to anndata)
Current fix: we have to include the torch dependency explicitly, and make it <=2.0.0.
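In pyproject.toml terms, this could look roughly like the following sketch (the actual optional/extras wiring in fractal-tasks-core may differ):
[tool.poetry.dependencies]
# Sketch: pin torch explicitly and keep it optional, exposed via the fractal-tasks extra
torch = {version = "<=2.0.0", optional = true}

[tool.poetry.extras]
fractal-tasks = ["torch"]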
For the record, the new size of the installed package is quite a bit larger - and I think this is due to torch 2.0 requiring the nvidia libraries:
$ pwd
/home/tommaso/Fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal
$ du -hs fractal-tasks-core0.10.0a6/
5.4G fractal-tasks-core0.10.0a6/
$ du -hs fractal-tasks-core0.10.0a6/venv/lib/python3.9/site-packages/* | sort -h | tail -n 5
86M fractal-tasks-core0.10.0a6/venv/lib/python3.9/site-packages/scipy
99M fractal-tasks-core0.10.0a6/venv/lib/python3.9/site-packages/llvmlite
185M fractal-tasks-core0.10.0a6/venv/lib/python3.9/site-packages/triton
1.3G fractal-tasks-core0.10.0a6/venv/lib/python3.9/site-packages/torch
2.6G fractal-tasks-core0.10.0a6/venv/lib/python3.9/site-packages/nvidia
Currently, we hardcode torch version 1.12 in the fractal-tasks-core dependencies to make it work well on older UZH GPUs. The tasks themselves don't depend on that torch version though and run fine with other torch versions (e.g. 1.13 or even the new 2.0.0).
The 1.12 dependency caused some issues in @gusqgm's Windows Subsystem for Linux test. On the FMI cluster, it's fine on some GPU nodes, but actually runs into the error below on other GPU nodes. I tested with torch 2.0.0 now and then everything works.
Thus, we should make the torch version more flexible. The correct torch version to install depends on the infrastructure, not the task package.
A workaround until we have this is to manually install a given torch version into the task venv:
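For example, something along these lines, using the task venv's own pip (the path below is only illustrative; adjust it to your FRACTAL_TASKS_DIR and package folder):
$ /path/to/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.10.0a6/venv/bin/pip install "torch==1.12.1"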
In case someone is searching for this: I'm hitting the following error message when the torch version doesn't match: