comic / evalutils

evalutils helps users create extensions for grand-challenge.org
https://grand-challenge.org
MIT License

Github CI: LFS support #325

Closed (ghost closed this issue 2 years ago)

ghost commented 2 years ago

Description

What I was trying to do (and succeeded): pushing a change to my GitHub repo.

On the GitHub side, the ci.yml workflow starts on push, as expected. My container builds successfully, but the test run fails. The failure occurs when loading the PyTorch pre-trained weights file, and it only happens in the GitHub CI environment. Locally it all works fine.

What I Did / What Happened

cmd: git push

From the GitHub GUI (Actions / workflow log):

Run ./test.sh
*** VOLUME_SUFFIX generation
1+0 records in
1+0 records out
32 bytes copied, 3.4e-05 s, 941 kB/s
*** docker, create transient
cxragecontainer-output-6a43209ce50d4ecacd1ba3746d2af67f
*** docker, run test
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/algorithm/process.py", line 43, in <module>
    CxrAgeContainer().process()
  File "/opt/algorithm/process.py", line 38, in process
    outputs = CxrAgeContainer.predict(inputs=inputs)
  File "/opt/algorithm/process.py", line 31, in predict
    age_detector_nn = AgeDetectorNN(data_loader=inputs)
  File "/opt/algorithm/age_detector_nn.py", line 19, in __init__
    self.__set_model_params(weights_filename=AgeDetectorNN.WEIGHTS_FILE)
  File "/opt/algorithm/age_detector_nn.py", line 26, in __set_model_params
    self.__load_initial_model_weights(weights_filename=weights_filename)
  File "/opt/algorithm/age_detector_nn.py", line 32, in __load_initial_model_weights
    self.__model_age.load_state_dict(torch.load(weights_pathname))
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 585, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 755, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, 'v'.
Error: Process completed with exit code 1.

So the core issue is this: _pickle.UnpicklingError: invalid load key, 'v'.

The weights are stored in my GitHub repository (as required) using Git LFS. Everything works fine locally, on three separate client machines (directly or via ssh). However, it fails on GitHub CI.

I suspect Git LFS isn't working in some way in the GitHub CI environment, so instead of getting my actual 43MB of weights, the container is loading the Git LFS pointer to them.
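One way to confirm this suspicion is to check the weights file before handing it to torch.load. A Git LFS pointer is a small text file whose first line begins with "version https://git-lfs.github.com/spec/v1", which would also explain why the unpickler complains about the load key 'v'. The following is a minimal sketch, not part of evalutils; the function names looks_like_lfs_pointer and ensure_real_weights are hypothetical.

```python
# Minimal sketch: detect a Git LFS pointer file before trying to load weights.
# The helper names below are hypothetical and not part of evalutils or PyTorch.

# First line of every Git LFS pointer file (per the Git LFS pointer spec).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def looks_like_lfs_pointer(path):
    """Return True if the file starts with the Git LFS pointer header."""
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def ensure_real_weights(path):
    """Raise a descriptive error if `path` is an LFS pointer; otherwise return it."""
    if looks_like_lfs_pointer(path):
        raise RuntimeError(
            f"{path} is a Git LFS pointer, not the actual weights file. "
            "The repository was probably checked out without fetching LFS objects."
        )
    return path
```

In the algorithm container this could wrap the existing call, for example torch.load(ensure_real_weights(weights_pathname)), so that the CI log reports a missing LFS download instead of an opaque unpickling error.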

jmsmkn commented 2 years ago

See https://github.com/actions/checkout#usage and the lfs option. I don't know if we should set this to true by default: I think that even if you use GitHub as the backend, it still counts towards the monthly download limit.

jmsmkn commented 2 years ago

I'm fairly sure the solution to this is to add

  with:
    lfs: true

to the checkout step in the generated ci.yml. Given the costs (cash money), I don't think we should enable it by default; this is left as an exercise for the user.