DeepWok / mase

Machine-Learning Accelerator System Exploration Tools

loading checkpoints onto cpu machine #46

Closed bardia01 closed 5 months ago

bardia01 commented 5 months ago

Question: When loading a trained model onto a CPU-only machine, an error occurs in mase/machop/chop/tools/checkpoint_load.py because the checkpoint was saved on a CUDA device and no GPU is available.

Commit hash: https://github.com/DeepWok/mase/commit/e98079f83827c3457be56e7f2b83d22d62fe780f

Command to reproduce: ./ch search --accelerator cpu --config configs/examples/jsc_bardia_by_type.toml --load /content/mase/mase_output/jsc_bardia_e_50_b_128_l_001/software/training_ckpts/best.ckpt

Error log:

Traceback (most recent call last):
  File "/content/mase/machop/./ch", line 6, in <module>
    ChopCLI().run()
  File "/content/mase/machop/chop/cli.py", line 270, in run
    run_action_fn()
  File "/content/mase/machop/chop/cli.py", line 395, in _run_search
    search(**search_params)
  File "/content/mase/machop/chop/actions/search/search.py", line 58, in search
    model = load_model(load_name=load_name, load_type=load_type, model=model)
  File "/content/mase/machop/chop/tools/checkpoint_load.py", line 84, in load_model
    model = load_lightning_ckpt_to_unwrapped_model(
  File "/content/mase/machop/chop/tools/checkpoint_load.py", line 15, in load_lightning_ckpt_to_unwrapped_model
    src_state_dict = torch.load(checkpoint)["state_dict"]
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1014, in load
    return _load(opened_zipfile,
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1422, in _load
    result = unpickler.load()
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1392, in persistent_load
    typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1366, in load_tensor
    wrap_storage=restore_location(storage, location),
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 381, in default_restore_location
    result = fn(storage, location)
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 274, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 258, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

Comments: Would you please consider changing "src_state_dict = torch.load(checkpoint)["state_dict"]" to something like:

if torch.cuda.is_available():
    src_state_dict = torch.load(checkpoint)["state_dict"]
else:
    src_state_dict = torch.load(checkpoint, map_location=torch.device("cpu"))["state_dict"]

so that this doesn't break when using CPU
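For reference, passing map_location="cpu" unconditionally achieves the same effect without the branch, since it only remaps tensor storages at load time and is harmless on GPU machines too. A minimal self-contained sketch (the checkpoint file and state dict here are synthetic, not the repo's actual code):

```python
import os
import tempfile

import torch

# Simulate a Lightning-style checkpoint file (hypothetical contents).
ckpt_path = os.path.join(tempfile.mkdtemp(), "demo.ckpt")
torch.save({"state_dict": {"w": torch.ones(2, 2)}}, ckpt_path)

# map_location="cpu" loads all tensors onto the CPU regardless of whether
# CUDA is available, so no torch.cuda.is_available() check is needed.
src_state_dict = torch.load(ckpt_path, map_location="cpu")["state_dict"]
print(src_state_dict["w"].device)
```

This avoids duplicating the torch.load call and still works when a GPU is present, at the cost of always loading onto the CPU first.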

The --accelerator flag doesn't seem to help; the table below shows that the accelerator is correctly overridden to cpu, yet the error still occurs.

+-------------------------+--------------------------+--------------+--------------------------+--------------------------+
| Name                    |         Default          | Config. File |     Manual Override      |        Effective         |
+-------------------------+--------------------------+--------------+--------------------------+--------------------------+
| task                    |      classification      |     cls      |                          |           cls            |
| load_name               |           None           |              | /content/mase/mase_outpu | /content/mase/mase_outpu |
|                         |                          |              | t/jsc_bardia_e_50_b_128_ | t/jsc_bardia_e_50_b_128_ |
|                         |                          |              | l_001/software/training_ | l_001/software/training_ |
|                         |                          |              |     ckpts/best.ckpt      |     ckpts/best.ckpt      |
| load_type               |            mz            |      pl      |                          |            pl            |
| batch_size              |           128            |     512      |                          |           512            |
| to_debug                |          False           |              |                          |          False           |
| log_level               |           info           |              |                          |           info           |
| report_to               |       tensorboard        |              |                          |       tensorboard        |
| seed                    |            0             |      42      |                          |            42            |
| quant_config            |           None           |              |                          |           None           |
| training_optimizer      |           adam           |              |                          |           adam           |
| trainer_precision       |         16-mixed         |              |                          |         16-mixed         |
| learning_rate           |          1e-05           |     0.01     |                          |           0.01           |
| weight_decay            |            0             |              |                          |            0             |
| max_epochs              |            20            |      5       |                          |            5             |
| max_steps               |            -1            |              |                          |            -1            |
| accumulate_grad_batches |            1             |              |                          |            1             |
| log_every_n_steps       |            50            |      5       |                          |            5             |
| num_workers             |            2             |              |                          |            2             |
| num_devices             |            1             |              |                          |            1             |
| num_nodes               |            1             |              |                          |            1             |
| accelerator             |           auto           |     cpu      |           cpu            |           cpu            |
| strategy                |           auto           |              |                          |           auto           |
| is_to_auto_requeue      |          False           |              |                          |          False           |
| github_ci               |          False           |              |                          |          False           |
| disable_dataset_cache   |          False           |              |                          |          False           |
| target                  |   xcu250-figd2104-2L-e   |              |                          |   xcu250-figd2104-2L-e   |
| num_targets             |           100            |              |                          |           100            |
| is_pretrained           |          False           |              |                          |          False           |
| max_token_len           |           512            |              |                          |           512            |
| project_dir             | /content/mase/mase_outpu |              |                          | /content/mase/mase_outpu |
|                         |            t             |              |                          |            t             |
| project                 |           None           |   jsc-tiny   |                          |         jsc-tiny         |
| model                   |           None           |  jsc-bardia  |                          |        jsc-bardia        |
| dataset                 |           None           |     jsc      |                          |           jsc            |
+-------------------------+--------------------------+--------------+--------------------------+--------------------------+
Aaron-Zhao123 commented 5 months ago

Can you please follow the template for filing an issue?

Additionally, the proposal presented here doesn't seem feasible to me at the moment. Perhaps exploring the option --accelerator cpu might yield something different for you? However, troubleshooting would be more straightforward if you included your command, as exemplified in the provided template.

bardia01 commented 5 months ago

Hi, I changed the format. Using the accelerator flag doesn't seem to help; additionally, I made the change I suggested locally and it does fix the issue.

firemountain154B commented 5 months ago

Hi bardia, it seems your proposed change doesn't follow the mase coding conventions. You can try the modification in commit 6947cc3f50f7cd1e71414ec2fbc1024421a1c7a3 to check whether it solves your problem.