Can't launch on GCP (Nvidia T4/L4)

filipovichespri commented 1 month ago

@Zheng-Chong, thanks for your awesome VTON solution!

I have a problem launching the Gradio example on GCP Compute Engine with T4/L4 cards (16/24 GB VRAM). When I try to launch the example, I get this error:

Traceback (most recent call last):
  File "/home/***/ne/CatVTON/app.py", line 116, in <module>
    automasker = AutoMasker(
  File "/home/***/ne/CatVTON/model/cloth_masker.py", line 166, in __init__
    self.schp_processor_atr = SCHP(ckpt_path=os.path.join(schp_ckpt, 'exp-schp-201908301523-atr.pth'), device=device)
  File "/home/***/ne/CatVTON/model/SCHP/__init__.py", line 73, in __init__
    self.load_ckpt(ckpt_path)
  File "/home/***/ne/CatVTON/model/SCHP/__init__.py", line 104, in load_ckpt
    self.model.load_state_dict(new_state_dict_, strict=False)
  File "/home/***/miniconda3/envs/catvton/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ResNet:
        size mismatch for fushion.4.weight: copying a param with shape torch.Size([18, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([20, 256, 1, 1]).
        size mismatch for fushion.4.bias: copying a param with shape torch.Size([18]) from checkpoint, the shape in current model is torch.Size([20]).

Could you please explain why this is happening?

Zheng-Chong commented 1 month ago

It means that the exp-schp-201908301523-atr.pth checkpoint on your machine does not match the SCHP model. The 18 channels in the exp-schp-201908301523-atr.pth weights are correct; the 20 channels in the SCHP model are not. This may be caused by a code modification. Make sure your local repo is up to date and identical to the GitHub repo.
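
To see exactly which parameters disagree, the checkpoint and model shapes can be compared by name. The following is a minimal sketch, not code from the repo: `find_shape_mismatches` is a hypothetical helper that works on plain dicts of shape tuples, and the example values are copied from the traceback above. With PyTorch, such dicts could be built with something like `{k: tuple(v.shape) for k, v in state_dict.items()}`, assuming the checkpoint loads to a plain state dict.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(
    ckpt_shapes: Dict[str, Shape],
    model_shapes: Dict[str, Shape],
) -> List[Tuple[str, Shape, Shape]]:
    """Return (name, checkpoint_shape, model_shape) for every parameter
    present in both mappings whose shapes differ."""
    mismatches = []
    for name, ckpt_shape in ckpt_shapes.items():
        model_shape = model_shapes.get(name)
        # Parameters missing from the model are skipped here; only
        # shape conflicts are reported.
        if model_shape is not None and model_shape != ckpt_shape:
            mismatches.append((name, ckpt_shape, model_shape))
    return mismatches


# Shapes taken from the error message in this issue.
ckpt_shapes = {
    "fushion.4.weight": (18, 256, 1, 1),
    "fushion.4.bias": (18,),
}
model_shapes = {
    "fushion.4.weight": (20, 256, 1, 1),
    "fushion.4.bias": (20,),
}

for name, ckpt_shape, model_shape in find_shape_mismatches(ckpt_shapes, model_shapes):
    print(f"{name}: checkpoint {ckpt_shape} vs model {model_shape}")
```

Only the first dimension (the number of output classes) differs, which suggests the model was built for a different label set than the one the checkpoint was trained on.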

filipovichespri commented 1 month ago

@Zheng-Chong well, it was a plain 'git clone' on a freshly created VM, with no code modifications at all. Are there any env vars or machine states that could change the expected behaviour?

UPD. I compared the files with a working local deployment: all files (even inside the .cache folder) are the same (checksummed). Git diff output:

diff --git a/__pycache__/utils.cpython-39.pyc b/__pycache__/utils.cpython-39.pyc
index 0a3728d..b9ef844 100644
Binary files a/__pycache__/utils.cpython-39.pyc and b/__pycache__/utils.cpython-39.pyc differ
diff --git a/model/SCHP/networks/__pycache__/AugmentCE2P.cpython-39.pyc b/model/SCHP/networks/__pycache__/AugmentCE2P.cpython-39.pyc
index af91fbe..f71b379 100644
Binary files a/model/SCHP/networks/__pycache__/AugmentCE2P.cpython-39.pyc and b/model/SCHP/networks/__pycache__/AugmentCE2P.cpython-39.pyc differ
diff --git a/model/SCHP/networks/__pycache__/__init__.cpython-39.pyc b/model/SCHP/networks/__pycache__/__init__.cpython-39.pyc
index 2468e7a..c164d91 100644
Binary files a/model/SCHP/networks/__pycache__/__init__.cpython-39.pyc and b/model/SCHP/networks/__pycache__/__init__.cpython-39.pyc differ
diff --git a/model/SCHP/utils/__pycache__/transforms.cpython-39.pyc b/model/SCHP/utils/__pycache__/transforms.cpython-39.pyc
index fa72475..9f77ca2 100644
Binary files a/model/SCHP/utils/__pycache__/transforms.cpython-39.pyc and b/model/SCHP/utils/__pycache__/transforms.cpython-39.pyc differ

Same error after resetting the __pycache__ files via 'git reset'.
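
For reference, a checksum comparison like the one described above can be sketched as follows. This is an illustrative script, not the one actually used: `checksum_tree` and `diff_trees` are hypothetical helper names, and skipping `__pycache__` and `.git` is an assumption made so that compiled bytecode does not show up as a difference.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, List


def checksum_tree(root: str, skip_dirs: Iterable[str] = ("__pycache__", ".git")) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root_path = Path(root)
    skip = set(skip_dirs)
    digests = {}
    for path in sorted(root_path.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root_path)
        # Ignore anything located under a skipped directory.
        if any(part in skip for part in rel.parts):
            continue
        hasher = hashlib.sha256()
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large checkpoints do not fill memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                hasher.update(chunk)
        digests[rel.as_posix()] = hasher.hexdigest()
    return digests


def diff_trees(a: Dict[str, str], b: Dict[str, str]) -> List[str]:
    """Return the sorted relative paths that are missing on one side or differ."""
    return sorted(name for name in set(a) | set(b) if a.get(name) != b.get(name))
```

Running `checksum_tree` on both machines and passing the two results to `diff_trees` yields an empty list when the trees match. In that case the difference presumably lies outside the compared files, for example in the environment or in the absolute paths.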

inxi -Fxz output:

System:    Kernel: 5.15.0-1069-gcp x86_64 bits: 64 compiler: N/A Console: tty 0 Distro: Ubuntu 20.04.6 LTS (Focal Fossa) 
Machine:   Type: Kvm Mobo: Google model: Google Compute Engine serial: <filter> UEFI: Google v: Google date: 09/13/2024 
CPU:       Topology: Quad Core model: Intel Xeon bits: 64 type: MT MCP arch: Cascade Lake rev: 7 L2 cache: 38.5 MiB 
           flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 bogomips: 35202 
           Speed: 2200 MHz min/max: N/A Core speeds (MHz): 1: 2200 2: 2200 3: 2200 4: 2200 5: 2200 6: 2200 7: 2200 8: 2200 
Graphics:  Device-1: NVIDIA driver: nvidia v: 550.90.07 bus ID: 00:03.0 
           Display: server: No display server data found. Headless machine? tty: 286x65 
           Message: Advanced graphics data unavailable in console. Try -G --display 
Audio:     Message: No Device data found. 
Network:   Device-1: Intel 82371AB/EB/MB PIIX4 ACPI type: network bridge driver: N/A port: N/A bus ID: 00:01.3 
           Device-2: Red Hat Virtio network driver: virtio-pci v: 1 port: c000 bus ID: 00:04.0 
           IF: ens4 state: up speed: -1 duplex: unknown mac: <filter> 
Drives:    Local Storage: total: 200.00 GiB used: 122.29 GiB (61.1%) 
           ID-1: /dev/nvme0n1 model: nvme_card-pd size: 200.00 GiB temp: 30 C 
Partition: ID-1: / size: 193.65 GiB used: 122.29 GiB (63.1%) fs: ext4 dev: /dev/nvme0n1p1 
Sensors:   Message: No sensors data was found. Is sensors configured? 
Info:      Processes: 181 Uptime: 1m Memory: 31.34 GiB used: 480.1 MiB (1.5%) Init: systemd runlevel: 5 Compilers: gcc: 9.4.0 
           Shell: bash v: 5.0.17 inxi: 3.0.38

nvidia-smi output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:00:03.0 Off |                    0 |
| N/A   41C    P8             12W /   72W |       1MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Zheng-Chong / CatVTON

Can't launch on GCP (Nvidia T4/L4) #63