UCSC-VLAA / CLIPA

[NeurIPS 2023] This repository includes the official implementation of our paper "An Inverse Scaling Law for CLIP Training"
Apache License 2.0

Invalid pointer error on torch XLA #5

Closed humzaiqbal closed 1 year ago

humzaiqbal commented 1 year ago

Hi! I want to fine-tune ViT-B-32 using XLA on a TPU v2-8 VM and a TPU v3-8 VM with the following command:

python3 /home/ubuntu/CLIPA/clipa_torch/launch_xla.py --num-devices 1 training.main \
    --train-data /mnt/test/00000.tar \
    --val-data /mnt/test/00001.tar \
    --batch-size 10 \
    --epochs 2 \
    --pretrained openai \
    --lr 0.0001 \
    --model ViT-B-32 \
    --precision 'fp32' \
    --train-num-samples 10

and I get the following error:

src/tcmalloc.cc:332] Attempt to free invalid pointer 0x7ffda3f54f80
https://symbolize.stripped_domain/r/?trace=7f4569e6900b,7f4569e6908f,ffffffffe016ffff,e900000002bffe88&map=
*** SIGABRT received by PID 97581 (TID 97581) on cpu 25 from PID 97581; stack trace: ***
PC: @     0x7f4569e6900b  (unknown)  raise
    @     0x7f441b59fa1a       1152  (unknown)
    @     0x7f4569e69090  750026848  (unknown)
    @ 0xffffffffe0170000  (unknown)  (unknown)
    @ 0xe900000002bffe89  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7f4569e6900b,7f441b59fa19,7f4569e6908f,ffffffffe016ffff,e900000002bffe88&map=ceee8fa20ddf9c34af43f587221e91de:7f440e677000-7f441b7b6840
E0815 22:20:13.843517   97581 coredump_hook.cc:414] RAW: Remote crash data gathering hook invoked.
E0815 22:20:13.843536   97581 coredump_hook.cc:453] RAW: Skipping coredump since rlimit was 0 at process start.
E0815 22:20:13.843555   97581 client.cc:278] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0815 22:20:13.843560   97581 coredump_hook.cc:512] RAW: Sending fingerprint to remote end.
E0815 22:20:13.843566   97581 coredump_socket.cc:120] RAW: Stat failed errno=2 on socket /var/google/services/logmanagerd/remote_coredump.socket
E0815 22:20:13.843574   97581 coredump_hook.cc:518] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] Missing crash reporting socket. Is the listener running?
E0815 22:20:13.843578   97581 coredump_hook.cc:580] RAW: Dumping core locally.
E0815 22:20:14.313395   97581 process_state.cc:784] RAW: Raising signal 6 with default behavior

OS Setup

OS:

Ubuntu 20.04.6 LTS

TPU type:

TPU v2-8 / TPU v3-8 (I tried both and got an exact repro, so it does not seem to be a TPU type issue)

Python packages:

numpy=1.24.3
torch=2.0.1
torch-xla=2.0
torchmetrics=1.0.3
torchvision=0.16.0a0+0d75d9e
tensorboard=2.12.3
tensorboard-data-server=0.7.1
tensorflow=2.12.1
tensorflow-datasets=4.8.2
tensorflow-estimator=2.12.0
tensorflow-hub=0.14.0
tensorflow-io-gcs-filesystem=0.33.0
tensorflow-metadata==.14.0
tensorflow-text=2.12.0
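
A package list like the one above can be collected programmatically, which makes it easier to compare the torch, torch-xla and torchvision builds across machines. The following is only a minimal sketch: the helper name `installed_versions` is hypothetical, and it relies on nothing beyond the standard-library `importlib.metadata` module.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package names taken from the list in this report.
    for name, version in installed_versions(
        ["numpy", "torch", "torch-xla", "torchvision"]
    ).items():
        print(f"{name}={version}")
```

Running it on each TPU VM and diffing the output is one way to confirm that both machines really have the same environment.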

Things I tried

From some debugging output I added, I know that the crash happens when the code calls xm.mark_step here
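
The kind of instrumentation described above can be sketched as follows. This is an illustration, not the code used in this report: `traced_call` and `enable_crash_traces` are hypothetical helper names, and the only library feature used is Python's standard `faulthandler` module, which dumps the Python stack when the process receives a fatal signal such as SIGABRT. Because XLA tensors are evaluated lazily, a crash that surfaces at xm.mark_step may stem from any operation recorded since the previous step, so the markers only narrow down where the failure shows up.

```python
import faulthandler
import sys


def enable_crash_traces():
    """Dump the Python stack of all threads on fatal signals (SIGSEGV, SIGABRT, ...)."""
    faulthandler.enable(file=sys.stderr, all_threads=True)


def traced_call(label, fn, *args, **kwargs):
    """Call fn with flushed markers before and after, so the last marker
    printed before a hard crash shows which call never returned."""
    print(f"[debug] before {label}", file=sys.stderr, flush=True)
    result = fn(*args, **kwargs)
    print(f"[debug] after {label}", file=sys.stderr, flush=True)
    return result


# Intended use in a training loop (assumes torch_xla is installed and
# imported as `import torch_xla.core.xla_model as xm`):
#
#     enable_crash_traces()
#     traced_call("mark_step", xm.mark_step)
```

If the "before" marker appears without a matching "after" marker, the wrapped call is where the process aborted.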

Any thoughts as to what may be happening or any tricks I can use for debugging?

humzaiqbal commented 1 year ago

Managed to fix it with this