Invalid pointer error on torch XLA

Hi! I want to fine-tune ViT-B-32 with XLA on a TPU VM (I tried both a TPU v2-8 and a TPU v3-8), using the following command:

python3 /home/ubuntu/CLIPA/clipa_torch/launch_xla.py --num-devices 1 training.main \
    --train-data /mnt/test/00000.tar \
    --val-data /mnt/test/00001.tar \
    --batch-size 10 \
    --epochs 2 \
    --pretrained openai \
    --lr 0.0001 \
    --model ViT-B-32 \
    --precision 'fp32' \
    --train-num-samples 10

and I get the following error:

src/tcmalloc.cc:332] Attempt to free invalid pointer 0x7ffda3f54f80
https://symbolize.stripped_domain/r/?trace=7f4569e6900b,7f4569e6908f,ffffffffe016ffff,e900000002bffe88&map=
*** SIGABRT received by PID 97581 (TID 97581) on cpu 25 from PID 97581; stack trace: ***
PC: @     0x7f4569e6900b  (unknown)  raise
    @     0x7f441b59fa1a       1152  (unknown)
    @     0x7f4569e69090  750026848  (unknown)
    @ 0xffffffffe0170000  (unknown)  (unknown)
    @ 0xe900000002bffe89  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7f4569e6900b,7f441b59fa19,7f4569e6908f,ffffffffe016ffff,e900000002bffe88&map=ceee8fa20ddf9c34af43f587221e91de:7f440e677000-7f441b7b6840
E0815 22:20:13.843517   97581 coredump_hook.cc:414] RAW: Remote crash data gathering hook invoked.
E0815 22:20:13.843536   97581 coredump_hook.cc:453] RAW: Skipping coredump since rlimit was 0 at process start.
E0815 22:20:13.843555   97581 client.cc:278] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0815 22:20:13.843560   97581 coredump_hook.cc:512] RAW: Sending fingerprint to remote end.
E0815 22:20:13.843566   97581 coredump_socket.cc:120] RAW: Stat failed errno=2 on socket /var/google/services/logmanagerd/remote_coredump.socket
E0815 22:20:13.843574   97581 coredump_hook.cc:518] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] Missing crash reporting socket. Is the listener running?
E0815 22:20:13.843578   97581 coredump_hook.cc:580] RAW: Dumping core locally.
E0815 22:20:14.313395   97581 process_state.cc:784] RAW: Raising signal 6 with default behavior

OS Setup

OS:

Ubuntu 20.04.6 LTS

TPU type:

TPU v2-8 / TPU v3-8 (I tried both and got an exact repro, so it does not seem to be a TPU-type issue)

Python packages:

numpy=1.24.3
torch=2.0.1
torch-xla=2.0
torchmetrics=1.0.3
torchvision=0.16.0a0+0d75d9e
tensorboard=2.12.3
tensorboard-data-server=0.7.1
tensorflow=2.12.1
tensorflow-datasets=4.8.2
tensorflow-estimator=2.12.0
tensorflow-hub=0.14.0
tensorflow-io-gcs-filesystem=0.33.0
tensorflow-metadata==.14.0
tensorflow-text=2.12.0
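
In case it helps anyone trying to reproduce this, here is a rough sketch of how a version list like the one above can be collected from the running environment. It uses only the standard library. The function name collect_versions and the shortened PACKAGES list are just illustrative, and I am assuming the installed distribution names match the pip names shown above.

```python
from importlib import metadata

# Illustrative subset of the packages listed above (assumed to match pip names).
PACKAGES = ["numpy", "torch", "torch-xla", "torchvision", "tensorflow"]


def collect_versions(packages):
    """Return a dict mapping each package name to its installed version string."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Keep missing packages visible instead of failing on the first one.
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in collect_versions(PACKAGES).items():
        print(f"{name}={version}")
```

Running it inside the same Python environment that launches training makes it easier to compare the torch, torch-xla and torchvision builds side by side.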

Things I tried

From some debugging statements I added, I know the crash happens when the code calls xm.mark_step.
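
To separate the training code from the XLA runtime itself, a standalone check along these lines might help. This is only a sketch under assumptions: check_mark_step is a hypothetical helper, not part of CLIPA, and the tiny matrix product stands in for the real model. It enables the standard-library faulthandler so that a fatal signal such as SIGABRT also dumps the Python traceback, then runs one small computation on the XLA device and calls xm.mark_step().

```python
import faulthandler


def check_mark_step():
    """Run one tiny XLA computation and force execution with xm.mark_step()."""
    try:
        import torch
        import torch_xla.core.xla_model as xm
    except ImportError as exc:
        return f"skipped: {exc}"

    try:
        device = xm.xla_device()
        x = torch.randn(4, 4, device=device)
        y = (x @ x).sum()
        # Same call that precedes the crash in the full training run.
        xm.mark_step()
        return f"ok: sum={y.item():.4f}"
    except Exception as exc:
        # Python-level failures are reported; a native abort cannot be caught here.
        return f"error: {exc}"


if __name__ == "__main__":
    # Print the Python traceback if the process receives SIGABRT or SIGSEGV.
    faulthandler.enable()
    print(check_mark_step())
```

If this small script aborts in the same way, the problem is likely in the environment rather than in the training code. If it prints an "ok" line, the crash is more likely tied to something specific in the training graph or data pipeline.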

Any thoughts on what might be happening, or any tricks I can use for debugging?

UCSC-VLAA / CLIPA

Invalid pointer error on torch XLA #5
