Closed humzaiqbal closed 1 year ago
Hi! I want to fine-tune ViT-B-32 using XLA on a TPU v2-8 as well as a TPU v3-8 VM, using the following command:
```shell
python3 /home/ubuntu/CLIPA/clipa_torch/launch_xla.py --num-devices 1 training.main \
    --train-data /mnt/test/00000.tar \
    --val-data /mnt/test/00001.tar \
    --batch-size 10 \
    --epochs 2 \
    --pretrained openai \
    --lr 0.0001 \
    --model ViT-B-32 \
    --precision 'fp32' \
    --train-num-samples 10
```
and I get the following error:
```
src/tcmalloc.cc:332] Attempt to free invalid pointer 0x7ffda3f54f80
https://symbolize.stripped_domain/r/?trace=7f4569e6900b,7f4569e6908f,ffffffffe016ffff,e900000002bffe88&map=
*** SIGABRT received by PID 97581 (TID 97581) on cpu 25 from PID 97581; stack trace: ***
PC: @ 0x7f4569e6900b (unknown) raise
    @ 0x7f441b59fa1a 1152 (unknown)
    @ 0x7f4569e69090 750026848 (unknown)
    @ 0xffffffffe0170000 (unknown) (unknown)
    @ 0xe900000002bffe89 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7f4569e6900b,7f441b59fa19,7f4569e6908f,ffffffffe016ffff,e900000002bffe88&map=ceee8fa20ddf9c34af43f587221e91de:7f440e677000-7f441b7b6840
E0815 22:20:13.843517 97581 coredump_hook.cc:414] RAW: Remote crash data gathering hook invoked.
E0815 22:20:13.843536 97581 coredump_hook.cc:453] RAW: Skipping coredump since rlimit was 0 at process start.
E0815 22:20:13.843555 97581 client.cc:278] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0815 22:20:13.843560 97581 coredump_hook.cc:512] RAW: Sending fingerprint to remote end.
E0815 22:20:13.843566 97581 coredump_socket.cc:120] RAW: Stat failed errno=2 on socket /var/google/services/logmanagerd/remote_coredump.socket
E0815 22:20:13.843574 97581 coredump_hook.cc:518] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] Missing crash reporting socket. Is the listener running?
E0815 22:20:13.843578 97581 coredump_hook.cc:580] RAW: Dumping core locally.
E0815 22:20:14.313395 97581 process_state.cc:784] RAW: Raising signal 6 with default behavior
```
OS Setup

OS: Ubuntu 20.04.6 LTS

TPU type: TPU v2-8 / TPU v3-8 (tried on both and got an exact repro, so it's not a TPU-type issue, it seems)

Python packages:

```
numpy=1.24.3
torch=2.0.1
torch-xla=2.0
torchmetrics=1.0.3
torchvision=0.16.0a0+0d75d9e
tensorboard=2.12.3
tensorboard-data-server=0.7.1
tensorflow=2.12.1
tensorflow-datasets=4.8.2
tensorflow-estimator=2.12.0
tensorflow-hub=0.14.0
tensorflow-io-gcs-filesystem=0.33.0
tensorflow-metadata==.14.0
tensorflow-text=2.12.0
```
Things I tried

From adding some debugging, I know this happens when the code calls `xm.mark_step` here.
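For context, `xm.mark_step()` is the point where torch-xla cuts the accumulated lazy-tensor graph and hands it to XLA to compile and execute, so the crash surfacing there doesn't necessarily mean the bug is in that call. A minimal sketch of where it typically sits in a training loop (hypothetical loop, not CLIPA's actual code; it falls back to a no-op off-TPU so the structure is readable anywhere):

```python
# Sketch of a torch-xla step loop; mark_step() is where the lazily
# recorded graph is actually compiled and run on the TPU.
try:
    import torch_xla.core.xla_model as xm
    mark_step = xm.mark_step
except ImportError:
    mark_step = lambda: None  # no-op stand-in when torch_xla is absent

def train_steps(batches, step_fn):
    """Run step_fn on each batch, cutting the lazy graph once per step."""
    losses = []
    for batch in batches:
        losses.append(step_fn(batch))
        # Everything traced since the last cut executes here; this is
        # the call site where the SIGABRT above is raised.
        mark_step()
    return losses

print(train_steps([1, 2, 3], lambda b: b * 0.5))
```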
Any thoughts on what may be happening, or any tricks I can use for debugging?
Managed to fix it with this