amzn / pecos

PECOS - Prediction for Enormous and Correlated Spaces
https://libpecos.org/
Apache License 2.0

Training process freezes without using GPUs #255

Open TOP-RX opened 1 year ago

TOP-RX commented 1 year ago

Description

I am simply trying to run the GIANT-XRT training code for ogbn-arxiv, but the process seems to freeze without allocating any GPUs for training.

How to Reproduce?


Steps to reproduce


data_dir=./proc_data_xrt/ogbn-arxiv
bash xrt_train.sh ${data_dir}


What have you tried to solve it?


Error message or code output

The code is stuck here, and no GPUs are being used.

warnings.warn(
09/24/2023 01:46:52 - INFO - pecos.xmc.xtransformer.matcher - ***** Running training *****
09/24/2023 01:46:52 - INFO - pecos.xmc.xtransformer.matcher -   Num examples = 169286
09/24/2023 01:46:52 - INFO - pecos.xmc.xtransformer.matcher -   Num labels = 32
09/24/2023 01:46:52 - INFO - pecos.xmc.xtransformer.matcher -   Num Epochs = 4
09/24/2023 01:46:52 - INFO - pecos.xmc.xtransformer.matcher -   Learning Rate Schedule = linear
09/24/2023 01:46:52 - INFO - pecos.xmc.xtransformer.matcher -   Batch size = 256
09/24/2023 01:46:52 - INFO - pecos.xmc.xtransformer.matcher -   Gradient Accumulation steps = 1
09/24/2023 01:46:52 - INFO - pecos.xmc.xtransformer.matcher -   Total optimization steps = 2500
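As a side check (a minimal sketch, not from the original logs), the snippet below assumes only the torch package that the XR-Transformer training path already depends on, and prints whether any CUDA device is visible to the Python environment that runs xrt_train.sh:

import torch

# Report whether CUDA is visible to this Python environment.
print("CUDA available:", torch.cuda.is_available())
print("CUDA build:    ", torch.version.cuda)
print("Device count:  ", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    # Name each visible device, e.g. an A100 or V100.
    print(f"Device {i}:", torch.cuda.get_device_name(i))

If this prints "CUDA available: False", training would silently run on CPU, which could explain why the log stalls right after "Running training" with no GPU utilization.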

Environment


Dong3759 commented 4 months ago

Have you solved this? If so, how? I am having the same problem.