ServiceNow / picard

PICARD - Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. PICARD is a ServiceNow Research project that was started at Element AI.
https://arxiv.org/abs/2109.05093
Apache License 2.0

"make train" stuck at Training #140

Open · ravidborse opened this issue 1 year ago

ravidborse commented 1 year ago

For the last hour it has been stuck at:

```
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 402.99it/s]
08/28/2023 18:58:48 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /transformers_cache/spider/spider/1.0.0/df8615a31625b12f701e3840f2502d74f4b533dc60aa364a1f48cfd198acc326/cache-7e03875afb379451.arrow
08/28/2023 18:58:48 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /transformers_cache/spider/spider/1.0.0/df8615a31625b12f701e3840f2502d74f4b533dc60aa364a1f48cfd198acc326/cache-06decf315ea7a716.arrow
08/28/2023 18:58:49 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /transformers_cache/spider/spider/1.0.0/df8615a31625b12f701e3840f2502d74f4b533dc60aa364a1f48cfd198acc326/cache-6ef067fed50d786a.arrow
08/28/2023 18:58:49 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /transformers_cache/spider/spider/1.0.0/df8615a31625b12f701e3840f2502d74f4b533dc60aa364a1f48cfd198acc326/cache-e3414ffb7b73b322.arrow
08/28/2023 18:58:51 - WARNING - seq2seq.utils.dataset_loader - The split train of the dataset spider contains 8 duplicates out of 7000 examples
Running training
  Num examples = 7000
  Num Epochs = 3072
  Instantaneous batch size per device = 5
  Total train batch size (w. parallel, distributed & accumulation) = 2050
  Gradient Accumulation steps = 410
  Total optimization steps = 9216
  0%|          | 0/9216 [00:00<?, ?it/s]
```
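Note that with an instantaneous batch size of 5 and 410 gradient accumulation steps, one optimization step processes 5 × 410 = 2050 examples, and the `0/9216` progress bar only advances after all 410 micro-batches of a step have finished, so it can sit at 0% for a long time even when training is progressing. A minimal sketch, not part of the PICARD codebase, of a heartbeat callback that prints one line per micro-batch to tell a slow first step apart from a real hang; it assumes a transformers version that exposes `TrainerCallback.on_substep_end`:

```python
# Minimal sketch (assumption: TrainerCallback.on_substep_end is available
# in the installed transformers version): print a heartbeat for every
# gradient-accumulation micro-batch.
import time
from transformers import TrainerCallback


class MicroBatchHeartbeat(TrainerCallback):
    """Log each micro-batch so progress is visible before the first
    optimization step moves the main 0/9216 progress bar."""

    def __init__(self):
        self._last = time.time()
        self._micro_batches = 0

    def on_substep_end(self, args, state, control, **kwargs):
        # Fires once per micro-batch during gradient accumulation.
        self._micro_batches += 1
        now = time.time()
        print(f"micro-batch {self._micro_batches}/{args.gradient_accumulation_steps} "
              f"of step {state.global_step + 1} took {now - self._last:.1f}s",
              flush=True)
        self._last = now

    def on_step_end(self, args, state, control, **kwargs):
        # Fires only after all accumulation substeps, i.e. when the main
        # progress bar finally advances by one.
        self._micro_batches = 0
        self._last = time.time()
```

Registering it with `trainer.add_callback(MicroBatchHeartbeat())` before the `trainer.train(...)` call in seq2seq/run_seq2seq.py (hypothetical placement) would make each of the 410 micro-batches visible in the log.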

ravidborse commented 1 year ago

Actually, it's stuck in torch.autograd.backward:

```
Running training
  Num examples = 7000
  Num Epochs = 3072
  Instantaneous batch size per device = 5
  Total train batch size (w. parallel, distributed & accumulation) = 2050
  Gradient Accumulation steps = 410
  Total optimization steps = 9216
  0%|          | 0/9216 [00:00<?, ?it/s]^C
Traceback (most recent call last):
  File "seq2seq/run_seq2seq.py", line 271, in <module>
    main()
  File "seq2seq/run_seq2seq.py", line 216, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/lib/python3.7/site-packages/transformers/trainer.py", line 1400, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/trainer.py", line 2002, in training_step
    loss.backward()
  File "/opt/conda/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
```
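The ^C traceback only shows where the interrupt landed (inside torch.autograd.backward); on its own it does not prove a deadlock. A minimal sketch, not part of run_seq2seq.py, that uses the standard-library faulthandler module to dump every thread's stack periodically; if the same backward frame appears in dump after dump, the step is genuinely hung rather than merely slow:

```python
# Minimal sketch using only the standard library: dump all thread stacks
# to stderr every 5 minutes so a hang inside loss.backward() shows up in
# the training log without needing to Ctrl-C the run.
import faulthandler
import sys

faulthandler.dump_traceback_later(300, repeat=True, file=sys.stderr)
```

An external alternative that requires no code change is `py-spy dump --pid <training process PID>`, run from outside the training process.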