deepset-ai / haystack-tutorials

Here you can find all the Tutorials for Haystack 📓
https://haystack.deepset.ai/tutorials
Apache License 2.0
225 stars 80 forks

tutorials 09_dpr_training - Training issue #317

Open choi-yongsuk opened 2 months ago

choi-yongsuk commented 2 months ago

Describe the issue
The following error occurred while running "tutorials 09_dpr_training" in Google Colab.

To Reproduce
https://haystack.deepset.ai/tutorials/09_dpr_training
https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/09_DPR_training.ipynb

  1. Installing Haystack - ok
  2. Enabling Telemetry - ok
  3. Logging - ok
  4. Download Original DPR Training Data - ok
  5. Option 1: Training DPR from Scratch - ok
  6. Initialization - ok
  7. Training - error
# Start training our model and save it when it is finished

retriever.train(
    data_dir=doc_dir,
    train_filename=train_filename,
    dev_filename=dev_filename,
    test_filename=dev_filename,
    n_epochs=1,
    batch_size=16,
    grad_acc_steps=8,
    save_dir=save_dir,
    evaluate_every=3000,
    embed_title=True,
    num_positives=1,
    num_hard_negatives=1,
)

INFO:haystack.modeling.data_handler.data_silo:
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|
 (o)(o)------'\ _ /     ( )

INFO:haystack.modeling.data_handler.data_silo:LOADING TRAIN DATA
INFO:haystack.modeling.data_handler.data_silo:==================
INFO:haystack.modeling.data_handler.data_silo:Loading train set from: data/tutorial9/train/biencoder-nq-train.json 
Preprocessing dataset: 100%|██████████| 115/115 [01:28<00:00,  1.30 Dicts/s]
INFO:haystack.modeling.data_handler.data_silo:
INFO:haystack.modeling.data_handler.data_silo:LOADING DEV DATA
INFO:haystack.modeling.data_handler.data_silo:=================
INFO:haystack.modeling.data_handler.data_silo:Loading dev set from: data/tutorial9/dev/biencoder-nq-dev.json
Preprocessing dataset: 100%|██████████| 13/13 [00:09<00:00,  1.36 Dicts/s]
INFO:haystack.modeling.data_handler.data_silo:
INFO:haystack.modeling.data_handler.data_silo:LOADING TEST DATA
INFO:haystack.modeling.data_handler.data_silo:=================
INFO:haystack.modeling.data_handler.data_silo:Loading test set from: data/tutorial9/dev/biencoder-nq-dev.json
Preprocessing dataset: 100%|██████████| 13/13 [00:09<00:00,  1.43 Dicts/s]
INFO:haystack.modeling.data_handler.data_silo:
INFO:haystack.modeling.data_handler.data_silo:DATASETS SUMMARY
INFO:haystack.modeling.data_handler.data_silo:================
INFO:haystack.modeling.data_handler.data_silo:Examples in train: 58880
INFO:haystack.modeling.data_handler.data_silo:Examples in dev  : 6515
INFO:haystack.modeling.data_handler.data_silo:Examples in test : 6515
INFO:haystack.modeling.data_handler.data_silo:Total examples   : 71910
INFO:haystack.modeling.data_handler.data_silo:
INFO:haystack.modeling.data_handler.data_silo:Longest query length observed after clipping: 31   - for max_query_len: 64
INFO:haystack.modeling.data_handler.data_silo:Average query length after clipping:          11.807353940217391
INFO:haystack.modeling.data_handler.data_silo:Proportion queries clipped:                   0.0
INFO:haystack.modeling.data_handler.data_silo:
INFO:haystack.modeling.data_handler.data_silo:Longest passage length observed after clipping: 223.0   - for max_passage_len: 256
INFO:haystack.modeling.data_handler.data_silo:Average passage length after clipping:          139.19262907608694
INFO:haystack.modeling.data_handler.data_silo:Proportion passages clipped:                    0.0
INFO:haystack.modeling.model.optimization:Loading optimizer 'AdamW': {'correct_bias': True, 'weight_decay': 0.0, 'eps': 1e-08, 'lr': 1e-05}
INFO:haystack.modeling.model.optimization:Using scheduler 'get_linear_schedule_with_warmup'
INFO:haystack.modeling.model.optimization:Loading schedule 'get_linear_schedule_with_warmup': '{'num_warmup_steps': 100, 'num_training_steps': 460}'
INFO:haystack.modeling.training.base:No train checkpoints found. Starting a new training ...
Train epoch 0/0 (Cur. train loss: 0.0000):   0%|          | 0/3680 [00:00<?, ?it/s]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-1d070a39096a> in <cell line: 3>()
      1 # Start training our model and save it when it is finished
      2 
----> 3 retriever.train(
      4     data_dir=doc_dir,
      5     train_filename=train_filename,

7 frames
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py in _get_default_group()
    975     """Get the default process group created by init_process_group."""
    976     if not is_initialized():
--> 977         raise ValueError(
    978             "Default process group has not been initialized, "
    979             "please make sure to call init_process_group."

ValueError: Default process group has not been initialized, please make sure to call init_process_group.
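
For readers unfamiliar with this message: the traceback shows `_get_default_group()` raising because `is_initialized()` returned false, meaning something asked for the default process group before `init_process_group` had been called. The sketch below is not Haystack's code and not a confirmed fix for this issue. It only illustrates, under that reading of the traceback, how a caller can check the state first. `distributed_ready` and `safe_world_size` are hypothetical helper names.

```python
# Optional import so the sketch also runs where PyTorch is not installed.
try:
    import torch.distributed as dist
except ImportError:
    dist = None


def distributed_ready() -> bool:
    """True only if torch.distributed is usable and init_process_group() has run."""
    return dist is not None and dist.is_available() and dist.is_initialized()


def safe_world_size() -> int:
    """Hypothetical helper: the world size, or 1 for a single-process run."""
    if distributed_ready():
        # Only safe to call once a default process group exists.
        return dist.get_world_size()
    return 1


print(distributed_ready(), safe_world_size())
```

In a single-process session where `init_process_group` was never called, the guard reports that no process group exists and the helper falls back to a world size of 1 instead of raising.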

Expected behavior
Data loading and preprocessing completed, but a problem then occurred with process group initialization, and training stopped with “ValueError: Default process group has not been initialized, please make sure to call init_process_group.”
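
As a quick sanity check that the setup was consistent up to the point of failure, the numbers in the log can be reproduced from the arguments passed to `retriever.train`. This is a sketch, not Haystack's own computation: the rounding convention (ceiling for batches, floor for accumulated steps) is an assumption, and it makes no difference here because both divisions are exact.

```python
import math

# Values taken from the log output and the train() call in this report.
train_examples = 58880   # "Examples in train: 58880"
batch_size = 16
grad_acc_steps = 8
n_epochs = 1

# Batches per epoch, as shown in the progress bar ("0/3680").
batches_per_epoch = math.ceil(train_examples / batch_size)

# Optimizer steps after gradient accumulation ("num_training_steps": 460).
optimizer_steps = (batches_per_epoch // grad_acc_steps) * n_epochs

print(batches_per_epoch, optimizer_steps)  # 3680 460
```

Both values match the log, which suggests the data silo and scheduler were set up as requested and the failure happened only once the first training step began.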

What environment did you try to run the tutorial on?:

jinruiyang commented 1 month ago

Same here! Did you find a way to fix it?