deepset-ai / FARM

Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.
https://farm.deepset.ai
Apache License 2.0

DataSilo fails to load data #838

Closed: EddyDavies closed this issue 2 years ago

EddyDavies commented 3 years ago

Describe the bug

When DataSilo loads the training data, it fails with the error below, roughly halfway through preprocessing. I'm not sure why, and I'm sorry I don't have more details. I'm following the notebook in this article.

Error message

Preprocessing Dataset ../data/fine_tune_bert/train.csv:  53%|███████████████████████████████████████████████▍                                          | 674000/1279999 [02:24<02:09, 4673.22 Dicts/s]
Traceback (most recent call last):
  File "bert_farm.py", line 87, in <module>
    data_silo = DataSilo(
  File "/home/ubuntu/Trade_With_Twitter/venv/lib/python3.8/site-packages/farm/data_handler/data_silo.py", line 113, in __init__
    self._load_data()
  File "/home/ubuntu/Trade_With_Twitter/venv/lib/python3.8/site-packages/farm/data_handler/data_silo.py", line 222, in _load_data
    self.data["train"], self.tensor_names = self._get_dataset(train_file)
  File "/home/ubuntu/Trade_With_Twitter/venv/lib/python3.8/site-packages/farm/data_handler/data_silo.py", line 185, in _get_dataset
    for dataset, tensor_names, problematic_samples in results:
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
multiprocessing.pool.MaybeEncodingError: Error sending result: '(<torch.utils.data.dataset.TensorDataset object at 0x7f8ccfcc3700>, ['input_ids', 'padding_mask', 'segment_ids', 'text_classification_label_ids'], set())'. Reason: 'RuntimeError('unable to write to file </torch_1395_3740776664>')'

julian-risch commented 2 years ago

Hi @EddyDavies, the DataSilo has a parameter max_processes. I suggest setting it to 1 to disable multiprocessing and checking whether that works as a workaround: https://github.com/deepset-ai/FARM/blob/004ed2c6dbd9d7a2af85d57ec89a689b9363e6c1/farm/data_handler/data_silo.py#L48

We had a similar issue a long time ago (#97), but it had a different cause. Your error message says Reason: 'RuntimeError('unable to write to file </torch_1395_3740776664>')'. Maybe it's a problem with write permissions?
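The write-permissions question can be checked directly before rerunning the script. The following is only a minimal sketch. It assumes a Linux machine where the files PyTorch uses to pass tensors between worker processes live under /dev/shm, which the traceback does not confirm. The helper names check_shared_memory and suggest_max_processes are hypothetical and not part of FARM or PyTorch, and the 1 GiB free-space threshold is an arbitrary placeholder.

```python
import os
import shutil
import tempfile


def check_shared_memory(path="/dev/shm", min_free_bytes=1 << 30):
    """Report whether `path` exists, is writable, and has enough free space.

    Assumption: on Linux, shared-memory files are backed by /dev/shm.
    """
    report = {
        "path": path,
        "exists": os.path.isdir(path),
        "writable": False,
        "free_bytes": 0,
        "enough_space": False,
    }
    if not report["exists"]:
        return report

    report["free_bytes"] = shutil.disk_usage(path).free
    report["enough_space"] = report["free_bytes"] >= min_free_bytes

    try:
        # Create a real temporary file instead of trusting os.access(),
        # which can be misleading on some mounts.
        with tempfile.NamedTemporaryFile(dir=path) as handle:
            handle.write(b"0")
            handle.flush()
        report["writable"] = True
    except OSError:
        report["writable"] = False
    return report


def suggest_max_processes(report, default):
    """Hypothetical helper: fall back to one process if shared memory looks unhealthy."""
    healthy = report["exists"] and report["writable"] and report["enough_space"]
    return default if healthy else 1
```

Calling suggest_max_processes(check_shared_memory(), default=os.cpu_count() or 1) yields a value that could be passed as max_processes when constructing the DataSilo, alongside whatever processor and batch_size the script already uses. If the check reports the directory as missing, read-only, or nearly full, setting max_processes to 1 as suggested above avoids sending results between processes altogether.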

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 21 days if no further activity occurs.