Closed · Serbernari closed this issue 1 year ago
As a quick fix I changed `max_answers: int = 6` to `max_answers: int = 10` and it works flawlessly now.
Hey @Serbernari, thank you for reporting this! I can reproduce it as well. By the way, we accept contributions. Would you like to open a PR with your suggested solution? That way it will be faster for us to fix. If not, our team will take it over for sure.
Hey @bilgeyucel! Just opened a PR (#4817) that solves this issue by auto-determining max_answers by default. Would be great if you could take a look!
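The auto-determination described above can be approximated as follows. This is a hedged sketch, not the actual PR code: the function name `infer_max_answers` and the exact SQuAD-style dict layout are assumptions for illustration.

```python
# Sketch: derive max_answers from the data instead of hardcoding it.
# Scans SQuAD-style dicts and returns the largest answer count found,
# with a floor of 1 so no-answer questions still get a label slot.

def infer_max_answers(squad_dicts):
    max_answers = 1
    for article in squad_dicts:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                max_answers = max(max_answers, len(qa["answers"]))
    return max_answers

# A question with 7 answers (as in PolicyQA) and one with 2:
sample = [{"paragraphs": [{"qas": [
    {"answers": [{"text": "a"}] * 7},
    {"answers": [{"text": "b"}] * 2},
]}]}]

print(infer_max_answers(sample))  # → 7
```

With this, the preallocated label array is always large enough for the dataset at hand, at the cost of one extra pass over the data.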
Describe the bug
When a question has more than 6 answers, it can (and does) cause an "index out of bounds" error in some scenarios.
Error message
```
IndexError                                Traceback (most recent call last)

/usr/local/lib/python3.8/dist-packages/haystack/nodes/reader/farm.py in train(self, data_dir, train_filename, dev_filename, test_filename, use_gpu, devices, batch_size, n_epochs, learning_rate, max_seq_len, warmup_proportion, dev_split, evaluate_every, save_dir, num_processes, use_amp, checkpoint_root_dir, checkpoint_every, checkpoints_to_keep, caching, cache_path, grad_acc_steps, early_stopping, max_query_length)
    435         :return: None
    436         """
--> 437         return self._training_procedure(
    438             data_dir=data_dir,
    439             train_filename=train_filename,

/usr/local/lib/python3.8/dist-packages/haystack/nodes/reader/farm.py in _training_procedure(self, data_dir, train_filename, dev_filename, test_filename, use_gpu, devices, batch_size, n_epochs, learning_rate, max_seq_len, warmup_proportion, dev_split, evaluate_every, save_dir, num_processes, use_amp, checkpoint_root_dir, checkpoint_every, checkpoints_to_keep, teacher_model, teacher_batch_size, caching, cache_path, distillation_loss_weight, distillation_loss, temperature, tinybert, processor, grad_acc_steps, early_stopping, distributed, doc_stride, max_query_length)
    269             )
    270         else:  # caching would need too much memory for tinybert distillation so in that case we use the default data silo
--> 271             data_silo = DataSilo(
    272                 processor=processor,
    273                 batch_size=batch_size,

/usr/local/lib/python3.8/dist-packages/haystack/modeling/data_handler/data_silo.py in __init__(self, processor, batch_size, eval_batch_size, distributed, automatic_loading, max_multiprocessing_chunksize, max_processes, multiprocessing_strategy, caching, cache_path)
    102         # In most cases we want to load all data automatically, but in some cases we rather want to do this
    103         # later or load from dicts instead of file
--> 104             self._load_data()
    105
    106     def _get_dataset(self, filename: Optional[Union[str, Path]], dicts: Optional[List[Dict]] = None):

/usr/local/lib/python3.8/dist-packages/haystack/modeling/data_handler/data_silo.py in _load_data(self, train_dicts, dev_dicts, test_dicts)
    163             train_file = self.processor.data_dir / self.processor.train_filename
    164             logger.info("Loading train set from: %s ", train_file)
--> 165             self.data["train"], self.tensor_names = self._get_dataset(train_file)
    166         else:
    167             logger.info("No train set is being loaded")

/usr/local/lib/python3.8/dist-packages/haystack/modeling/data_handler/data_silo.py in _get_dataset(self, filename, dicts)
    123         for i in tqdm(range(0, num_dicts, batch_size), desc="Preprocessing dataset", unit=" Dicts"):
    124             processing_batch = dicts[i : i + batch_size]
--> 125             dataset, tensor_names, problematic_sample_ids = self.processor.dataset_from_dicts(
    126                 dicts=processing_batch, indices=list(range(len(processing_batch)))  # TODO remove indices
    127             )

/usr/local/lib/python3.8/dist-packages/haystack/modeling/data_handler/processor.py in dataset_from_dicts(self, dicts, indices, return_baskets, debug)
    472         # Convert answers from string to token space, skip this step for inference
    473         if not return_baskets:
--> 474             baskets = self._convert_answers(baskets)
    475
    476         # Convert internal representation (nested baskets + samples with mixed types) to pytorch features (arrays of numbers)

/usr/local/lib/python3.8/dist-packages/haystack/modeling/data_handler/processor.py in _convert_answers(self, baskets)
    653                     print(self.max_answers)
    654                     print(i)
--> 655                     label_idxs[i][0] = self.sp_toks_start + question_len_t + self.sp_toks_mid + answer_start_t
    656                     label_idxs[i][1] = self.sp_toks_start + question_len_t + self.sp_toks_mid + answer_end_t
    657                     # If the start or end of the span answer is outside the passage, treat passage as no_answer

IndexError: index 6 is out of bounds for axis 0 with size 6
```
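For intuition, the failure can be reproduced in isolation: the label array is preallocated with `max_answers` rows, so the seventh answer indexes past the end. A minimal sketch, with plain Python lists standing in for the NumPy array Haystack uses:

```python
# Minimal sketch of the failure: label_idxs is preallocated with
# max_answers rows, so answer index 6 overruns a size-6 array.
# (Plain lists stand in for the NumPy array used in _convert_answers.)

max_answers = 6
answers = [(10 + k, 12 + k) for k in range(7)]  # 7 answers to one question

label_idxs = [[0, 0] for _ in range(max_answers)]  # shape (max_answers, 2)

try:
    for i, (start, end) in enumerate(answers):
        label_idxs[i][0] = start  # fails when i == 6
        label_idxs[i][1] = end
except IndexError as exc:
    print(f"IndexError at answer {i}: {exc}")
    # → IndexError at answer 6: list index out of range
```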
Expected behavior
SquadProcessor should be able to work with an arbitrary number of answers to any question.
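One way to get that behavior is to size the label array per question instead of from a fixed constant. A minimal sketch, where `Basket` and `label_rows` are hypothetical stand-ins for Haystack's internal sample baskets, not the real classes:

```python
# Sketch: size the label array per question rather than from a
# hardcoded max_answers. `Basket` is a minimal stand-in for
# Haystack's internal sample baskets, not the real class.

class Basket:
    def __init__(self, answers):
        self.raw = {"answers": answers}

def label_rows(basket, floor=1):
    # At least one row so no-answer questions still get a label slot.
    return max(floor, len(basket.raw["answers"]))

baskets = [Basket([{"text": "a"}] * 7), Basket([])]
print([label_rows(b) for b in baskets])  # → [7, 1]
```

A trade-off: per-question sizing produces ragged label shapes across a batch, so in practice a dataset-wide maximum is easier to feed into fixed-shape tensors.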
Additional context
I think the problem is caused by a hardcoded magic number in the `__init__` of `class SquadProcessor(Processor)`, which sets `max_answers: int = 6`. Perhaps using `len(basket.raw["answers"])` each time could fix that problem.

To Reproduce
Download the PolicyQA dataset from https://github.com/wasiahmad/PolicyQA and launch
FAQ Check
System: