deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Hardcoded maximum number of answers causes "index out of bounds error" #4320

Closed Serbernari closed 1 year ago

Serbernari commented 1 year ago

Describe the bug It seems that when a question has more than 6 answers, it can (and does) cause an "index out of bounds" error in some scenarios.

Error message

IndexError                                Traceback (most recent call last)

in
----> 1 reader.train(data_dir=data_dir,
      2              train_filename="train.json",
      3              test_filename="test.json",
      4              dev_filename="dev.json",
      5              use_gpu=True,

6 frames

/usr/local/lib/python3.8/dist-packages/haystack/nodes/reader/farm.py in train(self, data_dir, train_filename, dev_filename, test_filename, use_gpu, devices, batch_size, n_epochs, learning_rate, max_seq_len, warmup_proportion, dev_split, evaluate_every, save_dir, num_processes, use_amp, checkpoint_root_dir, checkpoint_every, checkpoints_to_keep, caching, cache_path, grad_acc_steps, early_stopping, max_query_length)
    435         :return: None
    436         """
--> 437         return self._training_procedure(
    438             data_dir=data_dir,
    439             train_filename=train_filename,

/usr/local/lib/python3.8/dist-packages/haystack/nodes/reader/farm.py in _training_procedure(self, data_dir, train_filename, dev_filename, test_filename, use_gpu, devices, batch_size, n_epochs, learning_rate, max_seq_len, warmup_proportion, dev_split, evaluate_every, save_dir, num_processes, use_amp, checkpoint_root_dir, checkpoint_every, checkpoints_to_keep, teacher_model, teacher_batch_size, caching, cache_path, distillation_loss_weight, distillation_loss, temperature, tinybert, processor, grad_acc_steps, early_stopping, distributed, doc_stride, max_query_length)
    269             )
    270         else:  # caching would need too much memory for tinybert distillation so in that case we use the default data silo
--> 271             data_silo = DataSilo(
    272                 processor=processor,
    273                 batch_size=batch_size,

/usr/local/lib/python3.8/dist-packages/haystack/modeling/data_handler/data_silo.py in __init__(self, processor, batch_size, eval_batch_size, distributed, automatic_loading, max_multiprocessing_chunksize, max_processes, multiprocessing_strategy, caching, cache_path)
    102         # In most cases we want to load all data automatically, but in some cases we rather want to do this
    103         # later or load from dicts instead of file
--> 104             self._load_data()
    105
    106     def _get_dataset(self, filename: Optional[Union[str, Path]], dicts: Optional[List[Dict]] = None):

/usr/local/lib/python3.8/dist-packages/haystack/modeling/data_handler/data_silo.py in _load_data(self, train_dicts, dev_dicts, test_dicts)
    163             train_file = self.processor.data_dir / self.processor.train_filename
    164             logger.info("Loading train set from: %s ", train_file)
--> 165             self.data["train"], self.tensor_names = self._get_dataset(train_file)
    166         else:
    167             logger.info("No train set is being loaded")

/usr/local/lib/python3.8/dist-packages/haystack/modeling/data_handler/data_silo.py in _get_dataset(self, filename, dicts)
    123         for i in tqdm(range(0, num_dicts, batch_size), desc="Preprocessing dataset", unit=" Dicts"):
    124             processing_batch = dicts[i : i + batch_size]
--> 125             dataset, tensor_names, problematic_sample_ids = self.processor.dataset_from_dicts(
    126                 dicts=processing_batch, indices=list(range(len(processing_batch)))  # TODO remove indices
    127             )

/usr/local/lib/python3.8/dist-packages/haystack/modeling/data_handler/processor.py in dataset_from_dicts(self, dicts, indices, return_baskets, debug)
    472         # Convert answers from string to token space, skip this step for inference
    473         if not return_baskets:
--> 474             baskets = self._convert_answers(baskets)
    475
    476         # Convert internal representation (nested baskets + samples with mixed types) to pytorch features (arrays of numbers)

/usr/local/lib/python3.8/dist-packages/haystack/modeling/data_handler/processor.py in _convert_answers(self, baskets)
    653                         print(self.max_answers)
    654                         print(i)
--> 655                         label_idxs[i][0] = self.sp_toks_start + question_len_t + self.sp_toks_mid + answer_start_t
    656                         label_idxs[i][1] = self.sp_toks_start + question_len_t + self.sp_toks_mid + answer_end_t
    657                         # If the start or end of the span answer is outside the passage, treat passage as no_answer

IndexError: index 6 is out of bounds for axis 0 with size 6

Expected behavior SquadProcessor should be able to work with an arbitrary number of answers to any question.

Additional context I think the problem is caused by a hardcoded magic number in the __init__ of class SquadProcessor(Processor), where max_answers: int = 6 is set.
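
To illustrate, here is a minimal reconstruction of the failing pattern. The names mirror the traceback above, but the allocation itself is my assumption; only max_answers = 6 and the out-of-bounds index come from the actual error.

import numpy as np

# Hypothetical reconstruction of the failure: one row per allowed answer,
# two columns for the start/end token positions of each answer span.
max_answers = 6
label_idxs = np.zeros((max_answers, 2), dtype=int)

answers = [f"answer {n}" for n in range(7)]  # a question annotated with 7 answers
for i, _ in enumerate(answers):
    label_idxs[i][0] = 0  # start token index (placeholder value)
    label_idxs[i][1] = 0  # end token index (placeholder value)
# iteration i == 6 raises: IndexError: index 6 is out of bounds for axis 0 with size 6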

Perhaps using len(basket.raw["answers"]) each time could fix that problem.
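
A rough sketch of that idea, assuming basket.raw carries the question's answer list as in the traceback; the helper name is hypothetical, not actual Haystack code:

import numpy as np

def allocate_label_idxs(raw: dict) -> np.ndarray:
    """Hypothetical helper: one (start, end) row per annotated answer,
    plus a single no-answer row when the answer list is empty."""
    num_answers = max(len(raw.get("answers", [])), 1)
    return np.full((num_answers, 2), -1, dtype=int)

# allocate_label_idxs({"answers": [{"text": "span"}] * 8}).shape  -> (8, 2)

The rows would still have to be padded to a common width before batching, which is probably why a single cap exists in the first place; deriving that cap from the data rather than hardcoding 6 is the direction the PR mentioned at the end of this thread takes.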

To Reproduce Download the PolicyQA dataset from https://github.com/wasiahmad/PolicyQA and run:

reader = FARMReader(model_name_or_path="ahotrod/electra_large_discriminator_squad2_512", use_gpu=True)
data_dir = "/content/PolicyQA/data"
reader.train(data_dir=data_dir,
             train_filename="train.json",
             test_filename="test.json",
             dev_filename="dev.json",
             use_gpu=True,
             n_epochs=1,
             batch_size=1,
             max_seq_len=512,
             save_dir="my_model",
             )

FAQ Check

System:

Serbernari commented 1 year ago

As a quick fix I changed max_answers: int = 6 to max_answers: int = 10 and it works flawlessly now
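
If you would rather not guess the value, a small standalone check can report the largest answer count in the training file. This assumes the file follows the standard SQuAD JSON layout (data -> paragraphs -> qas -> answers); the path is the one from the reproduction steps above.

import json

# Report the largest number of answers attached to any question in a
# SQuAD-style file, so max_answers can be set just high enough.
with open("/content/PolicyQA/data/train.json") as f:
    squad = json.load(f)

most_answers = max(
    len(qa["answers"])
    for article in squad["data"]
    for paragraph in article["paragraphs"]
    for qa in paragraph["qas"]
)
print(f"largest number of answers per question: {most_answers}")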

bilgeyucel commented 1 year ago

Hey @Serbernari, thank you for reporting this! I can reproduce it as well. By the way, we accept contributions. Would you like to open a PR for your suggested solution? That way, it will be faster for us to fix it. If not, our team will take over this for sure

benheckmann commented 1 year ago

Hey @bilgeyucel! Just opened a PR (#4817) that solves this issue by auto-determining max_answers by default. Would be great if you could take a look!