deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Hardcoded maximum number of answers causes "index out of bounds error" #4320

Closed Serbernari closed 1 year ago

Serbernari commented 1 year ago

Describe the bug It seems that when a question has more than 6 answers, it can (and does) cause an "index out of bounds" error in some scenarios.

Error message

IndexError                                Traceback (most recent call last)

in
----> 1 reader.train(data_dir=data_dir,
      2              train_filename="train.json",
      3              test_filename="test.json",
      4              dev_filename="dev.json",
      5              use_gpu=True,

6 frames

/usr/local/lib/python3.8/dist-packages/haystack/nodes/reader/farm.py in train(self, data_dir, train_filename, dev_filename, test_filename, use_gpu, devices, batch_size, n_epochs, learning_rate, max_seq_len, warmup_proportion, dev_split, evaluate_every, save_dir, num_processes, use_amp, checkpoint_root_dir, checkpoint_every, checkpoints_to_keep, caching, cache_path, grad_acc_steps, early_stopping, max_query_length)
    435         :return: None
    436         """
--> 437         return self._training_procedure(
    438             data_dir=data_dir,
    439             train_filename=train_filename,

/usr/local/lib/python3.8/dist-packages/haystack/nodes/reader/farm.py in _training_procedure(self, data_dir, train_filename, dev_filename, test_filename, use_gpu, devices, batch_size, n_epochs, learning_rate, max_seq_len, warmup_proportion, dev_split, evaluate_every, save_dir, num_processes, use_amp, checkpoint_root_dir, checkpoint_every, checkpoints_to_keep, teacher_model, teacher_batch_size, caching, cache_path, distillation_loss_weight, distillation_loss, temperature, tinybert, processor, grad_acc_steps, early_stopping, distributed, doc_stride, max_query_length)
    269             )
    270         else:  # caching would need too much memory for tinybert distillation so in that case we use the default data silo
--> 271             data_silo = DataSilo(
    272                 processor=processor,
    273                 batch_size=batch_size,

/usr/local/lib/python3.8/dist-packages/haystack/modeling/data_handler/data_silo.py in __init__(self, processor, batch_size, eval_batch_size, distributed, automatic_loading, max_multiprocessing_chunksize, max_processes, multiprocessing_strategy, caching, cache_path)
    102         # In most cases we want to load all data automatically, but in some cases we rather want to do this
    103         # later or load from dicts instead of file
--> 104             self._load_data()
    105
    106     def _get_dataset(self, filename: Optional[Union[str, Path]], dicts: Optional[List[Dict]] = None):

/usr/local/lib/python3.8/dist-packages/haystack/modeling/data_handler/data_silo.py in _load_data(self, train_dicts, dev_dicts, test_dicts)
    163             train_file = self.processor.data_dir / self.processor.train_filename
    164             logger.info("Loading train set from: %s ", train_file)
--> 165             self.data["train"], self.tensor_names = self._get_dataset(train_file)
    166         else:
    167             logger.info("No train set is being loaded")

/usr/local/lib/python3.8/dist-packages/haystack/modeling/data_handler/data_silo.py in _get_dataset(self, filename, dicts)
    123         for i in tqdm(range(0, num_dicts, batch_size), desc="Preprocessing dataset", unit=" Dicts"):
    124             processing_batch = dicts[i : i + batch_size]
--> 125             dataset, tensor_names, problematic_sample_ids = self.processor.dataset_from_dicts(
    126                 dicts=processing_batch, indices=list(range(len(processing_batch)))  # TODO remove indices
    127             )

/usr/local/lib/python3.8/dist-packages/haystack/modeling/data_handler/processor.py in dataset_from_dicts(self, dicts, indices, return_baskets, debug)
    472         # Convert answers from string to token space, skip this step for inference
    473         if not return_baskets:
--> 474             baskets = self._convert_answers(baskets)
    475
    476         # Convert internal representation (nested baskets + samples with mixed types) to pytorch features (arrays of numbers)

/usr/local/lib/python3.8/dist-packages/haystack/modeling/data_handler/processor.py in _convert_answers(self, baskets)
    653                         print(self.max_answers)
    654                         print(i)
--> 655                         label_idxs[i][0] = self.sp_toks_start + question_len_t + self.sp_toks_mid + answer_start_t
    656                         label_idxs[i][1] = self.sp_toks_start + question_len_t + self.sp_toks_mid + answer_end_t
    657                         # If the start or end of the span answer is outside the passage, treat passage as no_answer

IndexError: index 6 is out of bounds for axis 0 with size 6

Expected behavior SquadProcessor should be able to work with an arbitrary number of answers to any question.

Additional context I think the problem is caused by a hardcoded magic number in the __init__ of class SquadProcessor(Processor), where max_answers: int = 6 is set.
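
To illustrate, here is a minimal reconstruction of the failing pattern. The names mirror the traceback above, but the allocation itself is my assumption; only max_answers = 6 and the out-of-bounds index come from the actual error.

import numpy as np

# Hypothetical reconstruction of the failure: one row per allowed answer,
# two columns for the start/end token positions of each answer span.
max_answers = 6
label_idxs = np.zeros((max_answers, 2), dtype=int)

answers = [f"answer {n}" for n in range(7)]  # a question annotated with 7 answers
for i, _ in enumerate(answers):
    label_idxs[i][0] = 0  # start token index (placeholder value)
    label_idxs[i][1] = 0  # end token index (placeholder value)
# iteration i == 6 raises: IndexError: index 6 is out of bounds for axis 0 with size 6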

Perhaps using len(basket.raw["answers"]) each time could fix that problem.
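
A rough sketch of that idea, assuming basket.raw carries the question's answer list as in the traceback; the helper name is hypothetical, not actual Haystack code:

import numpy as np

def allocate_label_idxs(raw: dict) -> np.ndarray:
    """Hypothetical helper: one (start, end) row per annotated answer,
    plus a single no-answer row when the answer list is empty."""
    num_answers = max(len(raw.get("answers", [])), 1)
    return np.full((num_answers, 2), -1, dtype=int)

# allocate_label_idxs({"answers": [{"text": "span"}] * 8}).shape  -> (8, 2)

The rows would still have to be padded to a common width before batching, which is probably why a single cap exists in the first place; deriving that cap from the data rather than hardcoding 6 is the direction the PR mentioned at the end of this thread takes.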

To Reproduce Download the PolicyQA dataset from https://github.com/wasiahmad/PolicyQA and run:

reader = FARMReader(model_name_or_path="ahotrod/electra_large_discriminator_squad2_512", use_gpu=True)
data_dir = "/content/PolicyQA/data"
reader.train(data_dir=data_dir,
             train_filename="train.json",
             test_filename="test.json",
             dev_filename="dev.json",
             use_gpu=True,
             n_epochs=1,
             batch_size=1,
             max_seq_len=512,
             save_dir="my_model",
             )

FAQ Check

System:

Serbernari commented 1 year ago

As a quick fix I changed max_answers: int = 6 to max_answers: int = 10 and it works flawlessly now
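
If you would rather not guess the value, a small standalone check can report the largest answer count in the training file. This assumes the file follows the standard SQuAD JSON layout (data -> paragraphs -> qas -> answers); the path is the one from the reproduction steps above.

import json

# Report the largest number of answers attached to any question in a
# SQuAD-style file, so max_answers can be set just high enough.
with open("/content/PolicyQA/data/train.json") as f:
    squad = json.load(f)

most_answers = max(
    len(qa["answers"])
    for article in squad["data"]
    for paragraph in article["paragraphs"]
    for qa in paragraph["qas"]
)
print(f"largest number of answers per question: {most_answers}")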

bilgeyucel commented 1 year ago

Hey @Serbernari, thank you for reporting this! I can reproduce it as well. By the way, we accept contributions. Would you like to open a PR for your suggested solution? That way, it will be faster for us to fix it. If not, our team will take over this for sure

benheckmann commented 1 year ago

Hey @bilgeyucel! Just opened a PR (#4817) that solves this issue by auto-determining max_answers by default. Would be great if you could take a look!