ERROR - haystack.modeling.model.predictions - Invalid end offset >on tutorial notebook

ghost commented 2 years ago

**Describe the issue**

Hello. I'm testing the first tutorial as is, with around 5000 text files; some are 1 page long, some are 15 pages long. When the answer is printed, I get this error: ERROR - haystack.modeling.model.predictions - Invalid end offset:



prediction = pipe.run(
    query="how many people attended the last concert?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
)

Inferencing Samples: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 46/46 [00:38<00:00, 1.20 Batches/s]
ERROR - haystack.modeling.model.predictions - Invalid end offset: (-26524, -26520) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset: (-6105, -6102) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset: (-24692, -24689) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset: (-32411, -32404) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset: (-27332, -27325) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset: (-32379, -32373) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset: (-30646, -30628) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset: (-30307, -30297) with a span answer.


**To Reproduce**
Tutorial 1, with 5000 text files, some are 1 page some are 15 pages long.

**Expected behavior**
With the default Game of Thrones dataset I didn't see this issue. Can you please help me fix this? Many thanks.

**What environment did you try to run the tutorial on?:**
 - OS: Ubuntu 20
 - Firefox

import haystack

haystack.__version__  # '1.11.0rc0'

**Additional context**
I suspect the problem is not specific to the first notebook, and that it is why some unusual content gets printed as the result.

TuanaCelik commented 2 years ago

Hi @alioz1967, I think this might be caused by the docs not being written to your document store properly. Did you skip anything in the Preprocessing of documents section? A quick test would be to run document_store.get_document_count(). Can you try that and report the result? Also, you mentioned text files, but just checking: are they .txt files?
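
A minimal sketch of that quick test, going one step beyond the count by also looking at document lengths. It assumes a Haystack 1.x document store that exposes get_document_count() and get_all_documents(), and documents with a string content attribute. The helper name summarize_store and the two stand-in classes are hypothetical; the stand-ins are only there so the snippet runs without Haystack installed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeDocument:
    """Stand-in for a Haystack Document (hypothetical, for illustration only)."""
    content: str


class FakeDocumentStore:
    """Stand-in exposing the two document store methods the helper relies on."""

    def __init__(self, docs: List[FakeDocument]):
        self._docs = docs

    def get_document_count(self) -> int:
        return len(self._docs)

    def get_all_documents(self) -> List[FakeDocument]:
        return list(self._docs)


def summarize_store(document_store) -> dict:
    """Report how many documents were written and how long they are (in characters)."""
    lengths = [len(doc.content) for doc in document_store.get_all_documents()]
    return {
        "count": document_store.get_document_count(),
        "empty": sum(1 for n in lengths if n == 0),
        "longest": max(lengths, default=0),
    }


# Replace the stand-in store with your own document_store when running this for real.
store = FakeDocumentStore(
    [FakeDocument("short text"), FakeDocument(""), FakeDocument("x" * 40000)]
)
print(summarize_store(store))
```

A count of 0, or many empty documents, would point at the indexing step; a very large "longest" value would at least show that some documents went in unsplit.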

PierrickLozach commented 1 year ago

@alioz1967 did you ever fix this issue? I am facing the same one.

I load a single PDF file. I can see that document_store.get_document_count() returns 1, but when running pipe.run(), I get this error:

ERROR:haystack.modeling.model.predictions:Invalid end offset: (-32259, -32171) with a span answer.

Query: What are you standard practices?
Answers:
[{'answer': ''}]

Here is the code I use in a colab notebook:

from haystack.utils import clean_wiki_text, convert_files_to_docs

# The path to the documents folder.
data_dir = "/content/drive/MyDrive/Colab Notebooks/docfiles"

data_cleaned = convert_files_to_docs(dir_path=data_dir, clean_func=clean_wiki_text, split_paragraphs=True)

# Print every converted document to inspect what was extracted.
for doc in data_cleaned:
    print(doc)

document_store.write_documents(data_cleaned)

print(f"Number of documents in store: {document_store.get_document_count()}")

which returns:

<Document: id=267a43496c3344b395a6ebe9359162a9, content='Genesys Security Measures v100520 Genesys Confidential
Genesys Minimum Security Controls
This sectio...'>
Number of documents in store: 1

I then set up the retriever:

from haystack.nodes.retriever import BM25Retriever
bm25_retriever = BM25Retriever(document_store=document_store)

and the reader:

from haystack.nodes import FARMReader

model_ckpt = "deepset/minilm-uncased-squad2"
max_seq_length, doc_stride = 384, 128
reader = FARMReader(model_name_or_path=model_ckpt, progress_bar=False, max_seq_len=max_seq_length, doc_stride=doc_stride, return_no_answer=True, use_gpu=True)

and ask the question:

from haystack.utils import print_answers

prediction = pipeline_app.run(
    query="What are you standard practices?", params={"Retriever": {"top_k": 1}, "Reader": {"top_k": 1}}
)

# Printing the answer
print_answers(prediction, details="minimum")

The original document contains text about the standard practices I refer to in my question.

Thanks for any help/tips you might be able to share.

jdixosnd commented 1 year ago

We faced this issue with 1.12.2, although the same document previously produced a valid result. Apparently, the latest logic cannot handle long texts in dict and thus returns the invalid end offset error. We have reverted to 1.3.0 and it works as expected again.
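
If long texts are indeed the trigger (a hypothesis from this comment, not something confirmed in the thread), one thing to try without downgrading is to split each text into shorter passages before writing it to the document store. A minimal plain-Python sketch of that idea follows; split_into_passages and its default sizes are hypothetical, and in Haystack itself the PreProcessor node is the usual tool for this.

```python
from typing import List


def split_into_passages(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into passages of at most max_words words, overlapping by `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    passages = []
    for start in range(0, len(words), step):
        passages.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return passages


# Illustrative input: a 450-word text stands in for a long document.
long_text = " ".join(f"word{i}" for i in range(450))
passages = split_into_passages(long_text)
print(len(passages), [len(p.split()) for p in passages])
```

Each passage could then be indexed as its own document, for example as a dict with a "content" key, so that the reader never receives one very long text. Whether this avoids the invalid end offset error is untested here.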

TuanaCelik commented 1 year ago

Hey @PierrickI3 and @jdixosnd - we'd like to investigate this, but we are still having trouble reproducing the issue. If either of you is OK with sharing a file that you're indexing, or anything else that would make it reproducible, we would be grateful.

cc: @ZanSara

PierrickLozach commented 1 year ago

Hey @TuanaCelik, sorry about the late response. I had to move on to another project and I no longer work on this.

deepset-ai / haystack-tutorials

ERROR - haystack.modeling.model.predictions - Invalid end offset >on tutorial notebook #58