Closed ghost closed 4 weeks ago
Hi @alioz1967, I think this might be an issue with the docs not being written to your document store properly. Did you skip anything in the Preprocessing of documents section? A quick test would be to run document_store.get_document_count()
Can you try that and confirm? Also, you mentioned text files, but just checking: are they .txt files?
@alioz1967 did you ever fix this issue? I am facing the same one.
I load a single PDF file. I can see that document_store.get_document_count()
returns 1, but when running pipe.run(), I get this error:
ERROR:haystack.modeling.model.predictions:Invalid end offset:
(-32259, -32171) with a span answer.
Query: What are you standard practices?
Answers:
[{'answer': ''}]
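For what it's worth, the offsets in the log look like the real clue: a valid span answer has to satisfy 0 <= start <= end <= len(document text), and (-32259, -32171) fails on both counts. A minimal sketch of that invariant in plain Python (this is an illustration, not Haystack's actual validation code):

```python
def is_valid_span(start: int, end: int, text_len: int) -> bool:
    """A span answer is only meaningful if it lies inside the document text."""
    return 0 <= start <= end <= text_len

# The offsets from the error log fall far outside any document:
print(is_valid_span(-32259, -32171, 50000))  # False: negative offsets are invalid
print(is_valid_span(10, 42, 50000))          # True: a normal in-range span
```

So whatever the reader predicts, the character offsets it maps the answer back to are negative, which is why the answer comes out empty.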
Here is the code I use in a colab notebook:
from haystack.utils import clean_wiki_text, convert_files_to_docs

# The path to the documents folder.
data_dir = "/content/drive/MyDrive/Colab Notebooks/docfiles"
data_cleaned = convert_files_to_docs(dir_path=data_dir, clean_func=clean_wiki_text, split_paragraphs=True)
for x in range(len(data_cleaned)):
    print(data_cleaned[x])
document_store.write_documents(data_cleaned)
print(f"Number of documents in store: {document_store.get_document_count()}")
which returns:
<Document: id=267a43496c3344b395a6ebe9359162a9, content='Genesys Security Measures v100520 Genesys Confidential
Genesys Minimum Security Controls
This sectio...'>
Number of documents in store: 1
I then set up the retriever:
from haystack.nodes.retriever import BM25Retriever
bm25_retriever = BM25Retriever(document_store=document_store)
and the reader:
from haystack.nodes import FARMReader
model_ckpt = "deepset/minilm-uncased-squad2"
max_seq_length, doc_stride = 384, 128
reader = FARMReader(model_name_or_path=model_ckpt, progress_bar=False, max_seq_len=max_seq_length, doc_stride=doc_stride, return_no_answer=True, use_gpu=True)
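For completeness, since pipeline_app itself isn't shown above: I'm assuming it was wired up as the stock extractive QA pipeline from the reader and retriever defined earlier, roughly like this sketch (in case that matters for reproducing):

```python
from haystack.pipelines import ExtractiveQAPipeline

# Assumption: pipeline_app is a standard ExtractiveQAPipeline
# built from the reader and retriever created above.
pipeline_app = ExtractiveQAPipeline(reader=reader, retriever=bm25_retriever)
```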
and ask the question:
from haystack.utils import print_answers
prediction = pipeline_app.run(
    query="What are you standard practices?", params={"Retriever": {"top_k": 1}, "Reader": {"top_k": 1}}
)
# Printing the answer
print_answers(prediction, details="minimum")
The original document contains text about the standard practices
I refer to in my question.
Thanks for any help/tips you might be able to share.
We faced this issue with 1.12.2; the same document had produced a valid result on earlier versions. Apparently, the latest logic cannot handle long texts in a dict and thus returns the invalid end offset error.
We have reverted to 1.3.0 and it has started to work as expected.
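In case anyone else needs the same workaround: pinning the older release in a Colab cell is enough. The version number is the one from this thread, and farm-haystack is the PyPI package name for Haystack 1.x:

```shell
# Revert to the last release that handled these documents correctly for us
pip install farm-haystack==1.3.0
```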
Hey @PierrickI3 and @jdixosnd - we'd like to investigate this, but we are still having trouble reproducing the issue. If either of you is OK with sharing a file you're indexing, or anything else that would make it reproducible, we would be grateful.
cc: @ZanSara
Hey @TuanaCelik, sorry about the late response. I had to move on to another project and I no longer work on this.
Describe the issue
Hello. I'm testing the first tutorial as-is with around 5,000 text files; some are 1 page, some are 15 pages long. When the answer is being printed, I get this error: ERROR - haystack.modeling.model.predictions - Invalid end offset:
Additional context: I suspect the problem is not specific to the first notebook, which is why some unusual content gets printed as the result.