Future-House / paper-qa

High accuracy RAG for answering questions from scientific documents with citations
Apache License 2.0

Document is not recognized as text document : disable document check #548

Open Snikch63200 opened 1 month ago

Snikch63200 commented 1 month ago

Hello,

Sometimes, documents that contain images are not recognized as text documents. For this case, paperqa's error message recommends disabling the document check.

Could not read Auchy les mines (62) - Haisnes-lez-la Bassée.pdf This does not look like a text document: ./Papers/Auchy les mines (62) - Haisnes-lez-la Bassée.pdf. Pass disable_check to ignore this error.

How can I disable document checking to prevent this error?

Best regards.

dosubot[bot] commented 1 month ago

To disable document checking and prevent the error, set the disable_doc_valid_check parameter to True in your configuration. Here's how you can do it:

from paperqa.settings import ParsingSettings

# Create a ParsingSettings instance with document validity check disabled
parsing_settings = ParsingSettings(disable_doc_valid_check=True)

This will bypass the document validity check, allowing documents with images to be processed without raising an error [1].

To continue talking to Dosu, mention @dosu.

DGoettlich commented 1 month ago

@dosu

the error seems to persist even if

parsing_settings = ParsingSettings(disable_doc_valid_check=True)

is passed.

whitead commented 1 month ago

Hi @DGoettlich - you need to pass the setting:

from paperqa import Settings, Docs
settings = Settings()
settings.parsing.disable_doc_valid_check = True

d = Docs()

d.add(...., settings=settings)

dosubot[bot] commented 1 month ago

Ensure that the ParsingSettings instance with disable_doc_valid_check=True is correctly passed to the function or module responsible for processing the document. The aadd function in the Docs class is responsible for document validation and will ignore the check if this setting is properly applied [1]. Double-check that the configuration is being used where needed.
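
As a rough illustration of why creating a ParsingSettings object on its own did not help, here is a minimal sketch. It does not import paper-qa: the ParsingSettings and Settings classes and the add function below are hypothetical stand-ins that only mirror the names used in this thread, and the "looks like text" test is a placeholder assumption. The sketch assumes the check is read from settings.parsing.disable_doc_valid_check on the settings object passed to add, as whitead's snippet suggests.

```python
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical stand-ins, NOT paper-qa's real classes. They only mirror the
# attribute names that appear in this thread.
@dataclass
class ParsingSettings:
    disable_doc_valid_check: bool = False


@dataclass
class Settings:
    parsing: ParsingSettings = field(default_factory=ParsingSettings)


def add(text: str, settings: Optional[Settings] = None) -> str:
    """Toy add(): rejects non-text input unless the check is disabled."""
    settings = settings or Settings()  # defaults apply when nothing is passed
    looks_like_text = len(text.strip()) > 0  # placeholder heuristic (assumed)
    if not looks_like_text and not settings.parsing.disable_doc_valid_check:
        raise ValueError("This does not look like a text document.")
    return "added"


# A ParsingSettings instance that is never attached to the Settings passed
# to add() changes nothing, so the check still runs.
unused = ParsingSettings(disable_doc_valid_check=True)
try:
    add("   ")
    standalone_result = "added"
except ValueError:
    standalone_result = "rejected"

# Attaching it to the Settings object that is actually passed disables the check.
settings = Settings(parsing=ParsingSettings(disable_doc_valid_check=True))
passed_result = add("   ", settings=settings)
```

In this toy model the first call is rejected and the second succeeds, which matches the behavior reported above: the flag only takes effect when it lives on the settings object handed to the call that adds the document.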
