deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

PDFToTextConverter: [WinError 2] The system can't find the specified file #1345

Closed by ehsanVIP 1 year ago

ehsanVIP commented 3 years ago

Hi guys, I think what you are doing is very interesting. I am currently struggling with data preprocessing (Tutorial 8). When I pass my own PDF file to PDFToTextConverter, I get the following error:

[WinError 2] The system can't find the specified file

Unfortunately, I have not yet found a specific solution for it. Can you guide me?

tholor commented 3 years ago

Hey @ehsanVIP, do you still face this issue? It seems that the path to your file is not correct. Can you try to access this file in your script without using any Haystack-specific code (e.g. via Python's open(...))?
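The check suggested above can be sketched in plain Python, with no Haystack code involved. This is only a minimal sketch: `read_header` is a hypothetical helper name, and the file path in the usage comment is a placeholder, not a path from this thread.

```python
from pathlib import Path


def read_header(path_str):
    """Open the file with plain Python and return its first five bytes.

    Raises FileNotFoundError if nothing exists at the given path.
    """
    path = Path(path_str)
    if not path.is_file():
        # Show the absolute path so a wrong working directory is easy to spot.
        raise FileNotFoundError(f"No file at {path.resolve()}")
    with open(path, "rb") as f:
        return f.read(5)  # a well-formed PDF starts with b"%PDF-"


# Hypothetical usage (replace the placeholder with your own file):
# print(read_header("my_document.pdf"))
```

If this succeeds and returns `b"%PDF-"`, the path itself is fine and the error is more likely raised somewhere else in the conversion step.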

ehsanVIP commented 3 years ago

Hey @tholor, yes, I still have this issue. I can open my PDF really easily with Python Tika, but I can't do it with Haystack.

yusufsamsum commented 3 years ago

I also have the same issue. I used the Path module as expected and checked the path's validity with the exists() function. I think the main cause of the problem lies in the subprocess.run command. In the pdf.py file, the read_pdf function executes a command with the subprocess module, and when I change the parameter shell=True to False, it manages to find the file, but the behavior changes and the result is not the expected one.
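To separate "the PDF path is wrong" from "the subprocess call itself fails", it can help to run a command outside Haystack and see which part raises. The sketch below is not Haystack's read_pdf; `run_tool` is a hypothetical helper. It relies on one documented behavior of `subprocess.run`: with `shell=False`, a missing executable raises `FileNotFoundError` before any input file is touched (on Windows this surfaces as `[WinError 2]`).

```python
import subprocess


def run_tool(command):
    """Run a command given as a list of arguments, without a shell.

    Returns (found, stdout). found is False when the executable itself
    could not be located, regardless of any file arguments.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            shell=False,
        )
    except FileNotFoundError:
        # The program named in command[0] was not found on the system.
        return False, b""
    return True, result.stdout


# Hypothetical usage: pass the same tool and arguments your converter would run.
# found, output = run_tool(["some-tool", "path/to/file.pdf"])
```

If `found` comes back as `False`, the "file" that cannot be found is the program being launched, not the document passed to it.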

tholor commented 3 years ago

Ok, thanks for the info @yusufsamsum. We will then investigate this Windows-specific issue further. However, it might take us some time, as none of our devs is on Windows and it is always a hassle to reproduce / debug there.

ehsanVIP commented 2 years ago

@tholor Hi, I'm still having this problem. Did you find a solution by any chance?

ZanSara commented 2 years ago

Not yet, sorry. This issue has been stuck in the backlog since... Sorry for that. I will pick it up in the next few days and try to find out what's going on. By the way, does it happen with every file, or just in some specific conditions?

AI-Ahmed commented 2 years ago

I'm trying to find the file in Google Colab, but I have the same issue, too! Neither PDFToTextConverter nor convert_files_to_docs can read the .pdf file!

ZanSara commented 2 years ago

I reproduced this issue and it seems to be caused by a missing dependency, pdftotext. Could you check if it is installed properly on your system? And if not, could you try to install it manually and then run your Haystack code again?

I will now investigate why pdftotext doesn't get installed on Windows. Please let me know if your issue is different.
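One quick way to check whether pdftotext is installed and visible to Python is to look it up on the PATH with the standard library. This is a minimal sketch; `find_executable` is a hypothetical helper name, and it only confirms that an executable with that name can be located, not that it works correctly.

```python
import shutil


def find_executable(name="pdftotext"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    location = find_executable()
    if location is None:
        print("pdftotext was not found on PATH; it may need to be installed.")
    else:
        print(f"pdftotext found at: {location}")
```

If the lookup returns `None`, installing pdftotext and making sure its folder is on the PATH would be the next thing to try before rerunning the Haystack code.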