deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Pinned outdated dependencies #3139

Closed: nickchomey closed this 2 years ago

nickchomey commented 2 years ago

Perhaps this isn't a bug, per se, but there seem to be various dependencies that are pinned to outdated versions.

For example, throughout the codebase you seem to be using Tika 1.24.1, which was released over 2 years ago. The current version is 2.4.1. At the very least it could be apache/tika:1.28.4 (if there's a conflict with v2), or probably even apache/tika:latest?


Likewise, in pyproject.toml you seem to be using pytesseract v0.3.7, which is nearly two years old as well. 0.3.10 is the latest version. https://github.com/madmaze/pytesseract/releases Though, it does appear that you install the most recent version of tesseract, so perhaps this is less important than Tika.

pdf2image==1.14.0 is 2 years old, though the most recent release, 1.16.0, is itself over 1 year old and the project seems to be largely abandoned. https://github.com/Belval/pdf2image/releases

pydoc-markdown==4.5.1 is only 6 months old, though it literally has a comment saying FIXME Unpin! https://pypi.org/project/pydoc-markdown/#history

There are various other pinned dependencies, though they seem more recent and therefore more likely to be deliberate. But perhaps you'd like to change them to ranges of the form >=x,<y, as you have for most other packages?

There are also other Docker images that are outdated.
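To make the pin-versus-range idea concrete, here is a minimal sketch in Python that flags exact `==` pins in a list of requirement strings and proposes a looser range as a starting point. The three pinned versions come from this issue; the `requests` entry, the helper names `find_exact_pins` and `suggest_range`, and the "next major version" upper bound are assumptions made purely for illustration, not how Haystack manages its dependencies.

```python
import re

# Requirement strings for illustration. The three pins are the ones named in
# this issue; the "requests" line is an assumed example of a ranged specifier.
REQUIREMENTS = [
    "pytesseract==0.3.7",
    "pdf2image==1.14.0",
    "pydoc-markdown==4.5.1",
    "requests>=2.0,<3",
]

# Matches "name==version" with nothing else after it (no extras or markers).
PIN_PATTERN = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*==\s*([^\s;,]+)\s*$")


def find_exact_pins(requirements):
    """Return (name, version) pairs for requirements pinned with '=='."""
    pins = []
    for req in requirements:
        match = PIN_PATTERN.match(req)
        if match:
            pins.append((match.group(1), match.group(2)))
    return pins


def suggest_range(version):
    """Suggest '>=version,<next_major' as a rough, hypothetical starting point."""
    major = int(version.split(".")[0])
    return f">={version},<{major + 1}"


if __name__ == "__main__":
    for name, version in find_exact_pins(REQUIREMENTS):
        print(f"{name}: pinned to {version}, could try {name}{suggest_range(version)}")
```

Any suggested range would still need to be checked against the actual test suite, since an upper bound at the next major version is only a convention and not a compatibility guarantee.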

Of course, I'm new to Haystack and don't know the codebase at all, so please disregard any of this if it is a non-issue. But I hope this is helpful in some way!

danielbichuetti commented 2 years ago

@nickchomey It's better to keep discussions that are not about specific code changes out of the PR. Those are better filed on the issue itself.

I have managed to implement some dependency updates and remove some pins to avoid issues like this one. However, Tika 2 introduced many changes, which affect the client module used: tika-python.

Currently, we need to wait for the developer to update the module to support Tika 2.x. There is one issue on the module repo: https://github.com/chrismattmann/tika-python/issues/359

If you want to help with this, please vote on that issue to ask the developer to update the module.

nickchomey commented 2 years ago

Thanks for the clarification. I've followed up on that issue to see if they will add 2.x compatibility, and also whether the project is just abandoned.

If it is abandoned, should we continue to use it? Especially when Tika itself plans to deprecate 1.x on Sept 30, 2022 (25 days from now)? https://tika.apache.org/