Closed nickchomey closed 2 years ago
@nickchomey It's better to keep discussions that are not intrinsic to the code changes out of the PR, and to have them on the issue itself instead.
I have managed to update some dependencies and remove some pins to avoid issues like this one. However, Tika 2 has introduced many changes, which impact the client module we use: tika-python.
Currently, we need to wait for the developer to update the module to support Tika 2.x. There is one issue on the module repo: https://github.com/chrismattmann/tika-python/issues/359
If you want to help with this, please vote on that issue so the developer updates the module.
Thanks for the clarification. I've followed up on that issue to see if they will add 2.x compatibility, and also whether the project is just abandoned.
If it is abandoned, should we continue to use it? Especially when Tika itself plans to deprecate 1.x on Sept 30, 2022 (25 days from now)? https://tika.apache.org/
Perhaps this isn't a bug, per se, but there seem to be various dependencies that are pinned to outdated versions.

For example, throughout the codebase you seem to be using Tika `1.24.1`, which was released over 2 years ago. The current version is 2.4.1. At the very least it could be `apache/tika:1.28.4` (if there's a conflict with v2), or probably even `apache/tika:latest`?

Likewise, in pyproject.toml you seem to be using:

- `pytesseract v0.3.7`, which is nearly two years old as well. `0.3.10` is the latest version. https://github.com/madmaze/pytesseract/releases Though, it does appear that you install the most recent version of tesseract, so perhaps this is less important than Tika.
- `pdf2image==1.14.0`, which is 2 years old. Though the most recent `1.16.0` is over 1 year old and the project seems to be largely abandoned. https://github.com/Belval/pdf2image/releases
- `pydoc-markdown==4.5.1`, which is only 6 months old. Though it literally has a comment `FIXME Unpin!` https://pypi.org/project/pydoc-markdown/#history

There are various other pinned dependencies, though they seem more recent and therefore more likely to be deliberate. But perhaps you'd like to make them `>= x, <y` as you have for most other packages?

There are also other Docker images that are outdated.
Of course, I'm new to Haystack and don't know the codebase at all, so please disregard any of this if it is a non-issue. But I hope this is helpful in some way!
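
To make the `>= x, <y` suggestion concrete, here is a rough Python sketch of how an exact pin and a range treat a newer release differently. The helper names `parse` and `satisfies` are hypothetical, the `<2` upper bound is only an assumption for illustration (not a claim about which pdf2image versions are actually compatible), and the comparison is a simplification of PEP 440 that only handles plain dotted numeric versions.

```python
# Simplified illustration only: real tools follow PEP 440; this sketch
# handles plain dotted numeric versions such as "1.16.0".

def parse(version: str) -> tuple[int, ...]:
    """Turn '1.16.0' into (1, 16, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version: str, spec: str) -> bool:
    """Check a version against comma-separated clauses like '>=1.14.0,<2'."""
    ops = {
        "==": lambda a, b: a == b,
        ">=": lambda a, b: a >= b,
        "<=": lambda a, b: a <= b,
        ">": lambda a, b: a > b,
        "<": lambda a, b: a < b,
    }
    candidate = parse(version)
    for clause in spec.split(","):
        clause = clause.strip()
        # Try two-character operators first so '>=' is not read as '>'.
        for op in ("==", ">=", "<=", ">", "<"):
            if clause.startswith(op):
                bound = parse(clause[len(op):].strip())
                # Pad to equal length so (2,) compares like (2, 0, 0).
                width = max(len(candidate), len(bound))
                a = candidate + (0,) * (width - len(candidate))
                b = bound + (0,) * (width - len(bound))
                if not ops[op](a, b):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


# The exact pin rejects the newer release; the range accepts it
# while still excluding a hypothetical next major version.
print(satisfies("1.16.0", "==1.14.0"))
print(satisfies("1.16.0", ">=1.14.0,<2"))
print(satisfies("2.0.0", ">=1.14.0,<2"))
```

With a range like this, a newer minor release would be picked up without editing the pin, while a new major version would still require a deliberate change.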