deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0
16.66k stars 1.83k forks source link

Upgrade Haystack 1.x to NLTK 3.9 #8238

Closed julian-risch closed 1 week ago

julian-risch commented 3 weeks ago

In Haystack 1.26.x we should replace the nltk.download("punkt") with nltk.download('punkt_tab') here https://github.com/deepset-ai/haystack/blob/883cd466bd0108ff4f6af4c389f0e42fabc1282c/haystack/nodes/preprocessor/preprocessor.py#L123 so that users can use Haystack 1.26.x with NLTK 3.9. Prior NLTK versions are affected by https://nvd.nist.gov/vuln/detail/CVE-2024-39705. We should therefore also pin NLTK to >=3.9.

While the NLTK release notes list 3.8.2 https://pypi.org/project/nltk/#history with the fix, that release disappeared from pypi. https://pypi.org/project/nltk/#history There is a comment on GitHub saying that the release was deleted and there will be a 3.9 https://github.com/nltk/nltk/issues/3301#issuecomment-2290843549

sagarneeldubey commented 3 weeks ago

Our pre-processing pipeline broke since the nltk update because of the following issues:

  1. Unpickling restricted to simple types to avoid CVE issues (the reason we have to upgrade from 3.8.1 asap). So use "PunktTokenizer" to safely load the nltk model rather than using "nltk.data.load".
  2. "punkt" package contains unsafe pickles and is replaced by "punkt_tab" [https://github.com/nltk/nltk/issues/3293]

We had to create a custom preprocessor component to fix the above issues. Happy to contribute to a fix if needed.

vblagoje commented 1 week ago

Closing as https://github.com/deepset-ai/haystack/pull/8256 has been integrated on 1.26.x branch and a new 1.26.3 release has been released with this fix.

sagarneeldubey commented 1 week ago

We upgraded farm-haystack to 1.26.3 today and we confirm that the preprocessor is working fine with nltk 3.9.1 , so we don't need the custom preprocessor anymore. Thanks a lot for your prompt response to this issue!

vblagoje commented 1 week ago

Awesome, thanks for reporting back @sagarneeldubey much appreciated 🚀