fix: 1.x - nltk upgrade, use `nltk.download('punkt_tab')`

deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.

https://haystack.deepset.ai

Apache License 2.0


fix: 1.x - nltk upgrade, use `nltk.download('punkt_tab')` #8256

Closed: vblagoje closed this 2 months ago

vblagoje commented 2 months ago

We needed to update a few more dependencies to get a green CI.
We needed to skip the nltk preprocessing tests that load pickled models (loading them seems to be forbidden in nltk 3.9).
Fixes https://github.com/deepset-ai/haystack/issues/8238
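
For context, the resource switch named in the title could look roughly like the sketch below. It is not code from this PR: the helper name `ensure_punkt_tab` is hypothetical, and the lazy import is only there so the snippet loads in environments without nltk. It relies on `nltk.data.find` raising `LookupError` when a resource is missing.

```python
def ensure_punkt_tab() -> bool:
    """Make sure the `punkt_tab` tokenizer data is available locally.

    Hypothetical helper for illustration only.
    Returns True if a download was triggered, False if the data was already present.
    """
    import nltk  # imported lazily so this module can be loaded without nltk installed

    try:
        # nltk.data.find raises LookupError when the resource is not installed
        nltk.data.find("tokenizers/punkt_tab")
        return False
    except LookupError:
        # nltk 3.9 uses the `punkt_tab` resource in place of the pickled `punkt` models
        nltk.download("punkt_tab")
        return True
```

Calling `ensure_punkt_tab()` once before sentence tokenization should be enough; repeated calls only perform the local lookup.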

vblagoje commented 2 months ago

I've managed to get the CI to pass. Note the dependency changes: it couldn't be done without them, and we need to pin a few more dependencies, which is OK.

The nltk tests that were failing are caused by the inability to load old models from pickle files, which I think is now forbidden in nltk 3.9.x.

I'll upgrade this PR draft into a PR

anakin87 commented 2 months ago

I would make this limitation a bit more evident.

if we don't want to suppress the parameter tokenizer_model_folder, we can log a clear warning.
let's also add an upgrade entry in the release note.

vblagoje commented 2 months ago

> I would make this limitation a bit more evident.
>
> if we don't want to suppress the parameter tokenizer_model_folder, we can log a clear warning.
>
> let's also add an upgrade entry in the release note.

I opted to always set tokenizer_model_folder to None and log a warning that includes the resolution path. This way we don't have to touch much of the codebase and risk unintended consequences. LMK if you have a better proposal @julian-risch @anakin87
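
A minimal sketch of that approach, assuming a simplified constructor: the class name `SketchPreProcessor` and the warning wording are hypothetical and are not taken from the actual diff. Only the parameter name `tokenizer_model_folder` comes from the discussion.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


class SketchPreProcessor:
    """Hypothetical stand-in for illustration, not Haystack's actual PreProcessor."""

    def __init__(self, tokenizer_model_folder: Optional[str] = None):
        if tokenizer_model_folder is not None:
            # Warn loudly instead of failing, so existing callers keep working.
            # The message text is an assumption; the real one would also
            # describe how to resolve the situation.
            logger.warning(
                "Ignoring tokenizer_model_folder=%r: custom pickled tokenizer "
                "models cannot be loaded with nltk 3.9.x, so the default "
                "tokenizer will be used.",
                tokenizer_model_folder,
            )
        # Always reset the value so downstream code never tries to load a pickle.
        self.tokenizer_model_folder = None
```

Keeping the parameter in the signature avoids breaking existing callers, while the warning makes the limitation visible.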

vblagoje commented 2 months ago

> OK for me.
>
> I would prefer that @julian-risch also take a look.

Makes sense 🙏