UW-Madison-DSI / ask-xDD

Retrieval-Augmented Generation (RAG) on 17M full text journal articles.
https://xdd.wisc.edu/
MIT License

[QoL] Remove farm-haystack from system #113

Open JasonLo opened 7 months ago

JasonLo commented 7 months ago

farm-haystack is required for data preprocessing and ingestion, but the package is poorly maintained. For example:

We may want to replace it with a better package.

iross commented 7 months ago

It looks like they have a haystack v2 beta available that would presumably address most or all of these issues. Looking over the docs, it's not clear if it's a straight swap or if there would be more changes involved.

Are there other comparable pipelining toolsets that could be an appropriate alternative?

JasonLo commented 7 months ago

Off the top of my head: nltk, gensim, and spaCy? Not sure which one is best. Let's take a look together and decide. This is perhaps somewhat related to the encoding problem in Elastic too...
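
To help scope the comparison: if the main thing needed from farm-haystack turns out to be splitting documents into passages before ingestion (an assumption; the thread does not say which features are actually used), that step may be small enough to write without any dependency. The sketch below is only an illustration. `split_into_chunks`, `max_words`, and `overlap` are hypothetical names, not part of ask-xDD or farm-haystack.

```python
from typing import List


def split_into_chunks(text: str, max_words: int = 100, overlap: int = 20) -> List[str]:
    """Split text into word-window chunks that overlap by `overlap` words.

    Hypothetical helper for illustration; not the project's actual preprocessing.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")

    # str.split() with no arguments also collapses runs of whitespace.
    words = text.split()
    step = max_words - overlap

    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this window has reached the end of the text,
        # so no trailing chunk made only of overlap words is emitted.
        if start + max_words >= len(words):
            break
    return chunks
```

If sentence-aware splitting is needed, a tokenizer from one of the libraries listed above could replace the plain `text.split()` call while the windowing logic stays the same.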