deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Please outsource your efforts to spaCy, the industry-standard for NLP pre-processing #3140

Closed · nickchomey closed this issue 8 months ago

nickchomey commented 2 years ago

EDIT: It's clear from the feedback that I was very premature and wrong to say that Haystack isn't/can't be performant. Moreover, I appreciate the feedback that spaCy probably wouldn't move the needle THAT much given that the inference mechanisms themselves are probably the bottleneck. But, I do think there's still lots of potential for spaCy to be integrated in a meaningful way. I'm new to all of this, but once I get a better appreciation for how Haystack works, I will see what I can do to create a custom node for spaCy - hopefully others will be able to provide feedback and improvements for it.


First off, Haystack is phenomenal - it is precisely what I was yearning for while spinning my wheels trying to build some sort of Semantic Search pipeline for Elasticsearch over the past couple weeks. By integrating NLP with Elasticsearch in an accessible way, it should really open up the possibilities for non-search engineers to implement semantic search in their applications. I was blown away that I could get your demo site up and running in under a minute using Docker.

However, I was then enormously underwhelmed when searches took nearly 10 seconds on what I consider to be reasonably capable hardware. I'm sure there's room for tweaking the models and probably even my system configuration, but could it go from 10 seconds to 30-100ms? Probably not... Your own demo site takes 3+ seconds for most searches, and your benchmarks on the very capable AWS p3.2xlarge instance seem to confirm that it is not a particularly performant, and therefore practical, experience at the moment.

I won't pretend to have a clue about how to optimize any of this and am extremely grateful that Haystack even exists, but one oversight that seems quite obvious to me is that you are not making use of spaCy in your pre-processing pipeline.


As I wrote previously in another issue, spaCy and Haystack seem to have almost exactly the same ethos and focus - to consolidate state-of-the-art (academic) techniques and tools into one high-performance package that is very accessible to practitioners. spaCy has become the de facto industry standard for NLP pre-processing because it replaces a patchwork of confusing tools (NLTK etc...) with an easy-to-use, fully integrated toolkit with a vast feature set. Moreover, it is considerably more performant thanks to its Cython architecture.

Here is a great article that goes into detail about Python vs Cython for NLP, showing 5.5x better speed for spaCy vs NLTK (all while performing incomparably more tasks). And that was written 2.5 years ago, when v2.2.3 was current - spaCy has since gone through 2.3.5 and is now at 3.4.1.

And beyond top-notch pre-processing capabilities, spaCy offers immense multilingual support and, since v3.0, also has an entire transformer mechanism that integrates with Hugging Face models.
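For example, loading a transformer-backed pipeline is a one-liner (assuming the `spacy[transformers]` extra and the `en_core_web_trf` model are installed):

```python
import spacy

# Requires: pip install "spacy[transformers]"
#           python -m spacy download en_core_web_trf
nlp = spacy.load("en_core_web_trf")  # RoBERTa-backed English pipeline

doc = nlp("Haystack is developed by deepset in Berlin.")
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('deepset', 'ORG'), ('Berlin', 'GPE')]
```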

So, it seems quite clear to me that Haystack would be very wise to outsource as much of its pre-processing efforts as possible to the industry-standard spaCy toolkit, so that you can focus on what you do best - integrating NLP into a search pipeline/application.


I think you could start with a review of your various Nodes and see if there is an analogue in spaCy. Quite clearly, your use of NLTK in the PreProcessor should be switched to spaCy, if spaCy doesn't end up replacing the PreProcessor Node entirely. But you could probably also use its Transformers pipeline with/in lieu of your Document Classifier, Entity Extraction and probably other Nodes. For instance, the sentence splitting that the PreProcessor currently delegates to NLTK could look like the sketch below.
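Here is a minimal sketch of spaCy-based sentence splitting; the blank pipeline with just the rule-based `sentencizer` is one deliberately lightweight option, and a full model like `en_core_web_sm` would add POS tags, entities, and so on:

```python
from typing import List

import spacy

# A blank pipeline with only the rule-based sentencizer is very fast;
# swap in a full model (e.g. "en_core_web_sm") for richer annotations.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

def split_sentences(text: str) -> List[str]:
    """Segment raw text into sentences with spaCy instead of NLTK."""
    return [sent.text.strip() for sent in nlp(text).sents]

print(split_sentences("Haystack builds search pipelines. spaCy handles the NLP preprocessing."))
```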

Likewise, I think you could lean more heavily on Tika - you've got all sorts of FileConverters, but surely Tika could be used for most of this (and more)? It classifies file types (making your File Classifier redundant), has language detection (as does spaCy) and even has a native Tesseract integration, which should allow you to drop the pytesseract dependency and just let Tika do all the OCR work in Java. Also, the version of Tika that you are using is over two years out of date, as I described in this issue.
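In fact, Haystack already ships a Tika-based converter, so a sketch of leaning on it for arbitrary file types might look like this (the exact return type has shifted between Haystack versions, so treat this as an approximation of the 1.x API):

```python
from pathlib import Path

from haystack.nodes import TikaConverter

# Assumes a Tika server is running locally, e.g.:
#   docker run -p 9998:9998 apache/tika
converter = TikaConverter(tika_url="http://localhost:9998/tika")

# Tika detects the file type itself, so one converter can cover PDFs,
# DOCX, HTML, etc. instead of needing one FileConverter per format.
docs = converter.convert(file_path=Path("sample.pdf"), meta=None)
print(docs[0].content[:200])  # Document objects in recent 1.x releases
```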


I really hope that this is helpful - I'm amazed with Haystack and simply want to help you guys make the best use of your time and the ecosystem's tools to help us bring better, more relevant search to our meaningful real-world applications.

Thanks!

nickchomey commented 2 years ago

Also, this defunct repo, Hello NLP, might be useful for reference.

It was created by Max Irwin, who seems to be a prominent figure in this space - he's currently writing this book on AI Powered Search, and created this service/tool for Semantic Search http://max.io/

He introduced it a couple years ago in this video, where he explains the various tools that he chose to address certain problems, spaCy being a prominent one.

I was disappointed to find that the project never really took off, but Haystack (which was only at 0.4.0 at the time) seems to be the ideal successor to it!


danielbichuetti commented 2 years ago

I can understand some of your points, but I disagree completely that Haystack is unable to deliver performant search.

Firstly, I would like to say that we are migrating our own internal code to Haystack because of its quality. Regarding the long search times, there are a few points. Despite being easy to use and easy to deploy in a basic setup, you shouldn't expect a default database setup and Haystack's default settings to deliver millisecond searches in every case. The database is the real bottleneck for most deployments, followed by the ability to run enormous models easily. For real-time inference using SOTA models, you would need a tuned inference server. On AWS, there are even specialized instances for that.

So, if you have a general database deployment (dense or sparse) and a low-profile instance, you will have slow responses. High quality, but not as fast as you desire.

And we come to another point: how you set up your pipeline. Remember, sparse retrieval is fast, dense is slow (except when using specialized vector databases). If you put a Reader after hundreds of retrieved documents so it can extract the best answer, you are basically running a powerful ML model on numerous documents for that single request.
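To make that concrete, here is a sketch using the Haystack 1.x API; capping the Retriever's `top_k` is usually the single biggest latency lever (the toy document store and the model choice are just for illustration):

```python
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import FARMReader, TfidfRetriever
from haystack.pipelines import ExtractiveQAPipeline

document_store = InMemoryDocumentStore()
document_store.write_documents([
    {"content": "spaCy was created by Matthew Honnibal and Ines Montani."},
])

retriever = TfidfRetriever(document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/minilm-uncased-squad2")
pipe = ExtractiveQAPipeline(reader=reader, retriever=retriever)

# The Reader runs a transformer over every retrieved document,
# so the Retriever's top_k dominates end-to-end latency.
result = pipe.run(
    query="Who created spaCy?",
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 3}},
)
print(result["answers"][0].answer)
```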

Haystack is so powerful that we are using it in a mix of serverless, microservices and giant instances. It's a question of splitting the nodes around for best performance. I just looked at X-Ray for our AWS Lambda function running the retrieval nodes, and we get responses in around 150ms. The request is then forwarded to one of our big instances, which takes about double that time. Our average is 450ms. But we are still making changes, like implementing Ray to boost performance.

nickchomey commented 2 years ago

@danielbichuetti Thanks very much for the insights! It's great to hear that you're having such success with it! To clarify, I have no doubt that Haystack is very powerful and that I am barely scratching the surface.

Though, I do think my point about the demo's poor performance is valid - if not on my own hardware then at least on theirs! You only have one chance at a good first impression and surely people are turned off when their first experience is with 3+ second searches. If it is a matter of poor configuration, then that should be addressed.

Moreover, I think my point about spaCy - which is the point of this Issue, and which you unfortunately didn't address at all - still stands quite firmly. spaCy is to NLP preprocessing what Haystack intends to be for NLP search, and should therefore be used. Not only is it much faster than NLTK, scikit-learn, and probably many other tools that are being used in Haystack Nodes/pipelines, but it would probably also open a very easy door to various other Nodes/techniques.

nickchomey commented 2 years ago

@lalitpagaria I hope it is ok to tag you here - I see that you incorporated spaCy into Obsei in various ways, so perhaps you have something useful to say/contribute?

lalitpagaria commented 2 years ago

Thank you @nickchomey for a well researched discussion. Let me go through this in depth and then I will share my views :)

vblagoje commented 2 years ago

@nickchomey,

I'd be happy to address your comments and gladly jump on the call with you if some concerns remain unaddressed.

Our Haystack demo was optimized for simplicity, not for speed. Having said that - we see your concerns about the "first impression" and will look into potential speed optimizations we can make. In the default setting, the demo returns the response in under 2 seconds for most of the queries I tried. Also, please remember that the response time is linear in the number of docs retrieved by the retriever. If you set the slider "max number of documents from retriever" to 1 or 2, the demo should be faster, as we do expensive GPU-bound inferencing on each retrieved document.

Regarding spaCy, we think there are some excellent synergies, and we welcome contributions for better preprocessing. Openness has always been one of our guiding philosophies, and Haystack has been designed from the ground up to integrate custom nodes.

To echo Daniel's comments, getting < 300-500ms with QA is possible, even under heavy search traffic. For pure document search, of course, even faster. We have several users who run Haystack happily in production. Still, there are, of course, many aspects we can further optimize. In the following months, we want to do more benchmarking and investigation to understand the biggest levers for further improvements. However, given the time spent on QA inference (dense retriever inference), we don't see how spaCy will be a big help here.

Thank you for your comments, and once again, I'd be glad to jump on a call with you.

nickchomey commented 2 years ago

Thanks very much for the responses! Evidently, I was quite wrong and out of place to say (based on a basic first impression) that Haystack has performance issues. So, I apologize for that and have edited the OP to reflect it. But I do think the demo could be "optimized" a bit more, as selecting 1 or 2 results doesn't seem to be a particularly practical solution. Perhaps some changes could be made to the config - such as using minilm-uncased-squad2 by default - so as to make a better first impression?

Anyway, I regret bringing any of that up as it was all a distraction from the real topic at hand: spaCy.

I think we all agree that spaCy has a lot to offer Haystack, but surely the team has higher priorities to focus on (especially since @vblagoje is surely correct that it isn't the real bottleneck). So, I'll spend some time working through the Tutorials and codebase to get a better understanding of how Haystack works, and then I'll see what I can do to create a custom Node for spaCy - such an effort would surely be the best training program for Haystack and NLP in general. Hopefully others will be able to provide guidance and input in the process!

If anyone has thoughts/feedback about how a spaCy Node could work (what features to make use of etc...), it would be greatly appreciated.
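To seed that discussion, here is a rough first sketch of what such a node might look like under the Haystack 1.x custom-node API (the class name, the model choice, and the `entities` metadata field are all hypothetical):

```python
import spacy
from haystack.nodes.base import BaseComponent

class SpacyNERNode(BaseComponent):
    """Hypothetical node that enriches Documents with spaCy entities."""

    outgoing_edges = 1

    def __init__(self, model_name: str = "en_core_web_sm"):
        super().__init__()
        self.nlp = spacy.load(model_name)

    def run(self, documents):
        for doc in documents:
            spacy_doc = self.nlp(doc.content)
            # Store entities in metadata so downstream nodes can filter on them.
            doc.meta["entities"] = [(ent.text, ent.label_) for ent in spacy_doc.ents]
        return {"documents": documents}, "output_1"

    def run_batch(self, documents):
        # Newer 1.x releases require run_batch; delegating keeps the sketch simple.
        return self.run(documents)
```

Sentence splitting, lemmatization, or language detection could presumably be wrapped the same way.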

@vblagoje I'm not sure that a call is necessary - I'm quite hesitant to waste your time with 101-level questions. But if you feel that you'd like to help push me in the right direction with this, that would be wonderful!