bleve vs bluge question

gnewton commented 2 years ago

Hello,

I hope you don't mind me asking these questions. :-)

My understanding is that bluge is the replacement for bleve. Could you let me know why you chose to stop development of bleve and start bluge (sorry if this explanation exists elsewhere, I haven't been able to find it). How is the design or implementation of bluge an improvement over bleve? I am asking as a long time user of Lucene, and wanting a performant Go replacement for certain projects. I would appreciate you sharing some of the design direction regarding bluge, and perhaps the use cases where bluge is/will be an improvement over bleve.

constructively, :-) Glen

mschoch commented 2 years ago

Thanks for the questions. Many of them are answered here: https://blugelabs.com/blog/introducing-bluge/ Some of that may now be out of date, though, so I'll try to answer them clearly again.

> My understanding is that bluge is the replacement for bleve.

Not exactly. I would say that Bluge is a new project that shares many of the same goals as Bleve. It could function as a replacement for Bleve in some use cases, but not others.

> Could you let me know why you chose to stop development of bleve and start bluge

I did not stop development of Bleve, and do not have the power to do so. Bleve continues to be developed by the community (there were new commits as recently as 9 days ago).

I chose to start Bluge because I could no longer get anyone to pay me to work on Bleve. Bleve is used by many companies in production, and therefore proper maintenance of the library requires care. Without money to support such activities, I'd rather not spend any time on the parts of the project I don't care about. Bluge allowed me to reset and focus on the parts of a search library I find interesting (which is great because I'm not being paid).

> How is the design or implementation of bluge an improvement over bleve?

The design of Bluge is largely the same as Bleve (much of the code is the same). One of the biggest design changes of Bluge is that it now supports a pluggable Directory implementation (much like Lucene since you mentioned familiarity with it). This Directory interface decouples the index implementation from many OS/filesystem details. This let us introduce a much more efficient in-memory-only index, something Bleve still struggles to offer today.
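For illustration, here is a minimal sketch of what a pluggable directory abstraction can look like. The `Directory` interface, `memDirectory` type, and `flush` helper below are hypothetical names invented for this example; they are not Bluge's actual API, which is richer than this. The point is only that index code written against an interface does not care whether segments live in files or in a map.

```go
package main

import (
	"fmt"
	"sync"
)

// Directory is a hypothetical, simplified storage abstraction for index
// segments. It is an illustration only, not Bluge's actual interface.
type Directory interface {
	Persist(name string, data []byte) error
	Load(name string) ([]byte, error)
}

// memDirectory keeps every segment in a map and never touches the filesystem.
type memDirectory struct {
	mu       sync.RWMutex
	segments map[string][]byte
}

func newMemDirectory() *memDirectory {
	return &memDirectory{segments: make(map[string][]byte)}
}

// Persist stores a private copy of data under name.
func (d *memDirectory) Persist(name string, data []byte) error {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.segments[name] = append([]byte(nil), data...)
	return nil
}

// Load returns a copy of the segment stored under name.
func (d *memDirectory) Load(name string) ([]byte, error) {
	d.mu.RLock()
	defer d.mu.RUnlock()
	data, ok := d.segments[name]
	if !ok {
		return nil, fmt.Errorf("segment %q not found", name)
	}
	return append([]byte(nil), data...), nil
}

// flush stands in for index code: it only sees the Directory interface,
// so it works the same with an in-memory or a file-backed implementation.
func flush(dir Directory, name string, data []byte) error {
	return dir.Persist(name, data)
}

func main() {
	var dir Directory = newMemDirectory()
	if err := flush(dir, "segment-1", []byte("postings")); err != nil {
		panic(err)
	}
	data, err := dir.Load("segment-1")
	fmt.Println(string(data), err)
}
```

A file-backed implementation would satisfy the same two methods, and `flush` would not need to change.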

The next major change is that Bluge supports accessing the index from multiple processes (via OS locking primitives). Only one process can write at a time, but it is very useful architecturally to be able to search indexes from multiple processes. Again, not everyone cares about this, but if you need it, it is not possible with Bleve today.
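As a rough illustration of the "only one writer at a time" rule, the sketch below uses an exclusively created lock file. This is a swapped-in, portable technique chosen for the example, not the OS locking primitives Bluge itself uses, and the `write.lock` file name and all function names are hypothetical. Unlike OS-level locks, a lock file left behind by a crashed process stays in place until someone removes it.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// errLocked is returned when another writer already holds the lock.
var errLocked = errors.New("index is locked by another writer")

// acquireWriteLock creates dir/write.lock with O_EXCL, which makes the
// create fail if the file already exists, so only one caller can succeed.
// The returned function releases the lock by removing the file.
func acquireWriteLock(dir string) (func() error, error) {
	path := filepath.Join(dir, "write.lock") // hypothetical file name
	f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
	if err != nil {
		if errors.Is(err, os.ErrExist) {
			return nil, errLocked
		}
		return nil, err
	}
	if err := f.Close(); err != nil {
		return nil, err
	}
	return func() error { return os.Remove(path) }, nil
}

// demo takes the lock, tries to take it again, releases it, and retries.
// It reports whether each of the three attempts succeeded.
func demo() ([]bool, error) {
	dir, err := os.MkdirTemp("", "lock-demo")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)

	var got []bool
	release, err := acquireWriteLock(dir)
	got = append(got, err == nil) // first writer
	if err != nil {
		return got, err
	}
	_, err = acquireWriteLock(dir)
	got = append(got, err == nil) // second writer while the lock is held
	if err := release(); err != nil {
		return got, err
	}
	release, err = acquireWriteLock(dir)
	got = append(got, err == nil) // new writer after release
	if err != nil {
		return got, err
	}
	return got, release()
}

func main() {
	fmt.Println(demo())
}
```

Readers never call `acquireWriteLock` in this scheme, which is what lets several processes search while a single process writes.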

Finally, there are a bunch of other smaller changes. Bluge uses BM25, not TF/IDF, and has more configurable/customizable scoring. And Bluge has a proper aggregation framework, allowing you to easily build your own aggregations over all the search hits seen (in Bleve, only a limited faceting option is available).
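To show why the BM25 versus TF/IDF difference matters, here is a small self-contained sketch of both formulas for a single term. The TF/IDF variant shown is one common textbook form, and the `k1 = 1.2`, `b = 0.75` values are commonly used defaults; neither is claimed to be exactly what Bleve or Bluge implement. The function names are made up for this example.

```go
package main

import (
	"fmt"
	"math"
)

// tfidf scores one term with a simple TF/IDF variant:
// sqrt(tf) * (1 + ln(numDocs / (df + 1))).
// The score keeps growing as the term frequency grows.
func tfidf(tf, df, numDocs float64) float64 {
	return math.Sqrt(tf) * (1 + math.Log(numDocs/(df+1)))
}

// bm25 scores one term with the Okapi BM25 formula.
// k1 controls how quickly term frequency saturates;
// b controls how strongly long documents are penalized.
func bm25(tf, df, numDocs, docLen, avgDocLen, k1, b float64) float64 {
	idf := math.Log(1 + (numDocs-df+0.5)/(df+0.5))
	norm := 1 - b + b*docLen/avgDocLen
	return idf * tf * (k1 + 1) / (tf + k1*norm)
}

func main() {
	// Same term (df=10 of 1000 docs), average-length document, rising tf.
	for _, tf := range []float64{1, 10, 100} {
		fmt.Printf("tf=%3.0f  tfidf=%.3f  bm25=%.3f\n",
			tf, tfidf(tf, 10, 1000), bm25(tf, 10, 1000, 100, 100, 1.2, 0.75))
	}
}
```

In this sketch the TF/IDF score grows without bound as the term repeats, while the BM25 score approaches a ceiling of `idf * (k1 + 1)`, so a document cannot win simply by repeating a term many times.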

> I am asking as a long time user of Lucene, and wanting a performant Go replacement for certain projects. I would appreciate you sharing some of the design direction regarding bluge, and perhaps the use cases where bluge is/will be an improvement over bleve.

Sure, so from my perspective, you have two primary factors to consider.

1. Technical capabilities - in this area I feel Bluge has more capabilities, but it is also newer and less tested.
2. Support - Bleve has a longer history and a larger community, and you should consider this the safer choice. Bluge is still new, and while I think there is an emerging community of users, it is smaller and currently lacks any big name backing its use.

Even though I have left the Bleve project, I am still close friends with the maintainers of Bleve, and we communicate regularly on topics that relate to both projects. From my perspective, at the moment the projects are complementary. If you want something more proven, you should choose Bleve, but it has technical debt and backwards compatibility concerns, so sometimes development can proceed at a frustrating pace. Generally, you should only choose Bluge today in one of two circumstances:

1. Bleve is unusable for you because of some known limitation (multi-process, fast in-mem, richer aggregation, etc).
2. You are interested and willing to get involved and contribute to Bluge.

I hope this helps, let me know if you have any further questions.

xeoncross commented 2 years ago

Thanks for your answer! Ideally, this is the type of content that should be in the readme. The current README.md doesn’t say much except how to use the lib. Nothing about motivation, lineage, or technical aspects like using mmap files or what the storage and memory designs entail. When you have time, please update the readme.

mschoch commented 2 years ago

I disagree that this content belongs in the project readme. I think it is useful background info that should be documented, and it is in several places, including here now. To me, Bluge originated via Bleve, but it is not defined by it. A person first coming to this project need not read about Bleve on the main readme.

However, I am just one member of this project. If others think some or all of this content should be on the README, they should create a PR to add it there.

xeoncross commented 2 years ago

That is fine, the lineage is just a small part of what I was suggesting. Knowing Bluge originated with learnings from Bleve is handy, but not required.

I'm most interested in the motivation and technical aspects of the project. As a developer I want to know if this is built for my use and how it is designed to achieve this so I can better judge if this is a library I want to base a project on.

Is the in-memory implementation optimized for reads, writes, or low memory usage? If I have more than 100M documents, does the index quickly sprawl onto the swap space? Are datasets with a lot of small values an issue (does it make use of 4kB pages?) Does this use compression? Is the file format for the index files documented?

I would be happy to open a PR to answer these questions - but I don't know the answer. These are the types of questions that I would love to see on the github wiki, a /docs folder, a github page, or just in the readme so I can see it quickly. After all, only devs will be reading this project's readme given the highly technical nature of the library.

mschoch commented 2 years ago

I would suggest you open an issue and ask those questions then. In my experience people will try to answer them, or link to existing answers.

If any of those answers are things you think belong on the README, we can discuss that at that time.

Like many projects, we have gone through cycles of putting documentation in different places. The wiki just gets defaced, and docs anywhere else get stale because no one maintains them. I tend to think docs in the main repo would work better, but I've tired of moving things around. For now, Bluge docs are in the website repo: https://github.com/blugelabs/blugelabs.com/tree/master/content/bluge

gnewton commented 2 years ago

@mschoch A belated thank-you for the information/links you provided.

blugelabs / bluge

bleve vs bluge question #82