apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Virtual threads and Lucene (support async tasks) #12531

Closed Jeevananthan-23 closed 3 months ago

Jeevananthan-23 commented 1 year ago

Description

In Lucene.NET (C#) we had a long conversation about adding support for an async API, but it was ruled out of scope because Lucene.NET is a port project 😒. Lucene currently has the TaskExecutor (previously named SliceExecutor), which is responsible for offloading tasks to an executor, waiting for them all to complete, and returning the corresponding results. That is one example, but adding virtual threads (JDK 21) would help achieve concurrency for IO-bound operations.
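To make the "offload, wait for all, return results" pattern concrete, here is a minimal sketch on top of a virtual-thread-per-task executor. It is not Lucene's TaskExecutor: the class `VirtualThreadFanOut` and the method `runAll` are hypothetical names used only for illustration, and the code assumes JDK 21 or newer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, NOT Lucene's TaskExecutor. It only mirrors the
// "offload, wait for all, collect results" pattern. Requires JDK 21+.
public class VirtualThreadFanOut {

    static <T> List<T> runAll(List<Callable<T>> tasks)
            throws InterruptedException, ExecutionException {
        // Each submitted task gets its own virtual thread.
        // Closing the executor waits for submitted tasks to finish.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            // invokeAll blocks until every task has completed.
            List<Future<T>> futures = executor.invokeAll(tasks);
            List<T> results = new ArrayList<>(futures.size());
            for (Future<T> future : futures) {
                // get() rethrows a task failure wrapped in ExecutionException.
                results.add(future.get());
            }
            return results;
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for per-slice work; real tasks would block on IO.
        List<Callable<Integer>> slices = List.of(() -> 1, () -> 2, () -> 3);
        System.out.println(runAll(slices));
    }
}
```

Results come back in the same order as the submitted tasks, because `invokeAll` returns its futures in the order of the input collection.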

Jeevananthan-23 commented 1 year ago

cc: @mikemccand

reta commented 1 year ago

There are some discussions about that in different places; please see https://github.com/opensearch-project/OpenSearch/issues/1687#issuecomment-1621928540 for example.

uschindler commented 1 year ago

Lucene does not use threads internally during searching; everything is sequential. IndexReader is using Executors, but there is actually no reason NOT to use virtual threads.

The issue mentioned above is about disk IO, which is a different story. As Lucene preferably uses MMAP to access disks, there's no way to make this async. It would need a complete rewrite of all Lucene internals, so it does not fit the basic design (and is also not needed). Lucene is mostly CPU-intensive operations; disk IO is not really part of the API layer.
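As a rough illustration of why memory-mapped reads offer no async hook, here is a minimal sketch using the plain JDK `FileChannel.map` API. It is not Lucene code; the class `MmapReadSketch` and the method `readMapped` are hypothetical names. Once the file is mapped, reading it is an ordinary memory access: if the page is not resident, the calling thread simply stalls on the page fault, and there is no callback or future to await.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch, not Lucene's MMapDirectory.
public class MmapReadSketch {

    static String readMapped(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // Map the whole file read-only into the process address space.
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte[] bytes = new byte[buffer.remaining()];
            // A plain memory copy: any page fault blocks this thread,
            // with no completion callback to register.
            buffer.get(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mmap-sketch", ".bin");
        file.toFile().deleteOnExit();
        Files.writeString(file, "write once, read many");
        System.out.println(readMapped(file));
    }
}
```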

This issue is possibly a won't-fix.

Jeevananthan-23 commented 10 months ago

@uschindler, I understand the complexity of completely rewriting all Lucene internals. However, IMO it is necessary to do so in parallel. Relying entirely on MMAP is a bad idea; it is good for warmup queries.

Ref: https://db.cs.cmu.edu/mmap-cidr2022/

uschindler commented 10 months ago

> @uschindler, I understand the complexity of completely rewriting all Lucene internals. However, IMO it is necessary to do so in parallel. Relying entirely on MMAP is a bad idea; it is good for warmup queries.
>
> Ref: https://db.cs.cmu.edu/mmap-cidr2022/

Lucene has different access patterns that are not database-like. MMAP works perfectly here. Lucene uses WORM (write once, read many) when writing files. This paper is known to us, and our long-term testing showed that it does not apply to most Lucene workloads.

Jeevananthan-23 commented 9 months ago

Hi @uschindler, I came across an interesting article on the Qdrant vector database, which uses io_uring for async I/O and benchmarks it alongside mmap: https://qdrant.tech/articles/io_uring/

jpountz commented 3 months ago

I think it's fair to close this issue in favor of #13179 since this is mostly about I/O concurrency?

SteveLauC commented 3 months ago

> Lucene is mostly CPU-intensive operations; disk IO is not really part of the API layer.

> However, IMO it is necessary to do so in parallel.

If Lucene's workload is mostly CPU-bound, then using async tasks won't parallelize the job, because your event loop won't really loop: the runtime thread will be busy executing the CPU-bound tasks, so it won't poll your async tasks.

Runtime: I'm not quite sure what it is called in the Java world (I come from the Rust world), but a runtime is a set of threads used to schedule the async tasks.

Poll the async task: execute the async task and check whether it is complete.
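A small Java sketch of that point, using a single-threaded executor as a stand-in for a one-thread runtime. The class `SingleThreadRuntimeSketch` and the methods `burnCpu` and `threadsUsed` are hypothetical names for illustration. However many CPU-bound tasks are submitted, they all execute on the same thread, one after another, so wrapping them as tasks adds no parallelism.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: a single-threaded executor plays the role of the runtime.
public class SingleThreadRuntimeSketch {

    // Stand-in for CPU-bound work: it never blocks or yields.
    static long burnCpu(int n) {
        long sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += ((long) i * i) % 7;
        }
        return sum;
    }

    // Runs taskCount CPU-bound tasks and reports which threads executed them.
    static Set<String> threadsUsed(int taskCount) throws Exception {
        ExecutorService runtime = Executors.newSingleThreadExecutor();
        try {
            Set<String> names = ConcurrentHashMap.newKeySet();
            List<Future<Long>> futures = new ArrayList<>();
            for (int i = 0; i < taskCount; i++) {
                futures.add(runtime.submit(() -> {
                    names.add(Thread.currentThread().getName());
                    return burnCpu(1_000_000);
                }));
            }
            for (Future<Long> future : futures) {
                future.get(); // wait for each task to finish
            }
            return names;
        } finally {
            runtime.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("distinct threads used: " + threadsUsed(4).size());
    }
}
```

Getting real parallelism for CPU-bound work means adding more worker threads (up to the number of cores), not more tasks.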


> Lucene has different access patterns that are not database-like. MMAP works perfectly here. Lucene uses WORM (write once, read many) when writing files. This paper is known to us, and our long-term testing showed that it does not apply to most Lucene workloads.

Hi @uschindler, could you point me to some resources (e.g., docs) on your testing? Thanks!