Closed Jeevananthan-23 closed 5 months ago
cc: @mikemccand
There are some discussions about this in different places; please see https://github.com/opensearch-project/OpenSearch/issues/1687#issuecomment-1621928540 for example.
Lucene does not use threads during searching internally, everything is sequential. IndexReader is using Executors, but actually there is no reason to NOT use virtual threads.
The issue mentioned above is about disk IO, which is a different story. Since Lucene preferably uses MMAP to access disks, there's no way to make this async. It would need a complete rewrite of all Lucene internals, so this does not work with the basic design (and is also not needed). Lucene is mostly CPU-intensive operations; disk IO is not really part of the API layer.
This issue is possibly a won't-fix.
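To illustrate why memory-mapped reads leave nothing to make async, here is a minimal sketch using only the JDK's `FileChannel.map`. It is not Lucene's own directory code; the class name `MmapReadSketch` and the method `roundTrip` are hypothetical and exist only for this example. The point is that a read from a mapping is an ordinary memory access on the calling thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration class; not part of Lucene.
public class MmapReadSketch {

    /** Writes text to a temp file, maps the file read-only, and reads it back through the mapping. */
    public static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("mmap-sketch", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, text.getBytes(StandardCharsets.UTF_8));
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                byte[] out = new byte[buf.remaining()];
                // A plain memory copy. If a page is not resident, the calling thread
                // stalls on a page fault; there is no future or callback to await.
                buf.get(out);
                return new String(out, StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello lucene"));
    }
}
```

Because the read is just `buf.get(...)`, there is no IO call at the API layer that an async wrapper could hook into.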
@uschindler, I understand the complexity of completely rewriting all Lucene internals. However, IMO it is necessary to do so in parallel. Relying entirely on MMAP is a bad idea; it is good for warmup queries.
Lucene has different access patterns that are not database-like. MMAP works perfectly here. Lucene uses WORM (write once, read many) when writing files. This paper is known to us, and our long-term testing has shown that it does not apply to most Lucene workloads.
Hi @uschindler, I came across an interesting article about the Qdrant vector database, which uses io_uring for async IO and benchmarks it against mmap: https://qdrant.tech/articles/io_uring/
I think it's fair to close this issue in favor of #13179 since this is mostly about I/O concurrency?
> Lucene is mostly CPU intensive operations, disk IO is not really part of the API layer.

> However, IMO it is necessary to do so in parallel.
If Lucene's workload is mostly CPU-bound, then using async tasks won't parallelize the job: your event loop won't really loop, because the runtime thread will be busy executing the CPU-bound tasks and won't poll your async tasks.
Runtime: I'm not quite sure what it is called in the Java world (I come from the Rust world); a runtime is a set of threads that are used to schedule the async tasks.
Poll the async task: execute the async task and check whether it is complete.
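The effect described above can be sketched in Java with a single-threaded executor standing in for the one "runtime" thread. This is only an illustration under that assumption; the class `BlockedLoopSketch` and its `run` method are hypothetical names, and the busy loop is a stand-in for any CPU-bound work.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration class: one thread plays the role of the async runtime.
public class BlockedLoopSketch {

    // Sink so the busy loop's result is used and the loop is not optimized away.
    private static volatile long sink;

    /** Submits a CPU-bound task and then a tiny task to one thread; returns the completion order. */
    public static List<String> run() {
        List<String> order = new CopyOnWriteArrayList<>();
        ExecutorService loop = Executors.newSingleThreadExecutor();
        loop.execute(() -> {
            long sum = 0;
            for (long i = 0; i < 200_000_000L; i++) {
                sum += i % 7; // pure computation, never yields the thread
            }
            sink = sum;
            order.add("cpu-bound");
        });
        // This task is ready immediately, but the only runtime thread is busy above.
        loop.execute(() -> order.add("quick"));
        loop.shutdown();
        try {
            if (!loop.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The quick task cannot start until the CPU-bound one finishes, which is the sense in which the loop "won't really loop". Getting real parallelism for CPU-bound work requires more threads, not more async tasks on the same thread.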
> Lucene has different access patterns that are not database like. MMAP works perfectly here. Lucene is using WORM when writing files. This paper is known to us and our long-time testing figured out that it does not apply to most Lucene workloads.
Hi @uschindler, could you point me to some resources (e.g., docs) on your testing? Thanks!
Description
In LuceneNet (C#) we had a long conversation about adding support for an async API (issue), but it was out of scope for the project because it is a port project 😒. Lucene currently has the TaskExecutor (previously named SliceExecutor), which is responsible for offloading tasks to the executor, waiting for them all to complete, and returning the corresponding results. This is one example, but adding virtual threads (JDK 21) would help achieve concurrency for IO-bound operations.
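The offload, wait, and collect pattern described above can be sketched with the plain JDK `ExecutorService` API. This is not Lucene's TaskExecutor implementation; `OffloadSketch`, `invokeAll`, and `demo` are hypothetical names, and the per-slice work is a placeholder computation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration class; not Lucene's TaskExecutor.
public class OffloadSketch {

    /** Offloads all tasks to the executor, waits for every one, and returns results in task order. */
    public static <T> List<T> invokeAll(ExecutorService executor, List<Callable<T>> tasks) {
        try {
            List<T> results = new ArrayList<>(tasks.size());
            // ExecutorService.invokeAll blocks until all tasks have completed.
            for (Future<T> future : executor.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    /** Runs four placeholder "slice" tasks on a small fixed pool. */
    public static List<Integer> demo() {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (int slice = 0; slice < 4; slice++) {
                final int s = slice;
                tasks.add(() -> s * 10); // stand-in for per-slice work
            }
            return invokeAll(executor, tasks);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

On JDK 21 or later, the same `invokeAll` helper should work unchanged if it is handed `Executors.newVirtualThreadPerTaskExecutor()` instead of the fixed pool. Whether that helps depends on the tasks: virtual threads mainly benefit tasks that block, not CPU-bound ones.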