Problem
Currently, the new version of Marquez uses OpenSearch as the backend for the new search feature.
This might be overkill: it introduces an external dependency, yet only the search and indexing features of OpenSearch are used.
Solution
!! Warning: currently a WiP !!
This is a small Lucene-based implementation that performs only indexing and search on a dataset index and a job index. It is packaged as a subproject that can be run alongside the Marquez API and marquez-web.
It's designed as a drop-in replacement for OpenSearch, so it's easy to switch between this implementation and a full-fledged OpenSearch.
It uses a ByteBuffersDirectory so all documents are stored in memory. The datasets and jobs are reloaded in the background at startup from the lineage_events table using the Marquez DAO.
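To make the startup behaviour concrete, here is a minimal sketch of the background-reload pattern described above. It is not the code in this PR: a plain in-memory map stands in for the Lucene index backed by ByteBuffersDirectory, and a `Supplier` stands in for the Marquez DAO reading the lineage_events table. The names `InMemoryNameIndex`, `BackgroundReloadSketch`, `reloadAsync` and `fakeDao` are hypothetical and only serve the illustration.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical stand-in for the in-memory index.
// The PR itself uses Lucene with a ByteBuffersDirectory, not this map.
final class InMemoryNameIndex {
  // token -> full names that contain this token
  private final Map<String, Set<String>> postings = new ConcurrentHashMap<>();

  // Splits a name into lowercase alphanumeric tokens and records each one.
  void add(String name) {
    for (String token : name.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
      if (!token.isEmpty()) {
        postings.computeIfAbsent(token, t -> ConcurrentHashMap.newKeySet()).add(name);
      }
    }
  }

  // Returns an immutable snapshot of the names matching a single token.
  Set<String> search(String term) {
    return Set.copyOf(postings.getOrDefault(term.toLowerCase(Locale.ROOT), Set.of()));
  }
}

final class BackgroundReloadSketch {
  // Starts the reload on a background thread and returns immediately,
  // so the caller can finish starting up while the index fills.
  static CompletableFuture<Void> reloadAsync(
      Supplier<List<String>> source, InMemoryNameIndex index) {
    return CompletableFuture.runAsync(() -> source.get().forEach(index::add));
  }

  public static void main(String[] args) {
    InMemoryNameIndex datasets = new InMemoryNameIndex();

    // Hypothetical data; in the PR the names come from the
    // lineage_events table through the Marquez DAO.
    Supplier<List<String>> fakeDao =
        () -> List.of("public.orders", "public.order_items", "staging.users");

    CompletableFuture<Void> reload = reloadAsync(fakeDao, datasets);

    // Blocking here only keeps the demo deterministic; a real service
    // would keep serving requests while the reload runs.
    reload.join();

    System.out.println(datasets.search("orders"));
  }
}
```

Running `main` prints `[public.orders]`. Because nothing is persisted, the index is empty after every restart until the reload completes, which is why the reload has to run at each startup.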
Note: At the time of opening this PR, this is only a PoC, intended to open the discussion about creating this new Marquez component. As such, it still lacks some key elements (unit tests, integration tests, memory management feature, proper DB management, proper config ...)