MarquezProject / marquez

Collect, aggregate, and visualize a data ecosystem's metadata
https://marquezproject.ai
Apache License 2.0

Feature/lucene search engine #2892

Open yanlibert opened 2 months ago

yanlibert commented 2 months ago

Problem

Currently, the new version of Marquez uses OpenSearch as the backend for the new search feature. This might be overkill: it introduces an external dependency, yet only the search and indexing features of OpenSearch are used.

Solution !! Warning: Currently a WiP !!

This is a small Lucene-based implementation that only indexes and searches two indices, one for datasets and one for jobs. It is packaged as a subproject that can run alongside the Marquez API and marquez-web. It's designed as a drop-in replacement for OpenSearch, so it's easy to switch between this implementation and a full-fledged OpenSearch. It uses a ByteBuffersDirectory, so all documents are stored in memory. The datasets and jobs are reloaded in the background at startup from the lineage_events table using the Marquez DAO.
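To make the idea of an in-memory index concrete, here is a rough sketch that uses only the Java standard library. It is not the code in this PR and does not use Lucene: the class `InMemoryIndex` and its methods are hypothetical names chosen for illustration, and the tokenizer is far simpler than a Lucene analyzer. It only shows the general shape of the approach: entities are tokenized into an in-memory inverted index, and a query returns the names that match all of its terms.

```java
import java.util.*;

/** Toy in-memory inverted index: a stand-in for Lucene, not this PR's code. */
class InMemoryIndex {
    // token -> names of the indexed entities that contain that token
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Lower-cases the text and splits it on non-alphanumeric characters. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Indexes one entity (a dataset or a job) under its name and description. */
    void add(String name, String description) {
        for (String token : tokenize(name + " " + description)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(name);
        }
    }

    /** Returns the names that contain every token of the query (AND semantics). */
    Set<String> search(String query) {
        Set<String> result = null;
        for (String token : tokenize(query)) {
            Set<String> hits = postings.getOrDefault(token, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        // An empty query matches nothing in this sketch.
        return result == null ? new TreeSet<>() : result;
    }
}
```

Under these assumptions, a service would keep one `InMemoryIndex` for datasets and one for jobs, and fill both at startup by calling `add` for each row read from the database. Because nothing is persisted, the index has to be rebuilt after every restart, which matches the background reload described above.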

Note: At the time of opening this PR, this is only a PoC, meant to open the discussion about creating this new Marquez component. As such, it still lacks some key elements (unit tests, integration tests, memory management, proper DB management, proper config ...)

netlify[bot] commented 2 months ago

Deploy Preview for peppy-sprite-186812 canceled.

Name Link
Latest commit 5689a675927cce3457cfafe66330db4baf84a378
Latest deploy log https://app.netlify.com/sites/peppy-sprite-186812/deploys/66df15353dd9ab0008478562