On my team, we usually use the search field to find datasets with particular characteristics. Currently, this field works by matching keywords against columns or words in the description of a dataset.
Since RAG adoption is on the rise, I'd like to propose a new feature leveraging natural language for searching the datasets:
All the information regarding the datasets (description, tags, owner, etc.) is transformed into embeddings and stored in a vector DB.
The user would then use the search bar to write a complete question like "What are the datasets we have about X?" or "Who is the owner of dataset Y?", and the system would perform a similarity search plus generation with an LLM to answer the query.
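To make the flow above concrete, here is a minimal sketch of the retrieval half. Everything in it is an assumption for illustration: the dataset records are made up, `embed` is a toy bag-of-words stand-in for a real embedding model, the in-memory `INDEX` stands in for a vector DB, and `build_prompt` only assembles the text that would be sent to an LLM (no model is called).

```python
import math
import re
from collections import Counter

# Hypothetical catalog entries; names, owners, and tags are invented.
DATASETS = [
    {"name": "sales_orders", "owner": "alice",
     "description": "Daily sales orders and revenue per customer",
     "tags": ["sales", "finance"]},
    {"name": "web_clicks", "owner": "bob",
     "description": "Clickstream events from the public website",
     "tags": ["web", "analytics"]},
    {"name": "hr_headcount", "owner": "carol",
     "description": "Monthly employee headcount by department",
     "tags": ["hr"]},
]

def to_document(ds):
    """Flatten one dataset's metadata into a single text document."""
    return (f"{ds['name']} owned by {ds['owner']}. "
            f"{ds['description']}. tags: {' '.join(ds['tags'])}")

def embed(text):
    """Toy embedding: term counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# In-memory stand-in for the vector DB: (dataset, embedding) pairs.
INDEX = [(ds, embed(to_document(ds))) for ds in DATASETS]

def retrieve(question, k=2):
    """Return the k datasets whose metadata is most similar to the question."""
    q = embed(question)
    ranked = sorted(INDEX, key=lambda item: cosine(q, item[1]), reverse=True)
    return [ds for ds, _ in ranked[:k]]

def build_prompt(question, k=2):
    """Assemble the context and question that would be passed to an LLM."""
    context = "\n".join(to_document(ds) for ds in retrieve(question, k))
    return (f"Answer using only this catalog metadata:\n{context}\n\n"
            f"Question: {question}")
```

For example, `retrieve("Who is the owner of dataset web_clicks?")` ranks the `web_clicks` entry first, and `build_prompt` places its owner in the context the LLM would answer from. Word overlap is only a placeholder here; the point of real embeddings is to match questions that share meaning rather than exact terms.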
Since there is already an ontology defined in DataHub, there could even be a more sophisticated graph RAG to answer questions involving relationships, like "How many datasets do we have regarding Z?" or "Which dataset is the parent of W?".
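As a rough illustration of why a graph helps with these relationship questions, the sketch below answers both example questions by traversing explicit edges instead of relying on text similarity. The triples, the relation names `in_domain` and `child_of`, and the helper functions are all hypothetical and do not reflect DataHub's actual metadata model; in a graph RAG setup, an LLM would presumably translate the user's question into a query of this kind.

```python
# Hypothetical (subject, relation, object) triples, invented for illustration.
TRIPLES = [
    ("sales_orders", "in_domain", "sales"),
    ("sales_refunds", "in_domain", "sales"),
    ("web_clicks", "in_domain", "marketing"),
    ("sales_orders_daily", "child_of", "sales_orders"),
]

def objects(subject, relation):
    """All objects reachable from `subject` via `relation`."""
    return [o for s, r, o in TRIPLES if s == subject and r == relation]

def subjects(relation, obj):
    """All subjects that point to `obj` via `relation`."""
    return [s for s, r, o in TRIPLES if r == relation and o == obj]

def count_datasets_in_domain(domain):
    """Answers 'How many datasets do we have regarding Z?'."""
    return len(subjects("in_domain", domain))

def parent_of(dataset):
    """Answers 'Which dataset is the parent of W?'; None if no parent is recorded."""
    parents = objects(dataset, "child_of")
    return parents[0] if parents else None
```

Counting and parent lookups are exact here, which is the main advantage over similarity search alone: a top-k retrieval step can miss entries, so it cannot reliably answer "how many".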
I think this feature would greatly improve user experience and productivity, provide a competitive advantage over other solutions (https://www.secoda.co/blog/transforming-data-discovery-using-secoda-ai), and open new possibilities for the platform as a whole.