OHDSI / GIS

https://ohdsi.github.io/GIS
Apache License 2.0

Put together and demonstrate a toolset that can search the JSON-LD catalog knowledge graph #327

Open kzollove opened 7 months ago

kzollove commented 7 months ago

Put together and demonstrate a toolset that can search the JSON-LD catalog knowledge graph. This could include the use of an LLM. Right now, asking an LLM to explore a knowledge graph is bleeding edge, but it won't be in six months to a year. Here is the state of the art from maybe nine months ago: "Knowledge Graphs + Large Language Models = The ability for users to ask their own questions?" We can probably develop some best practices for chain-of-thought prompting that help an LLM in our use case, i.e. querying a catalog of datasets first at the dataset level and then at the variable level.

fils commented 4 months ago

I just wanted to add a few thoughts to this issue. Moving from a collection of JSON-LD files to an RDF-based KG is rather easy at this time. The biggest issue is the presence of similarly named blank nodes if you simply convert the JSON-LD to RDF (for example N-Quads) and combine the results. This is resolved if you feed the results into a triplestore, since on ingest it maps the blank nodes to internally unique elements.
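To make the blank node problem concrete, here is a minimal sketch that scopes blank node labels per document before concatenating N-Quads text. It is only an illustration of the collision and one way around it, not how Oxigraph, Nabu, or any triplestore handles it internally. The names `scope_blank_nodes` and `merge_documents` and the document ids are hypothetical, and the sketch assumes each id consists only of characters that are valid in a blank node label.

```python
import re

# Matches an IRI, a quoted literal, or a blank node label.
# Only the blank node alternative has a capture group, so IRIs and
# literals that happen to contain "_:" are passed through unchanged.
_TERM = re.compile(
    r'<[^>]*>'
    r'|"(?:[^"\\]|\\.)*"'
    r'|_:([A-Za-z0-9_](?:[A-Za-z0-9_.\-]*[A-Za-z0-9_\-])?)'
)


def scope_blank_nodes(nquads, doc_id):
    """Prefix every blank node label in `nquads` with `doc_id` (hypothetical helper)."""
    def rename(match):
        label = match.group(1)
        if label is None:  # IRI or literal: leave as-is
            return match.group(0)
        return "_:" + doc_id + "_" + label
    return _TERM.sub(rename, nquads)


def merge_documents(docs):
    """Concatenate N-Quads from several documents without label collisions.

    `docs` maps an assumed-unique document id to that document's N-Quads text.
    """
    return "\n".join(scope_blank_nodes(text, doc_id) for doc_id, text in docs.items())


if __name__ == "__main__":
    # Both files use the label _:b0 for different things.
    doc_a = '_:b0 <https://schema.org/name> "Dataset A" .'
    doc_b = '_:b0 <https://schema.org/name> "Dataset B" .'
    print(merge_documents({"a": doc_a, "b": doc_b}))
```

Without the renaming step, the two `_:b0` nodes would be read as the same node once the files are combined, so "Dataset A" and "Dataset B" would end up as two names of one resource.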

So you can, for example, use Oxigraph at the command line and simply feed the JSON-LD into it. A script like https://github.com/gleanerio/nabu/blob/df-dev/scripts/jsonldLoader.sh can do this. Note that the script is written to read from an object store, but it could easily be modified to work from a local directory.

The Nabu program ( https://github.com/gleanerio/nabu/blob/df-dev/docs/README.md ) is designed to generate complete graphs from JSON-LD, taking into account the blank nodes and some other edge cases. The results of that program can then be fed into a triplestore or queried locally with Jena or packages like Oxigraph or KuzuDB.

With respect to leveraging an LLM, it is possible to connect some of these LLM-based RAG approaches to the graph. There are many examples of these. There is also the approach of leveraging the LLM to generate SPARQL or other query languages. See https://python.langchain.com/v0.1/docs/integrations/graphs/ for some examples in LangChain.
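As a rough idea of what generated SPARQL could look like for the dataset-then-variable search described at the top of this issue, here is a minimal sketch that builds the two queries as strings. It assumes the catalog uses schema.org `Dataset` nodes with `schema:name`, and `schema:variableMeasured` pointing at nodes that also carry a `schema:name`; the actual catalog may be modeled differently. The function names and the `https://schema.org/` prefix are assumptions, and nothing here calls an LLM or a triplestore.

```python
def _escape_literal(text):
    """Escape backslashes and double quotes for use inside a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def dataset_query(keyword):
    """Step 1 (hypothetical): find datasets whose name contains `keyword`."""
    kw = _escape_literal(keyword.lower())
    return f"""PREFIX schema: <https://schema.org/>
SELECT DISTINCT ?dataset ?name WHERE {{
  ?dataset a schema:Dataset ;
           schema:name ?name .
  FILTER(CONTAINS(LCASE(STR(?name)), "{kw}"))
}}"""


def variable_query(dataset_iri):
    """Step 2 (hypothetical): list the variables of one dataset found in step 1."""
    # Reject characters that cannot appear inside <...> in SPARQL.
    if any(ch in dataset_iri for ch in '<>"{}|\\^`') or any(ch.isspace() for ch in dataset_iri):
        raise ValueError("not a usable IRI: " + repr(dataset_iri))
    return f"""PREFIX schema: <https://schema.org/>
SELECT ?variable ?variableName WHERE {{
  <{dataset_iri}> schema:variableMeasured ?variable .
  ?variable schema:name ?variableName .
}}"""


if __name__ == "__main__":
    print(dataset_query("air quality"))
    print(variable_query("https://example.org/dataset/1"))  # placeholder IRI
```

An LLM prompted with the graph's schema would be expected to produce queries of roughly this shape; keeping the two levels as separate queries mirrors the dataset-first, variable-second flow proposed in the opening comment.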

If you are interested I'd be happy to share some more examples or work up a short document like I did for the UN Oceans community here: https://github.com/gleanerio/archetype/blob/master/networks/oceans/README.md

jaygee-on-github commented 4 months ago

@kzollove, @martyalvarez, @AEW0330: perhaps we need to save this discussion for another day. I would like this task to be one in which we use an augmented LLM, in line with this article, to explore the catalog instead of a query language.

This doesn't seem so far-fetched now, since this is exactly what APHRC and CODATA are planning to do in another project called Data Science Without Borders. Here we are building a catalog of the research that "pathfinders" are engaged in. Each research project in the catalog is represented by a schema.org JSON-LD MedicalObservationalStudy. We are just now starting to explore how to augment an LLM and use it to query a collection of MedicalObservationalStudy knowledge graphs.