cozodb / cozo

A transactional, relational-graph-vector database that uses Datalog for query. The hippocampus for AI!
https://cozodb.org
Mozilla Public License 2.0
3.24k stars, 92 forks

Review & comment on CozoDB tutorial/overview #156

Closed: dustyatx closed 8 months ago

dustyatx commented 11 months ago

Hey, I'm trying to learn CozoDB, so I ran the documentation through Claude 2 100k and manually guided it in writing a tutorial that would make it easier to get started. I did my best to catch any hallucinations, but I might have missed some since I'm not well versed enough in some of these concepts.

Would someone be kind enough to comment on the doc, let me know what I should correct or if I missed any important topics or concepts? I'll wait to publish it..

I have a pretty good LinkedIn network, so I should be able to get exposure with a few hundred data engineers, data scientists, cloud architects, product managers, devs, etc.

Google Doc - CozoDB Tutorial

zh217 commented 11 months ago

That's a very good idea! However, a casual glance at the generated doc shows that it still hallucinates quite a lot, and does not have a grasp on the grammar of the query language (it gets it right only in the simplest cases). The structure of the documentation is good though, and probably using retrieval-augmented generation with a stronger model (like GPT-4) will make it better.

dustyatx commented 11 months ago

Thank you for taking a look. I was testing Claude 2 100k to see if the larger context would give it enough information to fill in the gaps. I fed it most of the documentation and then started trying to build off of that as a foundation. So a bit of a failed experiment but maybe I didn't find the right use for that model.

I have another technique that I can try which is a bit more than just document retrieval, but it takes a while for me to get it worked out for specific use cases. I have stuff, but it's all prototypes and needs a lot of intervention right now. That's the reason why I'm trying to learn Cozo.

I'll give it another try, I think this other approach does a better job with controlling hallucinations.

dustyatx commented 11 months ago

It would really help if I had more examples of queries.. if you have any you can share, I have an idea on how I can use it..

infogulch commented 11 months ago

Ideally you'd be able to randomly generate syntactically valid queries. Then you could train it with as much data as it wants. :)
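A rough sketch of that idea in Python, for illustration only. It is not Cozo's own generator. The schema is made up, the grammar covers only the `?[...] := *relation{...}` shape quoted later in this thread, and joining atoms with commas is an assumption about CozoScript's conjunction syntax. "Valid" here means valid for this toy grammar; nothing is checked against the real parser.

```python
import random

# Hypothetical toy schema: relation name -> column names (not taken from Cozo's docs).
SCHEMA = {
    "personnel": ["department", "employee"],
    "projects": ["project", "department"],
}


def stored_atom(rng, relation):
    """Render a named-binding atom such as *personnel{department, employee}."""
    cols = SCHEMA[relation]
    k = rng.randint(1, len(cols))
    # Keep the chosen columns in schema order so output is easy to read.
    chosen = sorted(rng.sample(cols, k), key=cols.index)
    return "*%s{%s}" % (relation, ", ".join(chosen)), chosen


def random_query(rng):
    """Build one query of the shape ?[vars] := atom, atom, ..."""
    relations = rng.sample(sorted(SCHEMA), rng.randint(1, len(SCHEMA)))
    atoms, bound = [], []
    for rel in relations:
        text, cols = stored_atom(rng, rel)
        atoms.append(text)
        bound.extend(c for c in cols if c not in bound)
    # Head variables are drawn only from variables bound in the body.
    head = rng.sample(bound, rng.randint(1, len(bound)))
    return "?[%s] := %s" % (", ".join(head), ", ".join(atoms))
```

Calling `random_query(random.Random(0))` repeatedly yields a stream of query strings; a real training set would need a far richer grammar and a validation pass through the actual Cozo parser.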

ansarizafar commented 11 months ago

@dustyatx That's a very good idea! I am also interested in a simple, easy-to-understand tutorial, as the CozoDB documentation is incomplete and not beginner friendly.

dustyatx commented 11 months ago

I just tried another attempt at this, but the documentation seems to be triggering a lot more hallucinations than I normally get. This happens no matter which parameters I use, with GPT-4, Bison, and Claude 2.

I use this process all the time at work and it usually provides a far better result than a person can accomplish. It starts with extracting the rules & core concepts. Then I inject those into a prompt that uses that information to rewrite and expand the other sections. I've had good success with this in the past, but unfortunately it's not getting the rules & core concepts extracted properly.
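The two stages described above could be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the function names and prompt wording are hypothetical, and `llm` stands for any callable that takes a prompt string and returns the model's text.

```python
from typing import Callable, List


def extract_rules(reference_doc: str, llm: Callable[[str], str]) -> str:
    """Stage 1: ask the model for the rules and core concepts in the reference doc."""
    prompt = (
        "Extract the syntax rules and core concepts from the documentation "
        "below as a numbered list. Do not add anything that is not in the text.\n\n"
        + reference_doc
    )
    return llm(prompt)


def rewrite_section(section: str, rules: str, llm: Callable[[str], str]) -> str:
    """Stage 2: inject the extracted rules into the prompt that rewrites one section."""
    prompt = (
        "Rules and core concepts:\n" + rules + "\n\n"
        "Rewrite and expand the section below, using only the rules above.\n\n"
        + section
    )
    return llm(prompt)


def run_pipeline(reference_doc: str, sections: List[str],
                 llm: Callable[[str], str]) -> List[str]:
    """Extract the rules once, then reuse them for every section."""
    rules = extract_rules(reference_doc, llm)
    return [rewrite_section(section, rules, llm) for section in sections]
```

Because every rewrite depends on the stage 1 output, a faulty extraction propagates into all later sections, which matches the failure described here.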

Here is my best attempt to troubleshoot why the LLMs are hallucinating so much. I'd say the feedback tracks with my own experience trying to understand the documentation.

Troubleshooting prompt (after the prompt, I passed in the queries.rst doc):

Please analyze the provided Cozo DB software documentation with a focus on identifying issues with the documentation that could be triggering hallucinations. We need feedback on how to improve the documentation, not an explanation of the limitations of a large language model. The model has a foundational understanding of Datalog. Is there any key or critical information missing? Is this document complete? We are trying to troubleshoot why this document triggers hallucinations that compound quickly. Evaluate the progression and structure of the examples provided. Are there any issues that would confuse a large language model? Assess the explanations of key concepts. Are these concepts explained in a way that would be clear to a large language model? Examine the use of syntax in the documentation. Is the syntax used consistently and explained clearly? Examine the examples provided. Do they give a clear understanding of the concepts they are describing? Review the explanations of specific concepts. Are these concepts introduced and explained in a way that is clear and understandable? Provide SPECIFIC examples from the text to illustrate your points and explain why these aspects of the documentation may be causing issues. DO NOT FOCUS ON THE CAPABILITIES OR THE ABILITIES OF THE MODEL OR ITS TRAINING DATA, OR SPECULATE ON WHY IT MIGHT GET CONFUSED. ONLY EXPLAIN ISSUES WITH THE DOCUMENTATION. DO NOT SPECULATE OR GUESS ABOUT THINGS THAT ARE NOT CONTAINED IN THE TEXT. DO NOT COMMENT ON URLS, EXTERNAL DOCUMENTS OR VISUALIZATIONS! THE LARGE LANGUAGE MODEL DOES NOT HAVE THE ABILITY TO USE THOSE.


In general, the documentation provided is clear and explains key concepts of CozoScript. However, there are some potential areas of improvement that might be causing confusion for the AI. Here are some possible issues:

  1. Explanation of atoms and syntax: The documentation attempts to cover a lot of ground in explaining the complex topic of atoms. However, the explanations are a bit dense and may be overwhelming to someone (or an AI) unfamiliar with the subject. This could potentially cause confusion, leading to the reported hallucinations. For example:

    Atoms come in various flavours. In the example above: rule_a['constant_string', b] is an atom representing a *rule application*: a rule named rule_a must exist in the same query and have the correct arity (2 here). Each row in the named rule is then *unified* with the bindings given as parameters in the square bracket...

    A more structured breakdown and categorization of the different types of atoms and their syntax could be helpful.

  2. Lack of definitions for critical terms: Terms such as "atom", "rule", "relation", "arity", and "unification" are used throughout the documentation. However, their definitions are not clearly introduced. This lack of clear definitions could cause a model to be uncertain or make incorrect assumptions about their meanings.

  3. Use of complex language and jargon: The documentation uses a lot of complex language and specific jargon, which might be difficult to parse without a detailed understanding of the underlying principles. This could lead to difficulties in comprehending and thus misinterpretations. For instance:

    Recursion cannot occur in negated positions (*safety rule*): r[a] := not r[a] is not allowed.

    It would be beneficial to explain what "negated positions" means before stating the rule about recursion.

  4. Lack of explicit examples: While there are examples provided, they often involve complex concepts and syntaxes introduced without much context or breakdown. The examples would be clearer if they were tied to simpler concepts first, with explanations of how the syntax represents those concepts. For example:

    ?[department, count(employee)] := *personnel{department, employee}

    In this example, explaining how this line represents a query for counting employees in each department could help with understanding.

  5. Insufficient distinction between types of rules: Inline and fixed rules are described but the differentiation between them is not entirely clear. The unique traits of each type should be more clearly defined and explained, possibly with examples demonstrating the different characteristics of each.

  6. Lack of contextual linking between topics: The document jumps from topic to topic without much linkage. For example, the section on recursion directly follows the section on negation, but there's no transition to help a reader (or an AI) understand how these concepts are related.

  7. Inconsistent use of punctuation in syntax: The documentation uses both square brackets [] and curly braces {} in the syntax without clearly explaining when to use one over the other. For example:

    *stored_relation[bind1, bind2]

    *stored_relation{col1: bind1, col2: bind2}

  8. Potentially confusing formatting: Use of :=, <~, <- and [] can potentially be confusing. It would be beneficial to consistently use a certain format to express certain actions in the language.

  9. One of the noticeable issues with the provided documentation is its heavy reliance on a reader's prior understanding of Datalog, and to a certain extent, the specific nuances of the Cozo DB system.

Summary: Datalog is a logic programming language, and its concepts form the core of CozoScript, the query language described in the documentation. Knowledge of Datalog is assumed implicitly, for example, in the sections describing the relationship between the rule head and rule body, the concept of 'relations', the use of conjunctions, the binding and unifying of variables, and the use of rule applications.

Similarly, the document uses Cozo DB specific terms such as 'inline rules', 'fixed rules', 'constant rules' and Cozo-specific operators like :=, <~, and <- without providing an introduction or context to these terms. For someone familiar with Datalog and Cozo DB, these terms and operators would be clear. However, a reader who lacks prior knowledge may struggle to understand their precise meaning and function.

The document could be greatly improved with a brief introduction to Datalog and an explanation of how CozoScript builds upon it. Key concepts of Datalog, such as rules, relations, and variables, should be explicitly explained to ensure that the reader has a firm grasp of the underlying concepts before moving on to CozoScript's specific implementation and syntax. Similarly, introducing and explaining the unique terms, rules, and operators of CozoScript would be helpful to ensure readers are not left confused or misinterpreting the subsequent sections.
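To make points 1, 4 and 7 above more concrete, here is a small Python sketch (not CozoScript) of what the quoted snippets appear to compute. The sample rows and function names are made up. The sketch assumes that square brackets bind variables by column position, that curly braces bind by column name, and that an aggregate in the rule head groups by the remaining head variables. These assumptions reflect a reading of the quoted documentation and should not be taken as an authoritative statement of Cozo's semantics.

```python
from collections import Counter

# Made-up rows for a stored relation `personnel` with columns (department, employee).
COLUMNS = ("department", "employee")
PERSONNEL = [
    ("eng", "alice"),
    ("eng", "bob"),
    ("ops", "carol"),
]


def bind_positional(rows, variables):
    """*personnel[d, e]: bind variables to columns by position."""
    return [dict(zip(variables, row)) for row in rows]


def bind_named(rows, columns, mapping):
    """*personnel{employee: e}: bind variables by column name."""
    index = {name: i for i, name in enumerate(columns)}
    return [{var: row[index[col]] for col, var in mapping.items()} for row in rows]


def apply_rule(rows, args):
    """rule_a['constant_string', b]: unify each row with the arguments.

    Each argument is ("const", value) or ("var", name). A row survives only
    if every constant matches and repeated variables agree.
    """
    results = []
    for row in rows:
        if len(row) != len(args):
            raise ValueError("arity mismatch")
        binding, ok = {}, True
        for (kind, value), cell in zip(args, row):
            if kind == "const":
                ok = cell == value
            elif value in binding:
                ok = binding[value] == cell
            else:
                binding[value] = cell
            if not ok:
                break
        if ok:
            results.append(binding)
    return results


def count_by_department(rows):
    """?[department, count(employee)] := *personnel{department, employee}"""
    counts = Counter(department for department, _employee in rows)
    return sorted(counts.items())
```

With the sample rows, `apply_rule(PERSONNEL, [("const", "eng"), ("var", "b")])` keeps only the two `eng` rows and binds `b` to each employee, and `count_by_department(PERSONNEL)` returns `[("eng", 2), ("ops", 1)]`.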

dustyatx commented 11 months ago

Not to criticize the author(s) of the doc but the current state of the documentation does make it a challenge to get started with Cozo.. Given that it's best to get people using it as much as possible and providing feedback at this stage, it's probably a good place to invest a bit more time and energy.

As much as writing documentation is painful and takes up valuable coding time, I think this creates a significant blocker for onboarding new users.

I'm happy to help review updates and give feedback.

zh217 commented 11 months ago

I agree that the docs are not up to standard, but an even bigger problem faced by all new languages/DSLs is that they are new, and hence GPT/other LLMs cannot write correct code for it. This has been troubling me a lot for several months and no amount of prompting can fix it. (Making Cozo speak SQL instantly solves all problems, of course, but that defeats the purpose of this project.)

We actually have a half-decent generator of valid CozoScript, but training or fine-tuning LLMs on its output is infeasible at the moment due to the high cost. I think we can just wait a bit longer, since at the moment LLMs are evolving faster than anyone can write documentation.

gisborne commented 11 months ago

I intend to work on the docs soon. I wonder how hard it would be to train an LLM on CozoScript? Even a half-decent English-language query ability would be amazing, and in the fullness of such a thing, CozoScript has to be easier to generate than SQL.

ansarizafar commented 11 months ago

I totally agree with @dustyatx on all points. Most developers know SQL, so it would be a great help if we could add a SQL-to-CozoScript mapping chart to the documentation, like mongodb.com/docs/manual/reference/sql-comparison.

gisborne commented 11 months ago

I’ll take a crack at this.

dustyatx commented 11 months ago

@gisborne If you could put together a couple of step by step user journeys, I'm happy to run through them to test to make sure it's a standalone document. Start to finish installing, loading data & querying the data (for all the major use cases), saving and reloading the database, etc.

Am I correct in thinking that Graph, Search (Full-text & Similarity), Analytics are the most important capabilities that have to be demonstrated for onboarding? Maybe my bias but that is why I'm focusing on CozoDB since the other vector DBs don't provide all of these capabilities.

The other thing I was trying to figure out is how the CozoScript variant of Datalog differs from the standard one. I had a pretty good sense of the basics of Datalog after going through the tutorial Learn Datalog Today. If I had the mapping, I think I might have been able to get GPT-4 to figure out how to transform the Datalog tutorial into a CozoScript tutorial. Maybe not the first thing to be done, but since information is so scarce on this topic, it would be good if the community could leverage the Datalog information to learn CozoScript.

athanhat commented 10 months ago

The documentation will become a lot better if there are definitions followed by simple examples that illustrate the concepts, syntax, and use cases of CozoDB's version of Datalog. Moreover, since CozoDB is trying to become truly relational, it would be beneficial to demonstrate how relational operators translate into CozoScript. Most important, I would like to see:

Moreover I agree 100% with @zh217 - Ziyang, stay away from any SQL implementation, but it is a good idea to create some SQL to CozoScript mapping. For example after a lot of experimentation I found how to write Outer Join queries but I am not sure if that is the optimal way in CozoScript. Does it make sense to create an operator to simplify this?
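For readers unfamiliar with the problem: in Datalog-style languages a left outer join is commonly built as the union of two parts, one for left rows that have a match and one that uses negation to find left rows with no match and pads them with null. The Python sketch below shows only those semantics on made-up data. It is not CozoScript, it is not necessarily the approach found above, and it says nothing about which formulation is optimal in Cozo.

```python
def left_outer_join(left, right):
    """Join (key, a) rows with (key, b) rows, keeping unmatched left rows.

    Part 1 mirrors an ordinary join rule.
    Part 2 mirrors a rule with a negated atom: left rows whose key never
    appears in `right`, padded with None in place of the missing value.
    """
    matched = [(k, a, b) for (k, a) in left for (k2, b) in right if k == k2]
    right_keys = {k for (k, _b) in right}
    unmatched = [(k, a, None) for (k, a) in left if k not in right_keys]
    return matched + unmatched
```

For example, joining `[(1, "a"), (2, "b")]` with `[(1, "x"), (1, "y")]` gives `[(1, "a", "x"), (1, "a", "y"), (2, "b", None)]`.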

Kudos on the effort to create a relational datalog DBMS, wish you a successful continuation

gisborne commented 5 months ago

I can write some more if folks tell me what is needed.

However, the last time I made a PR to improve the docs, it broke the inline examples and I don’t know how to avoid that.

On 14 Jan 2024, Ivan Suslov wrote:

Bumping it up. Seems like we are missing key concepts in documentation which can help a lot.