apache / age

Graph database optimized for fast analysis and real-time data processing. It is provided as an extension to PostgreSQL.
https://age.apache.org
Apache License 2.0
3.09k stars 409 forks source link

Graph Naming Convention/Rules #317

Closed susano0 closed 1 year ago

susano0 commented 2 years ago

I found that there is no naming convention for creating a graph. Currently, we can create a new graph with names such as "." and "\n".

SELECT * FROM ag_catalog.create_graph('.');
SELECT * FROM ag_catalog.create_graph('\n');

I suggest the following rules:

  1. Graph name must begin either with an alphabet or an underscore.
  2. The name can also include numbers after the first character.
  3. No spaces or other special characters except underscore should be allowed.
jrgemignani commented 2 years ago

We should look at the other graph databases and have our naming strategy similar. I agree that we need to restrict what name is valid, especially for 'special' characters. I don't know about restricting a graph name to have an initial alphabetic character, though.

What naming strategy, if any, does Neo4j have?

PragyanD commented 2 years ago

Their naming convention seems similar to what @susano0 proposed. I was taught the same convention for naming variables in my introductory CS classes., so I'm assuming it's fairly popular.

https://neo4j.com/docs/cypher-manual/current/syntax/naming/

jrgemignani commented 2 years ago

Just be aware that there are naming conventions and actual restrictions. They may say it is good practice to do it this way, but still allow you to do it another.

jrgemignani commented 2 years ago

This is a good introductory task. I would like for all of our current interns to work on this task together.

Below are the steps to follow -

  1. Research what is allowed (not just recommended) for graph names with other graph databases.
  2. Propose a standard for AGE to the whole team for approval.
  3. Propose a way to implement this standard for review and guidance to the whole team.
  4. Submit a patch, for review, to implement this and the necessary regression tests to verify it.
jrgemignani commented 2 years ago

Btw, others - non interns - are allowed to make productive comments and suggestions. We welcome everyone's input. The work for this task, however, is to be done by the interns collectively.

rafsun42 commented 2 years ago

Started this task.

I think we can implement Cypher's exact naming rules and restrictions since we are using Cypher.

I am researching some open source implementation of Cypher to see how names are being validated. I found this list here containing many such projects. This project in particular seems relevant to me. I will be looking deeper in that project. Also, I would like to know what you guys think.

susano0 commented 2 years ago

I noticed that the label names for the vertices and edges lack a naming scheme as well. I feel that's also something we should consider, although it can be less restrictive than the graph name naming convention.

jrgemignani commented 2 years ago

I agree. Good catch!

If there isn't already an issue for each of those, could you create them, and link them here?

susano0 commented 2 years ago

Created 2 separate issues for vertex (#334) and edge (#335) label naming conventions.

rafsun42 commented 2 years ago

@jrgemignani As for the 3rd step of this task,

  1. We can create a int validate_name(char* name) function that returns whether the name abides by the naming rules.
  2. Then, use this function inside the create_graph and rename_graph functions in the graph_commands.c file. I believe these are the only two places where a graph's name is set.
rafsun42 commented 2 years ago

@jrgemignani Also, a question. Do you think it is a good idea to use regex for name validation? I found a regex string and tested it with a list of label names. It seems to work fine.

rafsun42 commented 2 years ago

Neo4j has these graph naming rules in addition to the standard naming rules mentioned earlier: https://neo4j.com/docs/cypher-manual/5/databases/#administration-databases-create-database (I assume graph is similar to a database?)

These rules are ignored when the name is escaped with backtick.

jrgemignani commented 2 years ago

Regex should be okay to use. There is a regex PG library used by AGE for the ~= operator. If we use regex, we can use that library.

jrgemignani commented 2 years ago

I believe that, in our case, graph is the same as database.

jrgemignani commented 1 year ago

Lets go with the Neo4j naming rules, less the escaping part.

jrgemignani commented 1 year ago

I think that for all names for labels, vertices, and edges, that we need to comply with Neo4j's standards.

Additionally, I think that it is a good idea to be a bit strict with valid names due to potential security issues by allowing escaped or control characters.

So, lets follow Neo4j but disallow escaped characters.

rafsun42 commented 1 year ago

I implemented the naming rules and sent a pull request #349.

rafsun42 commented 1 year ago

Here are the rules that I implemented. These rules are copied from Neo4j website. I added notes (in bold face) where I diverged slightly from Neo4j's convention. Please let me know if anything needs to be changed.

Naming rules for labels:

Naming rules for graphs:

jrgemignani commented 1 year ago

Looks good @rafsun42! I agree with your changes and will review the pull request.

TropicalPenguin commented 1 year ago

I've added some feedback to the PR.

TropicalPenguin commented 1 year ago

Continuing the discussion here so as not to clutter the PR.

However, item 2, is due to potential security holes that allowing those characters could enable. For now, this restriction needs to stay in place.

I'm curious if these potential security holes are unique to AGE (because if not, it seems surprising that another platform intended for enterprise use would be willing to take such risks).

If it's a concern about a kind of injection attack: isn't that avoided by the constraint of enclosing these characters in backticks? (kinda like the use of CDATA in XML)

Totally possible that I'm missing something...

rafsun42 commented 1 year ago

To address the issue regarding @TropicalPenguin's first point in the PR,

I researched and found a grammar specification at opencypher.org. It implements the syntax for labels this way:

UnescapedSymbolicName = IdentifierStart, { IdentifierPart } ;

(* Based on the unicode identifier and pattern syntax
 *   (http://www.unicode.org/reports/tr31/)
 * And extended with a few characters.
 *)IdentifierStart = ID_Start
                | Pc
                ;

(* Based on the unicode identifier and pattern syntax
 *   (http://www.unicode.org/reports/tr31/)
 * And extended with a few characters.
 *)IdentifierPart = ID_Continue
               | Sc
               ;

Here, ID_Start, ID_Continue, Pc, Sc are unicode character classes whose range you can see here. They seem to cover all unicode alphabets. The reference page provided in the grammar, describes these classes as:

ID_Start characters are derived from the Unicode General_Category of uppercase letters, lowercase letters, titlecase letters, modifier letters, other letters, letter numbers, plus Other_ID_Start, minus Pattern_Syntax and Pattern_White_Space code points.

In UnicodeSet notation: [\p{L}\p{Nl}\p{Other_ID_Start}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]

ID_Continue characters include ID_Start characters, plus characters having the Unicode General_Category of nonspacing marks, spacing combining marks, decimal number, connector punctuation, plus Other_ID_Continue, minus Pattern_Syntax and Pattern_White_Space code points.

In UnicodeSet notation: [\p{ID_Start}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Other_ID_Continue}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]

I tried using few characters both from the range and outside the range in Neo4J Aura. It seems they are following the above mentioned grammar. Unfortunately, in their documentation they don't mention the ranges explicitly.

I also want to mention that for database names they say they allow "Ascii alphabetic" character (does not mention non-english), however they actually allowed "unicode alphabetic" characters in Neo4J Aura.

So, I have two questions:

  1. Should I implement the grammar I found?
  2. Should I allow unicode alphabets in graph name as well?
jrgemignani commented 1 year ago

@TropicalPenguin Ty for pointing me back here. Too many places to look for this, cry. @rafsun42 yes, and yes. We can always adjust the rules going forward if necessary.