Graph Naming Convention/Rules

susano0 commented 2 years ago

I found that there is no naming convention for creating a graph. Currently, we can create a new graph with names such as "." and "\n".

SELECT * FROM ag_catalog.create_graph('.');
SELECT * FROM ag_catalog.create_graph('\n');

I suggest the following rules:

Graph name must begin either with an alphabet or an underscore.
The name can also include numbers after the first character.
No spaces or other special characters except underscore should be allowed.

jrgemignani commented 2 years ago

We should look at the other graph databases and have our naming strategy similar. I agree that we need to restrict what name is valid, especially for 'special' characters. I don't know about restricting a graph name to have an initial alphabetic character, though.

What naming strategy, if any, does Neo4j have?

PragyanD commented 2 years ago

Their naming convention seems similar to what @susano0 proposed. I was taught the same convention for naming variables in my introductory CS classes., so I'm assuming it's fairly popular.

https://neo4j.com/docs/cypher-manual/current/syntax/naming/

jrgemignani commented 2 years ago

Just be aware that there are naming conventions and actual restrictions. They may say it is good practice to do it this way, but still allow you to do it another.

jrgemignani commented 2 years ago

This is a good introductory task. I would like for all of our current interns to work on this task together.

Below are the steps to follow -

Research what is allowed (not just recommended) for graph names with other graph databases.
Propose a standard for AGE to the whole team for approval.
Propose a way to implement this standard for review and guidance to the whole team.
Submit a patch, for review, to implement this and the necessary regression tests to verify it.

jrgemignani commented 2 years ago

Btw, others - non interns - are allowed to make productive comments and suggestions. We welcome everyone's input. The work for this task, however, is to be done by the interns collectively.

rafsun42 commented 2 years ago

Started this task.

I think we can implement Cypher's exact naming rules and restrictions since we are using Cypher.

I am researching some open source implementation of Cypher to see how names are being validated. I found this list here containing many such projects. This project in particular seems relevant to me. I will be looking deeper in that project. Also, I would like to know what you guys think.

susano0 commented 2 years ago

I noticed that the label names for the vertices and edges lack a naming scheme as well. I feel that's also something we should consider, although it can be less restrictive than the graph name naming convention.

jrgemignani commented 2 years ago

I agree. Good catch!

If there isn't already an issue for each of those, could you create them, and link them here?

susano0 commented 2 years ago

Created 2 separate issues for vertex (#334) and edge (#335) label naming conventions.

rafsun42 commented 2 years ago

@jrgemignani As for the 3rd step of this task,

We can create a int validate_name(char* name) function that returns whether the name abides by the naming rules.
Then, use this function inside the create_graph and rename_graph functions in the graph_commands.c file. I believe these are the only two places where a graph's name is set.

rafsun42 commented 2 years ago

@jrgemignani Also, a question. Do you think it is a good idea to use regex for name validation? I found a regex string and tested it with a list of label names. It seems to work fine.

rafsun42 commented 2 years ago

Neo4j has these graph naming rules in addition to the standard naming rules mentioned earlier: https://neo4j.com/docs/cypher-manual/5/databases/#administration-databases-create-database (I assume graph is similar to a database?)

These rules are ignored when the name is escaped with backtick.

jrgemignani commented 2 years ago

Regex should be okay to use. There is a regex PG library used by AGE for the ~= operator. If we use regex, we can use that library.

jrgemignani commented 2 years ago

I believe that, in our case, graph is the same as database.

jrgemignani commented 1 year ago

Lets go with the Neo4j naming rules, less the escaping part.

jrgemignani commented 1 year ago

I think that for all names for labels, vertices, and edges, that we need to comply with Neo4j's standards.

Additionally, I think that it is a good idea to be a bit strict with valid names due to potential security issues by allowing escaped or control characters.

So, lets follow Neo4j but disallow escaped characters.

rafsun42 commented 1 year ago

I implemented the naming rules and sent a pull request #349.

rafsun42 commented 1 year ago

Here are the rules that I implemented. These rules are copied from Neo4j website. I added notes (in bold face) where I diverged slightly from Neo4j's convention. Please let me know if anything needs to be changed.

Naming rules for labels:

Alphabetic characters: Names should begin with an alphabetic character. This includes "non-English" characters, such as å, ä, ö, ü etc. (Note: I allowed underscore as well since Neo4j allows too, though does not mention it. Range of alphabetic character is a-zA-Z/x80-/xFF which is the same range found in the grammar file).
Numbers: Names should not begin with a number. To illustrate, 1first is not allowed, whereas first1 is allowed.
Symbols: Names should not contain symbols, except for underscore, as in my_variable, or $ as the first character to denote a parameter, as given by $myParam. (Note: I did not allow dollar symbol. I think parameter is not applicable outside cypher query where create_vlabel is called, for example).
Length: Can be very long, up to 65535 (2^16 - 1) or 65534 characters, depending on the version of Neo4j.

Naming rules for graphs:

Database name length must be between 3 and 63 characters.
The first character must be an ASCII alphabetic character. (including non-english) (Note: I allowed underscore as well. Range of alphabetic character is a-zA-Z/x80-/xFF)
Subsequent characters can be ASCII alphabetic (mydatabase), numeric characters (mydatabase2), dots (main.db), and dashes (main-db). (Note: I allowed underscore as well.)
Names cannot end with dots or dashes.
Names that begin with an underscore or with the prefix system are reserved for internal use. (Note: I allowed these reserved names.)

jrgemignani commented 1 year ago

Looks good @rafsun42! I agree with your changes and will review the pull request.

TropicalPenguin commented 1 year ago

I've added some feedback to the PR.

TropicalPenguin commented 1 year ago

Continuing the discussion here so as not to clutter the PR.

However, item 2, is due to potential security holes that allowing those characters could enable. For now, this restriction needs to stay in place.

I'm curious if these potential security holes are unique to AGE (because if not, it seems surprising that another platform intended for enterprise use would be willing to take such risks).

If it's a concern about a kind of injection attack: isn't that avoided by the constraint of enclosing these characters in backticks? (kinda like the use of CDATA in XML)

Totally possible that I'm missing something...

rafsun42 commented 1 year ago

To address the issue regarding @TropicalPenguin's first point in the PR,

I researched and found a grammar specification at opencypher.org. It implements the syntax for labels this way:

UnescapedSymbolicName = IdentifierStart, { IdentifierPart } ;

(* Based on the unicode identifier and pattern syntax
 *   (http://www.unicode.org/reports/tr31/)
 * And extended with a few characters.
 *)IdentifierStart = ID_Start
                | Pc
                ;

(* Based on the unicode identifier and pattern syntax
 *   (http://www.unicode.org/reports/tr31/)
 * And extended with a few characters.
 *)IdentifierPart = ID_Continue
               | Sc
               ;

Here, ID_Start, ID_Continue, Pc, Sc are unicode character classes whose range you can see here. They seem to cover all unicode alphabets. The reference page provided in the grammar, describes these classes as:

ID_Start characters are derived from the Unicode General_Category of uppercase letters, lowercase letters, titlecase letters, modifier letters, other letters, letter numbers, plus Other_ID_Start, minus Pattern_Syntax and Pattern_White_Space code points.

In UnicodeSet notation: [\p{L}\p{Nl}\p{Other_ID_Start}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]

ID_Continue characters include ID_Start characters, plus characters having the Unicode General_Category of nonspacing marks, spacing combining marks, decimal number, connector punctuation, plus Other_ID_Continue, minus Pattern_Syntax and Pattern_White_Space code points.

In UnicodeSet notation: [\p{ID_Start}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Other_ID_Continue}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]

I tried using few characters both from the range and outside the range in Neo4J Aura. It seems they are following the above mentioned grammar. Unfortunately, in their documentation they don't mention the ranges explicitly.

I also want to mention that for database names they say they allow "Ascii alphabetic" character (does not mention non-english), however they actually allowed "unicode alphabetic" characters in Neo4J Aura.

So, I have two questions:

Should I implement the grammar I found?
Should I allow unicode alphabets in graph name as well?

jrgemignani commented 1 year ago

@TropicalPenguin Ty for pointing me back here. Too many places to look for this, cry. @rafsun42 yes, and yes. We can always adjust the rules going forward if necessary.

apache / age

Graph Naming Convention/Rules #317