Open vikramsubramanian opened 4 months ago
Summary: Rules for importing and exporting RDF blank nodes and their behavior in CREATE statements.
Based on the provided information and code snippets, the following solutions can be applied to address the issues:
- Ensure that blank nodes with the same label are treated as the same entity within a single import operation: update the rrlHandle and allHandle functions in rdf_reader.cpp to maintain a map of blank node labels to generated IRIs within the scope of a single import operation. Reset this map for each new COPY statement.
- Differentiate blank nodes with the same label across different COPY statements: scope the map to each COPY statement in rdf_reader.cpp, ensuring that blank node label to IRI mappings do not persist across different COPY operations.
- Follow the convention for generated blank node IRIs (_:ibj): in rdf_reader.cpp, when generating IRIs for blank nodes, ensure the convention _:ibj is followed, where i and j are unique integers for each blank node.
- Handle user-defined IRIs that match the generated blank node IRI pattern: in the read_BLANK_NODE_LABEL function in n3.c, add logic to detect if a user-defined IRI matches the generated blank node IRI pattern and handle it according to the default behavior (error or merge).
- Maintain blank node status during export according to format specifications: in writer.c, ensure that the serialization functions (write_uri_node, write_curie, etc.) correctly handle blank nodes according to the RDF format being exported (Turtle, RDF/XML, etc.).
- Disallow blank nodes as predicates in CREATE statements: in the read_verb function in n3.c, add validation to ensure that blank nodes are not used as predicates. If a blank node is detected as a predicate, return an error.
- Update the validate_triple function in the hypothesized code to enforce the rule that blank nodes cannot be used as predicates:
    def validate_triple(subject: str, predicate: str, object: str) -> bool:
        # Validates a triple, ensuring that blank nodes are not used as predicates
        if predicate.startswith("_:"):
            raise ValueError("Blank nodes are not allowed as predicates")
        return True
- Ensure that the COPY and CREATE statements in the parser (cypher_parser.cpp and cypher_parser.h) correctly handle blank nodes according to the rules specified.
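The per-COPY label map described above could look roughly like the sketch below. It is written in Python for brevity rather than as a patch to rdf_reader.cpp, and everything in it is hypothetical: the names BlankNodeScope, iri_for, and begin_copy are made up for illustration, and it assumes that i in _:ibj identifies the COPY statement and j the blank node within it, which the issue does not actually specify.

```python
from itertools import count


class BlankNodeScope:
    """Maps blank node labels to generated IRIs for one COPY statement.

    Hypothetical helper: assumes the generated IRI is "_:<i>b<j>", with
    i = index of the COPY statement and j = index of the node within it.
    """

    def __init__(self, import_id: int) -> None:
        self.import_id = import_id
        self._next_node = count()  # j counter, local to this COPY
        self._iris = {}            # blank node label -> generated IRI

    def iri_for(self, label: str) -> str:
        # The same label within one COPY always yields the same IRI.
        if label not in self._iris:
            self._iris[label] = f"_:{self.import_id}b{next(self._next_node)}"
        return self._iris[label]


# i counter, shared by all COPY statements in this (hypothetical) session.
_import_ids = count()


def begin_copy() -> BlankNodeScope:
    """Start a fresh scope so no label mapping leaks between COPY statements."""
    return BlankNodeScope(next(_import_ids))
```

With this shape, calling iri_for("_:foo") twice on the scope of one COPY returns the same IRI, while a scope from a second begin_copy() hands out a different IRI for the same label, since the old map is simply dropped with the old scope.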
n3.c contains the logic for parsing blank node labels in Turtle files, which is relevant to the issue regarding the handling of blank nodes during import.
third_party/antlr4_cypher/cypher_parser.cpp
This file contains the parser rules for Cypher queries, which may need to be updated to handle blank nodes correctly in CREATE statements as per the issue description.
third_party/antlr4_cypher/include/cypher_parser.h
This file includes definitions for parser rules which may need to be updated in conjunction with cypher_parser.cpp to address the issue.
src/processor/operator/persistent/reader/rdf/rdf_reader.cpp
This file contains the logic for handling RDF data during the reading process, which is relevant to the issue of importing and exporting blank nodes.
When importing blank nodes from Turtle files, we should have the following rules:
- Within a single import, e.g., COPY UniKG FROM "/path/*.ttl", any blank node with label _:foo will be recognized as the same blank node and get one generated IRI.
- If _:foo is used again across different COPY statements, it will be recognized as a different blank node.
- Generated blank node IRIs have the form _:ibj, where i and j are two integers. So "_:ibj" is our prefix for blank node IRIs.
- If a user uses an _:ibj IRI for some node, we do nothing special, i.e., if _:ibj is the IRI of some node, we apply our default behavior (either an error saying there is a duplicate IRI, or a merge if a new relationship is being added, etc.).
- When exporting to a format with blank node labels (e.g., Turtle), we write the blank node as _:ibj. If we export to RDF/XML, then we omit the IRI or use whatever the blank node convention is for the exported file's format.
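To make the last three rules concrete, here is a rough Python sketch of recognizing generated IRIs and writing a subject on export. It is not the writer.c logic. The function names and the "turtle" / "rdfxml" format strings are hypothetical, and the regex assumes i and j are non-negative decimal integers. For RDF/XML the sketch takes the "use the format's convention" option and emits rdf:nodeID. Because an XML name cannot start with a digit, it prepends a letter to the label, which is an assumption of this sketch and not something the issue asks for.

```python
import re
from xml.sax.saxutils import quoteattr

# Assumed shape of a generated blank node IRI: "_:<i>b<j>".
GENERATED_BLANK_IRI = re.compile(r"_:(\d+)b(\d+)")


def is_generated_blank_iri(iri: str) -> bool:
    """True if the IRI looks like one produced for a blank node on import."""
    return GENERATED_BLANK_IRI.fullmatch(iri) is not None


def export_subject(iri: str, fmt: str) -> str:
    """Render a subject for the given (hypothetical) export format name."""
    if fmt not in ("turtle", "rdfxml"):
        raise ValueError(f"unsupported export format: {fmt}")
    if is_generated_blank_iri(iri):
        if fmt == "turtle":
            # Turtle has blank node labels, so the IRI is written as-is.
            return iri
        # RDF/XML: rdf:nodeID must be an XML name, so prefix a letter.
        node_id = "b" + iri[2:]
        return f"<rdf:Description rdf:nodeID={quoteattr(node_id)}>"
    if fmt == "turtle":
        return f"<{iri}>"
    return f"<rdf:Description rdf:about={quoteattr(iri)}>"
```

Note that is_generated_blank_iri only classifies an IRI. Per the rule above, a user-supplied _:ibj IRI still goes through the normal duplicate-IRI handling (error or merge) with no special case.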