apache / jena

Apache Jena
https://jena.apache.org/
Apache License 2.0

More granular control over Blank node serialization #2549

Open TheMessik opened 2 weeks ago

TheMessik commented 2 weeks ago

Version

4.10.0

Feature

When serializing a DatasetGraph into NQ format, I find that all blank nodes with specified labels get a "B" prepended to the label, e.g. a blank node with a label "students" would be serialized as "_:Bstudents". This is somewhat annoying for my use case: an RML engine needs to follow a particular spec, including filling in blank node patterns.

My current workaround is a regex replacement on the serialized output, but this is far from ideal.

I'd like to suggest a more granular control of how the NQ writer (and all writers in general) handle Blank nodes: give the user an option to preserve the original blank node without prepending a "B" in front of the label.

Code example that performs the serialization:

DatasetGraph graph = ...; // some graph
OutputStream out = new ByteArrayOutputStream();
RDFWriter.source(graph)
    .lang(Lang.NQ)
    .output(out);
String serialized = out.toString().replaceAll("_:B", "_:");
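A plain `replaceAll("_:B", "_:")` also rewrites any `_:B` that happens to occur inside a literal or an IRI. As a stopgap, a slightly more careful version of the same workaround could skip over quoted literals and IRIs and only touch bare blank node tokens. This is a minimal sketch using only the Java standard library; the class and method names are hypothetical, and it only strips the leading "B". It does not undo any other encoding the writer may have applied to the label.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BlankNodePrefixStripper {
    // Alternation order matters: quoted literals and IRIs are matched first and
    // copied through unchanged; only a bare "_:B" outside them is rewritten.
    private static final Pattern TOKEN = Pattern.compile(
        "\"(?:[^\"\\\\]|\\\\.)*\""   // a quoted literal, allowing backslash escapes
        + "|<[^>]*>"                 // an IRI
        + "|(_:B)");                 // group 1: blank node prefix plus the added "B"

    public static String stripPrefix(String nquads) {
        Matcher m = TOKEN.matcher(nquads);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            // Group 1 is set only when the blank node alternative matched.
            String replacement = (m.group(1) != null) ? "_:" : m.group();
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String line = "_:Bstudents <http://example.org/p> \"see _:Bnote\" .";
        System.out.println(stripPrefix(line));
    }
}
```

In this sketch the subject label loses its "B" while the `_:Bnote` text inside the literal is left alone.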

Are you interested in contributing a solution yourself?

Perhaps?

afs commented 2 weeks ago

Hi @TheMessik,

Blank nodes from data from a parser will be large random numbers. So I'm assuming you are controlling the RDF production and setting the blank node label yourself.

The RDFWriter builder doesn't currently provide a way to set the NodeFormatter. It would be good to add this.

If you want to read such data in, and preserve the label (with care!), then use RDFParser.create().labelToNode(labelToNode) with LabelToNode.createUseLabelAsGiven(). Your code is responsible for blank node label uniqueness and the rules about what happens on graph merge and reading files multiple times.
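For the reading side, the builder calls mentioned above might be put together roughly like this. It is a sketch, not a recommendation: the class and method names are made up for illustration, and it assumes Jena 4.x on the classpath.

```java
import org.apache.jena.graph.Graph;
import org.apache.jena.graph.Triple;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFParser;
import org.apache.jena.riot.lang.LabelToNode;

public class ReadKeepingLabels {
    // Hypothetical helper: parse N-Triples text and return the blank node
    // label of the subject of the first triple found.
    public static String firstSubjectLabel(String ntriples) {
        Graph graph = RDFParser.create()
            .fromString(ntriples)
            .lang(Lang.NT)
            // Use the label from the input instead of generating a fresh one.
            .labelToNode(LabelToNode.createUseLabelAsGiven())
            .toGraph();
        Triple t = graph.find().next();
        return t.getSubject().getBlankNodeLabel();
    }

    public static void main(String[] args) {
        System.out.println(firstSubjectLabel("_:students <x:p> <x:o> ."));
    }
}
```

With this policy, reading the same file twice yields the same blank nodes, so the calling code has to decide whether that is what it wants.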

For writing: NodeFormatter is the interface for controlling the RDF term output.

Extending RDFWriterBuilder would also mean changing WriterGraphRIOT and WriterDatasetGraphRIOT, the low-level per-format interfaces.

There are several kinds of writer for the N-Triples/Turtle family of syntaxes - streamed, flat, batching and collecting - and all of them use a NodeFormatter.

At the RDFWriter level, there isn't the "writer profile" abstraction like there is when reading (where there is a node maker FactoryRDF carried by ParserProfile).

N-Quads is the simplest output form. It is streamed and uses WriterStreamRDFPlain.

Below is the code that is used for N-Quads. You could use that, modified at NodeFmtLib.encodeBNodeLabel to just use the label. Be careful - some characters aren't legal in a blank node label string.

    public static void main(String... args) {
        String input = "_:x <x:p> <x:o> .";
        Graph graph = RDFParser.fromString(input, Lang.NT).toGraph();
        AWriter out = IO.wrapUTF8(System.out);
        NodeFormatter fmt = new NodeFormatterNT() {
            @Override
            public void formatBNode(AWriter w, String label) {
                w.print("_:");
                String lab = NodeFmtLib.encodeBNodeLabel(label);
                w.print(lab);
            }
        };
        StreamRDF stream = new WriterStreamRDFPlain(out, fmt) ;
        StreamRDFOps.graphToStream(graph, stream);
    }
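Applying the suggested modification, a version that prints the label as-is could look like the sketch below. It assumes Jena 4.x, builds the graph by hand so the blank node label is known, and writes to a string rather than System.out; the class and method names are illustrative only. No check is made that the label is legal blank node syntax, so that remains the caller's job.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.jena.atlas.io.AWriter;
import org.apache.jena.atlas.io.IO;
import org.apache.jena.graph.Graph;
import org.apache.jena.graph.NodeFactory;
import org.apache.jena.graph.Triple;
import org.apache.jena.riot.out.NodeFormatter;
import org.apache.jena.riot.out.NodeFormatterNT;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFOps;
import org.apache.jena.riot.writer.WriterStreamRDFPlain;
import org.apache.jena.sparql.graph.GraphFactory;

public class WriteRawLabels {
    // Hypothetical helper: serialize a graph with blank node labels
    // written exactly as stored, without the encoded "B" form.
    public static String write(Graph graph) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        AWriter out = IO.wrapUTF8(bytes);
        NodeFormatter fmt = new NodeFormatterNT() {
            @Override
            public void formatBNode(AWriter w, String label) {
                w.print("_:");
                w.print(label);   // raw label, no NodeFmtLib.encodeBNodeLabel
            }
        };
        StreamRDF stream = new WriterStreamRDFPlain(out, fmt);
        StreamRDFOps.graphToStream(graph, stream);
        out.flush();              // make sure buffered output reaches the byte array
        return bytes.toString(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Graph graph = GraphFactory.createDefaultGraph();
        graph.add(Triple.create(
            NodeFactory.createBlankNode("students"),
            NodeFactory.createURI("http://example.org/p"),
            NodeFactory.createURI("http://example.org/o")));
        System.out.print(write(graph));
    }
}
```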

Hope that helps