atextor / turtle-formatter

Java library for pretty printing RDF/Turtle documents
Apache License 2.0
16 stars 1 forks source link

Option to retain comments? #2

Open renefritze opened 2 years ago

renefritze commented 2 years ago

Would you a) be willing to have that config option and b) how hard would you think this would be to add?

atextor commented 2 years ago

Hi, this is something that has been bothering me as well for a long time (much longer than this project exists). The answer to a) is easy: I'd love to have that option. However, the problem here is that the RDF libraries throw away all information about comments in their parsers. I know this to be true for Apache Jena and believe RDF4J works similarly. But even if we were to implement our own RDF/Turtle parser, a regular Jena model does not even has methods to work with the comments. I see only three ways to deal with this:

TBH: I have thought about option 2 more than once, but it's not going to happen soon (from my side) due to limited time. So the answer to b) is unfortunately "very hard".

renefritze commented 2 years ago

That is indeed unfortunate. Thank you for the detailed explanation though! Both non-stoic solutions are way out-of-scope for the project I'm working on, so I cannot really commit to those. If you ever do get started on option 2 though, feel free to ping me and I'll try to help.

renefritze commented 2 years ago

Just FYI: Apparently rdflib on the python side also doesn't handle comments

VladimirAlexiev commented 2 months ago

@renefritze How can you determine to WHAT a comment applies? This formatter reorders terms to a predictable manner, so you must answer this question in order to know where to attach the comment in the reformatted version.

The only format that caters for comments is Relax NG Compact (a XSD schema language). Eg this

   element Subject {
     ## There can be one and only one preferred ancester branch, but zero to many non-preferrened ancester branches
      element Ancestors {
        (element Preferred_Ancestor_Branch {

is translated to <a:documentation> in this

<rng:element name="Subject">
            <rng:interleave><rng:element name="Ancestors">
                <a:documentation>
                  There can be one and only one preferred ancester branch, but zero to many non-preferrened ancester branches
                </a:documentation>

                  <rng:interleave><rng:element name="Preferred_Ancestor_Branch">

But RDF doesn't have such features.

As a consolation, every comment in an ontology should usefully be expressed as a specific annotation, eg skos:editorialNote, skos:example, etc.

I think you should close this

atextor commented 2 months ago

Hi @VladimirAlexiev, of course you are correct in that there is no built-in logic in RDF to what a comment applies. But there are ways that can be addressed on the level of the Turtle format, and there are valid use cases for this, which is why this issue is not closed and should be addressed instead.

I think there are roughly four types of comments that appear in RDF/Turtle:

By hooking into the parser (or creating a custom parser, see my first comment), you can figure out which type of comment you have and then either reifiy it in the model or provide access to it using a custom API (which I mean with "an own implementation of a Jena model").

Let's say you have this file:

# Copyright ACME Corporation

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

#
# Classes
#

# The class of all persons
:Person a rdfs:Class .

#
# Properties
#

# The name of a person
:name a rdf:Property .

You could end up with a model with reified comments:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix formatting-meta: <https://github.com/atextor/turtleformatter/meta#> .

:Person a rdfs:Class ;
  formatting-meta:comment "The class of all persons" .

:name a rdf:Property ;
  formatting-meta:comment "The name of a person" .

[
  a formatting-meta:headerComment "Copyright ACME Corporation"
]

[
  a formatting-meta:sectionComment """\nClasses\n\n"""
]

[
  a formatting-meta:sectionComment """\nProperties\n\n"""
]

What is missing here is the location information about the section comments, which might work like this:

@fkleedorfer has already created the base for this custom-parser approach in #15.

One more note on the validity of doing this in the first place vs. just expressing comment information in the RDF graph itself: IMO both techniques have their use and although you'd lose textual comments from an RDF/Turtle document once you put it in a triple store or start processing it strictly on RDF-level, RDF/Turtle files are still often treated like source code, where you add hints and sometimes comment out a section. Last but not least, the comment-based copyright/license header is important for source code license management where many tools just parse comments, but not an RDF document (even though it would be cool to get this info directly from the RDF file using SPDX vocabulary, but I digress). This project is about formatting the RDF/Turtle after all, and at least in this scope, it's definitly valid to want to retain comments.

renefritze commented 2 months ago

I'm no longer working on the project I wanted this for, so feel free to close.

fkleedorfer commented 2 months ago

RDF/Turtle files are still often treated like source code, where you add hints and sometimes comment out a section.

case in point: a contribution to QUDT with addtional information for reviewers, making it possible to assess correctness: https://github.com/fkleedorfer/qudt-public-repo/blob/ef3e91832691e59e02317ac98c79bdd8eaba2b07/vocab/unit/VOCAB_QUDT-UNITS-ALL-v2.1.ttl#L228

In this case, I want formatting applied and leave the comment as-is, attached to the follwing statement, if reordering must happen (which might well be the case).

I see no value in transforming comments to triples. If you wanted a comment triple, you'd write it as a triple. To transform comments to triples is essentially a macro for writing TTL. Possibly that's another interesting project but it's not code formatting.

Preserving comments has value for me. Think of unit test input files with information for the maintainer of the file (these triples for testing this, those for testing that; comment something out occasionally while developing the tests - all of these are real examples.

fkleedorfer commented 2 months ago

The custom parser in #15 already gets row/col information for each new Node. What would be missing is an additional parse that only reads comments (with their positions), and then figures out which comments are blocks, and which comments should be associated with which RDF node. All totally doable.

atextor commented 2 months ago

@fkleedorfer I should have clarified, IMO the only point of having comments reified into triples is to have the information available when writing the model out into a file again; the comment meta information should of course then be converted to proper RDF/Turtle comments again. The reificiation is not necessary just to do this (it could also be kept beside the model), but it would allow for the roundtrip Turtle->parse->write RDF to triple store->retrieve->serialize Turtle.

VladimirAlexiev commented 2 months ago

If you tackle this, you guys are my heros! You give enough use cases to justify the need.

But there be dragons:

atextor commented 2 months ago

Yes, this is definitely non-trivial :dragon:. IMO this can only be solved in a "best-effort" way, i.e., the solution handles for example the 3-4 well-defined cases and for everything else the result is arbitrarily/randomly/badly formatted. This would still solve >=80% valid cases and would be massively useful, even if it's not a 100% solution.

fkleedorfer commented 2 months ago

agreed - the 80% being that comments are just reproduced in the output as-is if the input was already turtle-formatter formatted (no changes in triple ordering).

If there are changes in ordering, comments might not always end up where a human expert would like them, but users really would not expect them to because it's obviously hard in some cases.

hoijui commented 2 months ago

boah... all of you.. I have been dwelling about these issues for so long, and never found anything/anyone even remotely seeing this as an issue ... am so glad to have found you, and for this tool, @atextor ... thank you!!! I started to extend a rust formatter (turtlefmt) a few weeks ago, which does parse comments and has location information. That formatter currently only does very limited formatting, nowhere near what this software does, and I wanted to get it to this level.

Side Note: The author of that formatter also maintains the oxrdf/oxigraph rust RDF ecosystem, whichs parsers also don't parse comments, and wrote a completely separate parser just for this tool.

Now having found this (thanks to @VladimirAlexiev , thanks you! ), I will hop over and focus on turtle-formatter.

To finally get to something substantial:

My personal, current favored way to tackle this issue, is also to convert comments to RDF, but to let the user do this manually: If comments are found, issue a warning/error and abort the formatting (this should not be the default for a generic tool like this one, of course). In the warning/error message, give a link to a page that explains/suggests how to convert comments to RDF. I think this would cover almost all possible uses cases of comments, even though some would be somewhat hacky. The REUSE tool (scans for SPDX license headers) for example, is happy to parse this:

<>
  a owl:Ontology ;
  a spdxt:File ;
  schema:comment """
# SPDX-FileCopyrightText: 2024 John Doe <john.doe@mail.com>
# SPDX-License-Identifier: GPL-3.0-or-later
""" ;
  .

because it only does a very simple regex-match of each line for all file-types. If it did parse the actual syntax instead, it could pick up the actual SPDX RDF related triples, so either way we are safe.

Comments to the ontology/file and to subjects are trivial to convert. Section/Group comments can be handled by using skos:Concepts, skos:Collection and skos:memberOf or something similar.

Commented-out code can be handled similar to the REUSE example above, adding the comment either to the related subject or to a new blank-node (if the subject it is related to is commented-out itsself, or it is a @base or a @prefix). Alternatively, it could be marked as deprecated, unused or the like using RDF (e.g. with vs:term_status set to 'testing' or 'archaic'). ... I'd categorize in-between "not very nice" till "very ugly", but it is still my currently preferred way of doing it. It also gives a high motivation of getting rid of that TODO that is often hidden behind such a comment.

I also like the formatting-meta approach you mentioned @atextor. If going there though, and putting in all that work, I would rather go all the way though, so that more RDF parsers and formatters for other languages and serialization-formats could be readily invited to use that ontology and maintain it together. In the best case, this ontology could even serve for non-RDF source files, as this is an issue for auto formatters in almost any language.


Sorry for this long, not very well structured intro. My thoughts and ideas were locked up in my head, with only me working on them for too long, and I have much to learn about possible issues and what the data looks like out there and practicality and so on. I'll hope to be able to do that with you here. Thanks already for all the work you have done!

atextor commented 2 months ago

Thank you for the kind words @hoijui. One more thing I forgot to mention: The main advantage of putting the comment information in the model itself using a meta-vocabularly would be that this model would continue to work with all regular Jena & other APIs (except for writing the .ttl file, obviously), so you could pass it around without the need to create a specialized Model interface.

VladimirAlexiev commented 2 months ago

I really like capturing comments in RDF. Then it's up to the ontology maintainer to dispatch the text to appropriate fields (or delete it). I can write the instructions, I've been doing it in git issues for many ontologies. Eg skos:editorialNote, skos:scopeNote, skos:example. Use a custom prop/namespace to avoid stepping on toes . Issue a warning not error.

Can you guys recognise commented-out? How?

atextor commented 2 months ago

@VladimirAlexiev, yes, I absolutely see the advantage of using a proper vocabulary and putting comments in the RDF itself from the start. If we were discussing a "best practices" guide or an ontology governance process I'd be in favor of it (with the exception of the copyright/license block at the top of the file). And this issue is orthogonal to it: if an RDF/Turtle file contains "real" comments, how do we deal with it in order to retain them when the file is formatted. The idea is that when such a model is parsed, the graph will contain the comments reified as triples (using a custom vocabulary), but as soon as it's written to a file, those triples will be written as regular comments again. A user is not supposed to ever see/edit those comment triples.

VladimirAlexiev commented 2 months ago

Re-serializing comments is possible in turtle and rdf/xml but not in jsonld.

A user is not supposed to ever see/edit those comment triples.

By "user" do you mean the ontology maintainer? The need for the maintainer to dispatch such triples from the custom namespace into appropriate fields will hasten the improvement of the ontology.

Let's have an option to serialize as comments vs as triples.

fkleedorfer commented 2 months ago

Let me try to see if I understood the approaches correctly:

  1. Comments are source code. They never get parsed to triples, but it would be nice if they could be reproduced in the re-serialization

  2. Comments should be parsed to RDF so they can be reproduced in the re-serialization (i.e, an approach to implement 1.)

  3. Comments should be parsed to RDF and the re-serialization should have no source code comments but additional triples instead.

If that is correct, then I think 1 should be the default (however it is implemented internally). Reason: formatting should never alter the data unless the user explicitly requests that.

hoijui commented 2 months ago

To me, there is a 4th option:

  1. Comments should not be used in RDF, and if present, should be converted to triples by hand. If encountered, a warning/error should be emitted, suggesting to convert them to triples, and ignoring them completely, otherwise
atextor commented 2 months ago

To me, there is a 4th option:

4. Comments should not be used in RDF, and if present, should be converted to triples by hand. If encountered, a warning/error should be emitted, suggesting to convert them to triples, and ignoring them completely, otherwise

IMHO this approach belongs in a best practices guide, but is no valid option for a formatting tool: It's not for the tool to decide what authors of valid RDF/Turtle files should do, but "just" work correctly for every valid input.

fkleedorfer commented 2 months ago

Maybe it is a scope issue. One way to look at it is: A formatter should never change content (and comments are content), only form.

Then, all functions that do touch content (double formatting, comments transformation), must be done one multiple other tools (a comment triplifyer or a number handler, maybe also a style watchdog).

We are packing everything into the turtle-formatter because there is no concept of doing multiple passes over an rdf file. In diffplug/spotless, there is, and it would allow for separating concerns nicely. We could, over time, make that happen.

The discussion about comments would not be taking place if rdf parsers retained them in the model. Nobody would expect them to be altered in any way except presentation.

atextor commented 2 months ago

Maybe it is a scope issue. One way to look at it is: A formatter should never change content (and comments are content), only form.

Then, all functions that do touch content (double formatting, comments transformation), must be done one multiple other tools (a comment triplifyer or a number handler, maybe also a style watchdog).

:100: agree.

We are packing everything into the turtle-formatter because there is no concept of doing multiple passes over an rdf file. In diffplug/spotless, there is, and it would allow for separating concerns nicely. We could, over time, make that happen.

turtle-formatter should remain exactly that, a formatter. However, yes, for real-world usage, other tools are required as well. This is another one of the reasons for the planned https://github.com/atextor/owl-cli/issues/20 that will merge turtle-formatter as one tool into a broader scope. Beyond formatting I also envision tools such as code style checks, code smell checks, maybe convenient SHACL validation etc.

The discussion about comments would not be taking place if rdf parsers retained them in the model. Nobody would expect them to be altered in any way except presentation.

You are right; and I still consider the introduction of a Parser with a custom model that retains comments as a valid option (see the second bullet point in my first reply in this issue).

hoijui commented 2 months ago

The warning/error approach explicitly requires a best practice guide, and the message would point to it. This guide would explain how to best (manually) convert comments to triples. All other approaches would implicitly assume the usage of a (much more complex) best-practice guide, that defines how comments should be placed in the code so the formatter assigns them correctly to parts of the code, and makes them reappear in the output where the user would want them to be. Almost certainly, this guide would also have to include special in-comment syntax, to prevent potential mis-interpretation of comments (think e.g. of commented out code that appears in a place where it could also be a doc-comment for the next line).

I use rust a lot, and there, virtually everyone uses rusts own auto-formatter for each git commit. It is very good, but I have seen it do a lot of mistakes (from a human point of view), all of which have comments as their content. This to me clearly proves, that going this way, will not result in something clean, and I think it would be a lot more work. If going that way, I would use a separate tool (or at least a separate owl-cli command) that does the triplification of comments automatically, with clear indications to the user, that his will very likely create undesired results, so they necessarily will have to revise the results manually.

I understand the idea of wanting a tool that just works, but with the sophistication that this tool already has, it is impossible to do without errors. These errors would likely not be noticed any-time soon, unless.. maybe months or years later, someone would read a comment that seems to refer to code, but makes no sense in the place it is, and then what? Most likely the original could not be retrieved, as people tend to use auto-formatters before making a commit (and CIs will likely make that a necessary requirement). Thus it is very important not to make any error, in my eyes.

fkleedorfer commented 2 months ago

think e.g. of commented out code that appears in a place where it could also be a doc-comment for the next line

Good point.

I understand the idea of wanting a tool that just works, but with the sophistication that this tool already has, it is impossible to do without errors. These errors would likely not be noticed any-time soon, unless.. maybe months or years later, someone would read a comment that seems to refer to code, but makes no sense in the place it is, and then what? Most likely the original could not be retrieved, as people tend to use auto-formatters before making a commit (and CIs will likely make that a necessary requirement).

Also a very good point.

The formatter action leading to a problem in these cases is not reproducing comments, though. The root cause would be reordering of statements, which happens, for an unformatted file, once with a big impact, and subsequently only with little impact as changes and additions are made. Compared to not having the option of using comments at all, it still seems the better alternative to me.

hoijui commented 2 months ago

I also think we should have that option, but it should not be the default. By default, it has to be 100% non-destructive.