JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.73k stars 5.48k forks source link

[TOML stdlib] Support parsing comments into data structure so they can be written back out #42697

Open ericphanson opened 3 years ago

ericphanson commented 3 years ago

so we can e.g. use comments in Project.toml's without Pkg removing them.

I found https://github.com/toml-lang/toml/issues/284 which led me to https://github.com/joelself/tomllib (rust, 5 years since the last commit so maybe not supporting TOML v1) and https://github.com/sdispater/tomlkit (python, actively updated) which do support comments.

I wonder what would be involved in doing this? I wonder if a TOMLDict would have to change from just a Dict{String, Any}.

KristofferC commented 3 years ago

I thought about this but one reason why I didn't attempt this is that the parser here is also used in Julia Base for loading pacakge and there you want it to be as simple as possible.

ericphanson commented 3 years ago

That makes sense, but it seems a real shame to not allow comments in Project.tomls or things like preference files (though I haven’t used those yet). Maybe Base can use something like a (vendored copy of?) an older version of the package until all the bugs are worked out, eg some kind of LTS release of the package.

KristofferC commented 3 years ago

There is no strict requirement that Base uses the same parser, no. The first step however is to actually have a TOML parser that produces a "concrete syntax tree". You would need to keep track of comments, indentation, if tables are inline or not etc. Probably a bit tricky to get right.

StefanKarpinski commented 3 years ago

This is very tricky to do. As @KristofferC says, you'd want to parse a "concrete syntax tree" instead of a simple data structure that is currently returned. But you still want that to behave like a data structure since otherwise you can't access the parsed data. Ok, so maybe you can have a concrete syntax tree type that also behaves like a dict/array/whatever and remembers comments. Does it also allow modification of the data structure? What happens to comments when you modify the data structure? I can come up with messy examples, but I suspect you can see what I'm getting at. The interaction between comments and data structure modification is just super unclear, let alone the implementation. Note that without modification this is a non-issue: if you only need to read a TOML fine and not modify it then preserving comments doesn't matter. That doesn't mean that this is impossible, but before any implementation is considered, the first step is having some reasonable rules about what happens to comments when the TOML data is modified.

ericphanson commented 3 years ago

Yeah, that makes sense. I was thinking maybe there can be a canonical form that it could be parsed into, so that it wouldn't fully roundtrip with spacing and formatting and everything, but could at least preserve the contents of the comments. E.g. every value in the nested set of dicts is a tuple with the actual value and any comments encountered in the parsing of that value, and they get written out in some canonical way (the comment is always written the line above the value or something). With something like that, I think modification is more straightforward; either you modify the comment or not, and if you drop the key then you lose the comments. And the dict with tuple values could of course be an AbstractDict that exposes the value element of the tuple as the value for get, getindex etc with some helper to expose the comments when needed.

Obviously there's a lot of details in there to figure out in there still (how to choose which value to associate each comment to, how to write them out etc).

StefanKarpinski commented 3 years ago

Another consideration: currently we canonicalize the format of a file when we write the TOML out — we sort table keys and sections. We also convert some ways of expression data structures to other ways of expressing the same data structure. Would this involve not doing that and keeping things in the same order? If so, I'm not sure it's worth it. Keeping the file in a canonical form is really quite nice, IMO.

Maybe we can simplify the scope of this: what kinds of comments did you want to preserve? Some comments are much more obvious how to preserve than others.

ericphanson commented 3 years ago

FWIW it looks like tomlkit does the full white-space preserving thing in 4k lines of python, which doesn't sound absolutely terrible. (edit: I'm not saying we should do that though).

Would this involve not doing that and keeping things in the same order?

Yeah, I don't think it needs to and I don't think it's incompatible with canonicalizing, we just should also parse and canonicalize the comments IMO.

Maybe we can simplify the scope of this: what kinds of comments did you want to preserve? Some comments are much more obvious how to preserve than others.

Hm, I was thinking things like any subtleties in compat annotations or maybe explaining dependencies

name = "MyPkg"
uuid = "fa267f1f-6049-4f14-aa54-33bafae1ed76"
version = "1.0.3"

[deps]
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
# `Dep` is used for fooing bars
Dep = "0796e94c-ce3b-5d07-9a54-7f471281c624"

[compat]
julia = "1"
Dep = "~1.1" # see issue#123 before bumping to v1.2

[extras]
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["Test"]
StefanKarpinski commented 3 years ago

tomlkit does the full white-space preserving thing

Would be helpful to see what its rules are. Based on your example, it looks like the rule is that a comment on a line by itself is associated with the following item and a trailing comment is associated with the item where it appears. What about comments before a table section?

name = "MyPkg"
uuid = "fa267f1f-6049-4f14-aa54-33bafae1ed76"
version = "1.0.3"
# what is this comment attached to?
[deps]
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
# `Dep` is used for fooing bars
Dep = "0796e94c-ce3b-5d07-9a54-7f471281c624"

[compat]
julia = "1"
Dep = "~1.1" # see issue#123 before bumping to v1.2

[extras]
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["Test"]

Do blank lines affect this? Is there a way to put a comment at the top of the file that won't get deleted if the first key in the file is deleted? Is there a way to put a comment at the end of a section without it being attached to the following section?

Btw, I suspect that access to reading or modifying comments is not worth it: they should be viewed as something that humans insert into the file manually which are only for human consumption.

StefanKarpinski commented 3 years ago

tomlkit seems to have zero documentation of any of its design decisions for how comments are associated with TOML data, which is not exactly reassuring. I would guess that its behavior is full of unpleasant surprises.

ericphanson commented 3 years ago

Hmm, maybe there should be a notion of a top-level comment as well? So something like

# this is a "top-level" comment since there is a blank line following it

name = "MyPkg"
uuid = "fa267f1f-6049-4f14-aa54-33bafae1ed76"
version = "1.0.3"
# This comment is attached to the `deps` table since there is no blank line separating it
[deps]
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
# `Dep` is used for fooing bars
Dep = "0796e94c-ce3b-5d07-9a54-7f471281c624"

[compat]
julia = "1"
Dep = "~1.1" # see issue#123 before bumping to v1.2

[extras]
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["Test"]
StefanKarpinski commented 3 years ago

Ok, now we're getting somewhere. So there seem to be three kinds of standalone comments (as opposed to comments on the same line as or inside of some data):

That still leaves some questions. What about this:

foo = 123
# comment

bar = 456

Is this comment associated with foo or bar? Neither? It looks more attached to foo than bar. What about:

foo = 123

# comment

bar = 456

This comment doesn't look associated with either foo or bar but we have to associate it with one of them if we're going to canonicalize the ordering of our TOML output. If, on the other hand, we preserve item order instead of canonicalizing it, then we could just preserve that this comment comes between foo and bar. That doesn't entirely get us off the hook though: what happens if we insert a value between foo and bar? We have to decide whether the comment comes after foo or before bar. One guess at what would be intuitive is that you count blank lines and associate with the closer item with ties going to the following item. But what about this then:

foo = 123
# 1

# 2

# 3
bar = 456

Do we view this as one big comment block attached to bar? Or is it three separate comments? In which case it seems clear that we'd want to attach comment 1 to foo and comment 3 to bar. But what about comment 2?

ericphanson commented 3 years ago

Those rules make sense. I wonder though if one could get away without trailing comments? And just have "any top-level comment in the section is canonicalized as a leading comment". So then

foo = 123
# 1

# 2

# 3
bar = 456

would be canoncialized as

# 1
# 2

foo = 123
# 3
bar = 456

And I think the same could be inside a table etc (though maybe this is what you mean by a section). In other words, a comment can either be attached or not, and if it is not attached it shows up at the top of whatever container it is in (the whole TOML file, the table, etc).

And for what counts as attached I would get my intuition from docstrings, so line above works, line below does not. Same line works.

StefanKarpinski commented 3 years ago

We could, but eliminating trailing comments doesn't simplify the problem at all: if we decide that comment 2 belongs to foo rather than bar, why not just let it stay where it is? The hard part is deciding which item to associate it with in the first place. Once that's decided, not much is gained by moving it above the item. It would simplify the problem if we either threw an error for an ambiguously placed comment (like comment 2 here) or just deleted it, but I don't think either of those behaviors would be acceptable to people.

ericphanson commented 3 years ago

We could, but eliminating trailing comments doesn't simplify the problem at all: if we decide that comment 2 belongs to foo rather than bar, why not just let it stay where it is?

Ah, I'm not saying 2 is associated to foo. I'm saying 3 is attached to bar, and no other comments are attached to anything, and all floating comments should just get printed at the top of whatever section they are in.

StefanKarpinski commented 3 years ago

Ah, ok. That's certainly simple enough. Not sure how people will feel about it though — I suspect that people will feel that this is moving their comments around arbitrarily, but maybe this is fine. I do think that at least inline comments should be supported and remain attached to whatever item they're next to.

StefanKarpinski commented 3 years ago

Note that rather than modifying TOML to return a special concrete syntax tree data structure, you could instead have a comments = Dict() keyword to the parse functions that populates a data structure with comments. Later when printing TOML back out, you could pass the same data structure and it would be used to emit comments. Not sure what the best structure would be but you would basically just have to make the node path as a tuple of strings to a comment value; becomes a bit more complicated if you want to distinguish different comment placements.

ericphanson commented 3 years ago

Not sure how people will feel about it though — I suspect that people will feel that this is moving their comments around arbitrarily, but maybe this is fine.

Yeah, I agree it might be seen as annoying. I think however going from "comments get deleted" to "attach your comments like docstrings or they get moved to the top" is a big win and a simple enough rule that I'm hoping folks will just be happy to have comments and quickly learn the rule. Though maybe once we get used to having comments we will want more and more 😅.

But I do think moving is a lot better than deleting (deleting is scary especially if it seems unpredictable, e.g. you don't know the rules, and feels like data loss) so I think some kind of fallback is needed if we don't want to support all positions, and putting them at the top seems simple enough and supports the "intro comment" usage. But maybe there's a better fallback.

StefanKarpinski commented 3 years ago

Trying the "comments are just a block of text that goes before an element" approach seems like a good idea. I like that it associates just one (multiline) comment string with each node in the TOML data structure. It would transform a comment like this:

foo = 123 # comment

into a comment like this:

# comment
foo = 123

but I think that's ok. Similarly it would turn this:

# comment 1
foo = 123 # comment 2

into this:

# comment 1
# comment 2
foo = 123

I would probably make the API like this:

data, comments = TOML.parsefile(file, comments = true)
TOML.print(io, data, comments = comments)

The comments data structure should probably not be specified as you might want to change it in the future, but a Dict{NTuple{N, String} where N, String} where each key is a tuple of keys in the data structure might work well.

cossio commented 2 years ago

What's holding this up?

I don't know what other people think, but I'd be fine with this approach.

ericphanson commented 2 years ago

What's holding this up?

As always in open source, the default is for nothing to happen, because someone needs to spend time for it to be otherwise. So "no blockers" doesn't mean "it'll happen now", it means "now anyone who wants to spend time on it is empowered to do so". So comments like this are pretty annoying in my opinion, because usually it means you want other people to spend time on it but don't want to spend your own.

edit: FYI, you can convey pretty much the same thing without annoying people as much by saying something like "+1 for this feature, it would be great to have. The approach sounds fine to me." The key difference is that would be expressing your desire for the feature without implying it is someone else's duty to implement it for you.

cossio commented 2 years ago

Didn't mean it to be read in any other way, apologies if the tone came out wrong. Also was genuinely curious if there was any fundamental blocking issue.

I actually thought (probably very naively) that this wouldn't be so difficult.

A practical solution (perhaps) is to support only small comments at the end of a line:

[compat]
DiffRules = "1.8" # I need this because of ...

For me this is the most important thing, since I always forget why I put a compat bound.

I would say if a comment is at the end of a line like this, then it's fine to remove that line (with the comment) after pkg> rm DiffRules.

All the modifications Pkg does to Project.toml file are adding or removing entire lines (is that so?). It also reorders lines sometimes ([deps] are put in alphabetical order? is this necessary?).

What if: Before modifying the Project.toml, we copy all the lines into a variable. Do all the modifications from TOML. Then at the end, compare the resulting file, with the copy we made at the beginning. We check whether a line in the original file starts with the same string as a line in the new file, but differ only after the # character; if so, reinsert the thing following # (the comment).

This hack though would probably go on Pkg.jl, not in TOML.jl.

Again, this is probably naive since I'm only considering how I typically use my Project.toml, and there are likely edge cases I'm not seeing.

ericphanson commented 2 years ago

Didn't mean it to be read in any other way, apologies if the tone came out wrong. Also was genuinely curious if there was any fundamental blocking issue.

Ah okay, sorry for misinterpreting. I don’t think there are any fundamental blockers; seems like the API Stefan laid out would work and it “just” needs implementation. I’m not really sure how hard that would be; I took a quick look and saw comments are currently handled with a few functions that skip them when encountered, so I guess that would have to be replaced with functions that parse them and associate them with the right keys.