Closed KOLANICH closed 2 years ago
I like the suggestion. A similar topic was raised some time ago: https://github.com/antlr/grammars-v4/issues/941 But your structure looks better thought out.
Could you please describe the detailed transform for all grammars in the repository? I'll suggest fixes if it's required.
@KOLANICH I've resisted changes like this for quite a while. However, with the number of grammars there are now, I think it might be time. I like the structure you've proposed.
One of the reasons I've been hesitant to accept a change like this is that I worry it will be a barrier to people finding the grammar they're looking for. Could an index of grammars be generated and published as part of this?
The problem with any index is that it has to be updated. It can be automated, though.
I think the first thing to do is propose the new directory structure and say where each grammar that currently exists would be moved to. We don't have a "C++" directory but "cpp", and we don't have a Bash grammar at all.
I'm sure there will be several grammars that fit into multiple categories. For example, I have grammars for many parser generator systems, including tree-sitter, whose grammar is a JSON-structured document that represents a context-free grammar. What would these all fall under?
I worry that unless there is an index, I won't be able to find a grammar. As @teverett suggests, perhaps what we should have is a generated index page where one would enter search terms. And if I'm working on a particular grammar, I can set up an alias that combines find and cd to navigate to it at a Bash shell, depending on how deep the directory structure is.
Note, the only other realistic grammar database that I know of is Grammar Zoo. The GitHub repository for this website is https://github.com/slebok/zoo. You can peruse that repo and see how Zaytsev (https://grammarware.net/) organized it. Note, each grammar is described by a meta file (zoo.xml) containing the author, date written, how it was written (e.g., "scraped"), source, DOI for papers, etc. See this example: https://github.com/slebok/zoo/blob/master/zoo/ada/ada83/ichbiah/zoo.xml. The meta file could contain searchable terms, which would be a way of generating the index page.
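As a rough illustration of how searchable terms could be pulled from such a meta file, here is a minimal Python sketch. It assumes only a `<source>` element with `author`, `title`, and `date` children, as in the linked Ada example; the function name `search_terms` and the choice of fields are hypothetical, and the rest of the zoo.xml layout is not modelled.

```python
import xml.etree.ElementTree as ET

# Sample modelled on the <source> element of the linked zoo.xml example.
SAMPLE = """
<source>
  <author>Jean D. Ichbiah</author>
  <title>Preliminary Ada reference manual; Syntax Summary</title>
  <date>June 1979</date>
  <link><doi>10.1145/956650.956651</doi></link>
</source>
"""


def search_terms(xml_text):
    """Return sorted lowercase words taken from author, title, and date."""
    root = ET.fromstring(xml_text.strip())
    words = set()
    for tag in ("author", "title", "date"):  # assumed set of searchable fields
        for element in root.iter(tag):
            text = (element.text or "").replace(";", " ")
            for word in text.split():
                words.add(word.strip(".,").lower())
    words.discard("")
    return sorted(words)


if __name__ == "__main__":
    print(search_terms(SAMPLE))
```

An index page could then map each term to the grammars whose meta files mention it.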
Grammar Zoo
Thanks a lot for letting me know about this project. In fact, I didn't know about Zaytsev's work and have created something similar (well, not really "created", it is very immature): my own DSL, meant to be transpiled into the DSLs of as many different parser generators as possible and to keep working after transpilation (a wrapper is also generated so that the built AST can be used uniformly). My main motivation for this proposal was to have the grammars structured, so that I don't go mad when porting your grammars into my DSL.
You can peruse that repo and see how Zaytsev (https://grammarware.net/) organized it.
It seems that the organization relies more on XML files than on directory structure, at least https://github.com/slebok/zoo/tree/master/zoo looks like a pile similar to the one we see in this repo.
The hierarchy I propose for this repo is more influenced by the one we (I'm a contributor of that repo) use in https://github.com/kaitai-io/kaitai_struct_formats/ .
perhaps what we should have is a generated index page where one would enter search terms.
Fortunately, one can enter search terms into GitHub search, and it works without JavaScript, but to be honest, I dislike the ranking: https://github.com/antlr/grammars-v4/search?q=json&type=code&l=ANTLR doesn't show the JSON grammars among the first results.
<source>
<author>Jean D. Ichbiah</author>
<title>Preliminary Ada reference manual; Syntax Summary</title>
<subtitle>ACM SIGPLAN Notices, Volume 14 Issue 6a</subtitle>
<date>June 1979</date>
<specific>pages E-1 to E-5 (142-146)</specific>
<link>
<doi>10.1145/956650.956651</doi>
</link>
</source>
In UG and KS we inline this kind of metadata into grammars themselves under a meta key. In ANTLR it is not possible without the DSL extension, I guess, but we can probably rely on a convention to embed a comment with YAML/NEON/TOML/JSON/HCL2 or any other text language for serialization.
The .g4 files can have comments (block /* ... */, or line //), so we could embed metadata in a comment. The main problem I have with having this information in the grammar file is that I usually don't want to see all that every time I edit the grammar. I prefer to just see the context-free grammar, nothing else. But, an IDE can hide all that when editing.
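To make the comment-embedding idea concrete, here is a minimal Python sketch. The convention is entirely hypothetical: it assumes the metadata is a JSON object inside a block comment that starts with `meta:`. ANTLR defines no such marker, and the key names below are made up for illustration.

```python
import json
import re

# Hypothetical convention: the first block comment beginning with "meta:"
# holds a JSON object. Nothing in ANTLR itself defines this marker.
META_COMMENT = re.compile(r"/\*\s*meta:(.*?)\*/", re.DOTALL)


def read_embedded_meta(grammar_text):
    """Return the embedded metadata as a dict, or None if there is none."""
    match = META_COMMENT.search(grammar_text)
    if match is None:
        return None
    return json.loads(match.group(1))


EXAMPLE = """\
/* meta:
{"name": "JSON", "category": "serialization"}
*/
grammar JSON;
value : STRING | NUMBER ;
"""

if __name__ == "__main__":
    print(read_embedded_meta(EXAMPLE))
```

One limitation of this sketch: the metadata must not itself contain `*/`, since that would end the comment early.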
Could you please describe the detailed transform for all grammars in the repository?
I have to say, as a retired long-time programmer, I don't find the proposed organization any better than the flat model we currently have. One person's obvious hierarchy is another person's chaos.
I think we'd be far better off with a structured metadata file in each grammar's root directory, and an automatically-recreated index file in the repo's root based on those files.
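A sketch of what such an automatically recreated index could look like, in Python. The file name `meta.json` and its `name` and `keywords` keys are assumptions made for this example; the thread does not settle on a metadata format.

```python
import json
from pathlib import Path

META_NAME = "meta.json"  # hypothetical file name, not an existing convention


def build_index(repo_root):
    """Return one index line per directory that contains a metadata file."""
    root = Path(repo_root)
    lines = []
    for meta_path in sorted(root.rglob(META_NAME)):
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        rel_dir = meta_path.parent.relative_to(root).as_posix()
        name = meta.get("name", rel_dir)  # fall back to the directory path
        keywords = ", ".join(meta.get("keywords", []))
        lines.append(f"- [{name}]({rel_dir}): {keywords}")
    return lines


if __name__ == "__main__":
    print("\n".join(build_index(".")))
```

Because the walk is recursive, the same script would work for both the flat layout and a nested one, so the index would not depend on which directory structure is chosen.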
It is a bit hard to navigate this repo when all the dirs are piled into the main dir.
It is proposed to reorganize it by introducing dirs with semantic names and moving the parsers' dirs into them.
The proposed dir hierarchy:
config - config files and records.
grammar - DSLs describing other grammars.
text - grammars like the ones for tools like ANTLR
ddl - DSLs for describing binary grammars, like protobuf, flatbuffers, capnproto, FlexT and so on
programming - programming and scripting languages, like C++ or bash.
programms - for parsing output of software, when it is infeasible to use a machine-readable interface.
protocols - for interfacing servers or devices, single command per line, such as SCPI, AT, JTAG consoles, SMTP, stuff like this.
serialization - serialization languages, like JSON, YAML, protobuf and CSV.
embedded - grammars used as parts of other formats, that don't belong to anywhere else
identifiers - various identifiers, like SSNs, phone numbers, VIN-codes, UUID and so on
network - network addresses: IPv4, IPv6, MAC, IMEI
products - product numbers, like HTE721010A9E630
The rest of identifiers should stay in root until it is decided where they are to be moved.
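The move itself could be planned with a small script. The following Python sketch assumes a hand-maintained mapping from existing top-level directory names to the proposed categories; the three assignments shown are illustrative only and are not decided anywhere in this thread. Directories without an assignment are reported separately so that they stay in the root.

```python
# Illustrative assignments only; the thread does not fix any of these.
CATEGORY_OF = {
    "cpp": "programming",
    "json": "serialization",
    "csv": "serialization",
}


def plan_moves(top_level_dirs, category_of):
    """Split directories into (old, new) moves and a list left in the root."""
    moves = []
    unassigned = []
    for name in sorted(top_level_dirs):
        category = category_of.get(name)
        if category is None:
            unassigned.append(name)  # stays in the root until decided
        else:
            moves.append((name, f"{category}/{name}"))
    return moves, unassigned


if __name__ == "__main__":
    moves, unassigned = plan_moves(["json", "cpp", "abnf"], CATEGORY_OF)
    for old, new in moves:
        print(f"{old} -> {new}")
    print("left in root:", unassigned)
```

Each (old, new) pair could then be handed to git mv, which would keep the file history attached to the moved grammars.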