KOLANICH commented 2 years ago

It is a bit hard to navigate this repo when all the dirs are piled into the main dir.

It is proposed to reorganize it by introducing dirs with semantic names and moving the parsers' dirs into them.

The proposed dir hierarchy:

config - config files and records.
grammar - DSLs describing other grammars.
- text - grammars like the ones for tools like ANTLR
- ddl - DSLs for describing binary grammars, like protobuf, flatbuffers, capnproto, FlexT and so on
programming - programming and scripting languages, like C++ or bash.
programs - for parsing output of software, when it is infeasible to use a machine-readable interface.
protocols - for interfacing servers or devices, single command per line, such as SCPI, AT, JTAG consoles, SMTP, stuff like this.
serialization - serialization languages, like JSON, YAML, protobuf and CSV.
embedded - grammars used as parts of other formats, that don't belong to anywhere else
identifiers - various identifiers, like SSNs, phone numbers, VIN-codes, UUID and so on
- network - network addresses: IPv4, IPv6, MAC, IMEI
- products - product numbers, like HTE721010A9E630

The rest of the identifiers should stay in the root until it is decided where to move them.

KvanTTT commented 2 years ago

I like the suggestion. Also, a similar topic was raised some time ago: https://github.com/antlr/grammars-v4/issues/941. But it looks like your structure is more thought-out.

KvanTTT commented 2 years ago

Could you please describe the detailed transform for all grammars in the repository? I'll suggest fixes if it's required.

teverett commented 2 years ago

@KOLANICH I've resisted changes like this for quite a while. However, with the number of grammars there are now, I think it might be time. I like the structure you've proposed.

One of the reasons I've been reluctant to accept a change like this is that I am worried it will be a barrier to people finding the grammar they're looking for. Could an index of grammars be generated and published as part of this?

KOLANICH commented 2 years ago

The problem with any index is that it has to be updated. It can be automated, though.

kaby76 commented 2 years ago

I think the first thing to do is to propose the new directory structure together with where each existing grammar would be moved. We don't have a "C++" directory but "cpp", and we don't have a Bash grammar at all.

I'm sure there will be several grammars that fit into multiple categories. For example, I have grammars for many parser generator systems, including tree-sitter, which is a JSON structured-document that represents a context-free grammar. What would these all fall under?

I worry that unless there is an index, I won't be able to find a grammar. As @teverett suggests, perhaps what we should have is a generated index page where one would enter search terms. And if I'm working on a particular grammar, I can set up an alias to combine a find and cd to navigate to it at a Bash shell depending on how deep the directory structure is.

Note, the only other realistic grammar database that I know of is Grammar Zoo (index page for the repo). The github repository for this website is https://github.com/slebok/zoo. You can peruse that repo and see how Zaytsev (https://grammarware.net/) organized it. Note, each grammar is described by a meta file (zoo.xml) containing the author, date written, how it was written (e.g., "scraped"), source, DOI for papers, etc. See this example: https://github.com/slebok/zoo/blob/master/zoo/ada/ada83/ichbiah/zoo.xml. The meta could contain searchable terms, which would be a way of generating the indexing page.
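To make the "searchable terms" idea concrete, here is a minimal Python sketch. It assumes a metadata record shaped like the `<source>` element of zoo.xml (author, title, date, DOI link); the function name `search_terms` and the choice of which fields to index are hypothetical, not something Grammar Zoo itself provides.

```python
import xml.etree.ElementTree as ET

# Sample record modelled on the <source> element of a zoo.xml file.
SAMPLE = """<source>
    <author>Jean D. Ichbiah</author>
    <title>Preliminary Ada reference manual; Syntax Summary</title>
    <date>June 1979</date>
    <link><doi>10.1145/956650.956651</doi></link>
</source>"""

# Hypothetical choice: only these fields contribute search terms.
INDEXED_TAGS = ("author", "title", "date")


def search_terms(xml_text):
    """Return a set of lowercase words taken from the indexed fields."""
    root = ET.fromstring(xml_text)
    terms = set()
    for elem in root.iter():
        if elem.tag in INDEXED_TAGS:
            text = (elem.text or "").lower().replace(";", " ")
            for word in text.split():
                terms.add(word.strip(".,"))
    terms.discard("")  # drop words that were pure punctuation
    return terms


if __name__ == "__main__":
    print(sorted(search_terms(SAMPLE)))
```

An index generator could collect these sets per grammar and invert them into a term-to-grammar map.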

KOLANICH commented 2 years ago

> Grammar Zoo

Thanks a lot for letting me know about this project. In fact, I didn't know about Zaytsev's work and have created (well, not really "created", it is very immature) something similar: my own DSL, with the goal of being transpiled into the DSLs of as many different parser generators as possible and still working after transpilation (a wrapper is also generated so that the built AST can be used uniformly). My main motivation for this proposal was to have the grammars structured, so that I don't go mad when porting your grammars into my DSL.

> You can peruse that repo and see how Zaytsev (https://grammarware.net/) organized it.

It seems that the organization relies more on XML files than on directory structure, at least https://github.com/slebok/zoo/tree/master/zoo looks like a pile similar to the one we see in this repo.

The hierarchy I propose for this repo is influenced more by the one we use in https://github.com/kaitai-io/kaitai_struct_formats/ (I'm a contributor to that repo).

> perhaps what we should have is a generated index page where one would enter search terms.

Fortunately, one can enter search terms into GitHub search, and it works without JavaScript, but to be honest, I dislike the ranking: https://github.com/antlr/grammars-v4/search?q=json&type=code&l=ANTLR doesn't show the JSON grammars among the first results.

KOLANICH commented 2 years ago

<source>
        <author>Jean D. Ichbiah</author>
        <title>Preliminary Ada reference manual; Syntax Summary</title>
        <subtitle>ACM SIGPLAN Notices, Volume 14 Issue 6a</subtitle>
        <date>June 1979</date>
        <specific>pages E-1 to E-5 (142-146)</specific>
        <link>
            <doi>10.1145/956650.956651</doi>
        </link>
</source>

In UG and KS we inline this kind of metadata into grammars themselves under a meta key. In ANTLR it is not possible without the DSL extension, I guess, but we can probably rely on a convention to embed a comment with YAML/NEON/TOML/JSON/HCL2 or any other text language for serialization

kaby76 commented 2 years ago

> In UG and KS we inline this kind of metadata into grammars themselves under a meta key. In ANTLR it is not possible without the DSL extension, I guess, but we can probably rely on a convention to embed a comment with YAML/NEON/TOML/JSON/HCL2 or any other text language for serialization

The .g4 files can have comments (block /* ... */, or line //), so we could embed metadata in a comment. The main problem I have with keeping this information in the grammar file is that I usually don't want to see all of it every time I edit the grammar. I prefer to see just the context-free grammar, nothing else. But an IDE can hide all that when editing.
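As a rough illustration of the comment convention, the Python sketch below reads metadata from a block comment and can also strip it for a "grammar only" view. The `/* meta: {...} */` marker, the JSON payload, and the field names are all assumptions made up for this example; nothing in ANTLR or this repo defines them.

```python
import json
import re

# Hypothetical convention: a block comment that starts with "meta:" and
# contains a single JSON object.
META_RE = re.compile(r"/\*\s*meta:\s*(\{.*?\})\s*\*/", re.DOTALL)

GRAMMAR = """/* meta:
{"name": "JSON", "categories": ["serialization"], "keywords": ["json", "rfc8259"]}
*/
grammar JSON;
value : STRING | NUMBER ;
"""


def read_meta(grammar_text):
    """Return the embedded metadata as a dict, or None if there is none."""
    match = META_RE.search(grammar_text)
    return json.loads(match.group(1)) if match else None


def strip_meta(grammar_text):
    """Return the grammar text with the metadata comment removed."""
    return META_RE.sub("", grammar_text, count=1).lstrip()


if __name__ == "__main__":
    print(read_meta(GRAMMAR))
    print(strip_meta(GRAMMAR))
```

Because the payload sits inside an ordinary block comment, the ANTLR tool would ignore it, and a tool like `strip_meta` could serve editors that cannot fold comments.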

KOLANICH commented 2 years ago

> Could you please describe the detailed transform for all grammars in the repository?

2830

RossPatterson commented 2 years ago

I have to say, as a retired long-time programmer, I don't find the proposed organization any better than the flat model we currently have. One person's obvious hierarchy is another person's chaos.

I think we'd be far better off with a structured metadata file in each grammar's root directory, and an automatically-recreated index file in the repo's root based on those files.
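A minimal Python sketch of such an index generator follows. It assumes each grammar's root directory holds a `meta.json` file with `name` and `categories` keys; the file name, the keys, and the table layout are hypothetical choices for illustration only.

```python
import json
from pathlib import Path


def build_index(repo_root):
    """Build a Markdown table from */meta.json files directly under repo_root."""
    rows = []
    for meta_path in Path(repo_root).glob("*/meta.json"):
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        directory = meta_path.parent.name
        name = meta.get("name", directory)  # fall back to the directory name
        categories = ", ".join(meta.get("categories", []))
        rows.append((name, directory, categories))

    lines = ["| Grammar | Directory | Categories |", "| --- | --- | --- |"]
    for name, directory, categories in sorted(rows, key=lambda r: r[0].lower()):
        lines.append(f"| {name} | {directory} | {categories} |")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the index for the current directory; a CI job could instead
    # write the result to an index file in the repo root.
    print(build_index("."), end="")
```

Since `categories` is a list, a grammar that fits several categories can be listed under all of them without moving its directory.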

antlr / grammars-v4

Reorganize this repo: distribute the dirs over categories #2826