antlr / grammars-v4

Grammars written for ANTLR v4; expectation that the grammars are free of actions.
MIT License

META: Create ANTLR Grammar from xml schema definition #1971

Open DerIltis opened 3 years ago

DerIltis commented 3 years ago

Dear all,

I was recently given the task of "parsing" XML files that conform to a given XSD schema. Rather than manually translating the XSD definition into an ANTLR grammar, I was wondering whether you know of something like a "grammar generator for ANTLR" (or for any other parser generator) that I could use.

With kind regards

ghost commented 3 years ago

An XML validator is more suitable than ANTLR.

DerIltis commented 3 years ago

Yes, I see the point that an xml validator is suitable for most purposes, in particular for the major task of validating whether an xml conforms to a given xsd.

However, my question concerns a slightly different point: in my work, I am commonly asked to create a grammar for a specific file format or mini language, not for one specific purpose, but to enable different user groups to access the data in large sets of files that are "language compliant". As such, I do not care much about validation - I simply assume that each XML file I am given conforms to the XSD schema, and I want to enable access to the data it contains.

I commonly prefer providing ANTLR grammars for such tasks, because

At the moment, I tend to think of an XSD schema as something like a grammar, simply because, to my understanding, it defines the set of strings that conform to the schema (and which are valid XML files in this case) - just as an ANTLR grammar defines the set of strings in an LL(*) language. (Although I don't know whether an XSD schema actually defines a context-free language at all, and, if it does, whether that language is in, say, LALR(1) or LL(*).)

This "intuition" is supported by the fact that there are various tools that create "classes" corresponding to an XSD in specific object-oriented languages [1,2,3,4]. With these tools, the (one) XSD schema definition relates to the set of valid XML files just as the corresponding class relates to the set of instances that can be created from those XML files.

For a specific use case and a particular target language, I can probably choose one of these "xsd-to-class" tools. However, I would like to stay more abstract, as otherwise I would lose both of the advantages of grammars mentioned above:

Therefore I was wondering whether there is a more generic approach. Shouldn't it be possible to automatically "translate" an XSD schema definition into an ANTLR grammar, which I could then provide so that others can implement their use cases, just as I do with other data formats? If not, is there a specific (theoretical) reason why this cannot be done? (Is the language defined by an arbitrary XSD not context-free? Would omitting some XSD features yield CFGs for the restricted schemas? See also a related question on the W3C mailing list from 2005 [5].)
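
To make the idea concrete, here is a minimal Python sketch of such a translation for a deliberately small XSD subset. It is an illustration under stated assumptions, not an existing tool: it only looks at `xs:element`, `xs:sequence`, `xs:choice`, `minOccurs` and `maxOccurs`, it ignores attributes, namespaces, simple-type facets and whitespace between tags, and it treats each start and end tag as a single literal token. The names `xsd_to_antlr`, `rule_name`, the `el_` prefix and the `TEXT` token are made up for this example, and the emitted grammar text has not been run through ANTLR.

```python
import re
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"
PARTICLES = (XS + "element", XS + "sequence", XS + "choice")


def rule_name(element_name):
    # Hypothetical naming scheme: ANTLR parser rules must start with a
    # lowercase letter, so prefix every element name with "el_".
    return "el_" + re.sub(r"\W", "_", element_name)


def occurs(node):
    # Map minOccurs/maxOccurs to an EBNF suffix. Bounded counts above 1
    # are only approximated by "+" or "*".
    lo = node.get("minOccurs", "1")
    hi = node.get("maxOccurs", "1")
    many = hi == "unbounded" or int(hi) > 1
    if many:
        return "*" if lo == "0" else "+"
    return "?" if lo == "0" else ""


def particle(node):
    # Translate one content-model particle into an ANTLR rule fragment.
    if node.tag == XS + "element":
        return rule_name(node.get("ref") or node.get("name")) + occurs(node)
    sep = " " if node.tag == XS + "sequence" else " | "
    parts = [particle(child) for child in node if child.tag in PARTICLES]
    return "(" + sep.join(parts) + ")" + occurs(node)


def xsd_to_antlr(xsd_text, grammar_name="Generated"):
    root = ET.fromstring(xsd_text)
    rules = []
    for el in root.iter(XS + "element"):
        name = el.get("name")
        if name is None:
            continue  # ref="..." is a use, not a declaration
        # Assumption: the content model is declared inline.
        model = el.find(f"{XS}complexType/{XS}sequence")
        if model is None:
            model = el.find(f"{XS}complexType/{XS}choice")
        body = particle(model) if model is not None else "TEXT?"
        rules.append(f"{rule_name(name)} : '<{name}>' {body} '</{name}>' ;")
    return f"grammar {grammar_name};\n" + "\n".join(rules) + "\nTEXT : ~[<]+ ;\n"


SAMPLE_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="note">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="to" type="xs:string"/>
        <xs:element name="body" type="xs:string"
                    minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

if __name__ == "__main__":
    print(xsd_to_antlr(SAMPLE_XSD))
```

Even this toy version shows where the open questions are: features such as `xs:all`, named types, type derivation, and identity constraints have no obvious one-line counterpart in a grammar rule, so a real generator would have to either expand them or drop them.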

Have you come across any tool that supports something similar, perhaps by means other than CFGs?

[1] Python: http://www.davekuhlman.org/generateDS.html
[2] C#: https://stackoverflow.com/questions/5217665/how-to-generate-net-4-0-classes-from-xsd
[3] Java: https://javaee.github.io/jaxb-v2/
[4] C++: https://code.google.com/archive/p/xplus-xsd2cpp/
[5] https://lists.w3.org/Archives/Public/www-xml-schema-comments/2005AprJun/0058.html
[6] https://en.wikipedia.org/wiki/Simple_API_for_XML
[7] https://www.baeldung.com/java-sax-parser

DerIltis commented 3 years ago

Hi there,

Is there anyone out there who can leave a useful comment on my reply from December?

inventivejon commented 3 years ago

Unfortunately I don't have an answer. But if you stumble over a useful solution please let me know. I have a similar task to solve... BR

DerIltis commented 3 years ago

I just found this quote from Terence Parr:

"E.g., [...] any XML pages whose form can be expressed with a DTD [are context-free]." (Source: https://github.com/antlr/stringtemplate4/blob/master/doc/motivation.md)

This suggests that it should be possible to create a grammar for such files, which in turn would be a reason to automate generating the grammar from the DTD!?
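
A DTD content model is already written in a regular-expression-like notation (`,` for sequence, `|` for choice, and the `?`, `*`, `+` suffixes), so a first approximation of such a generator can be very small. The following Python sketch is only an illustration under assumptions: it handles `<!ELEMENT ...>` declarations only, ignores attribute lists, entities and whitespace between tags, rejects `ANY`, and models tags as literal tokens. The names `dtd_to_rules`, `rule_name`, the `el_` prefix and the `TEXT` token are invented for the example, and the resulting rules have not been checked with ANTLR.

```python
import re

# Matches declarations such as: <!ELEMENT note (to, body*)>
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([\w.-]+)\s+([^>]+?)\s*>")


def rule_name(element_name):
    # Hypothetical naming scheme so that every rule starts lowercase.
    return "el_" + re.sub(r"\W", "_", element_name)


def translate_model(model):
    # Turn a DTD content model into the body of an ANTLR-style rule.
    if model == "EMPTY":
        return ""
    if model == "ANY":
        raise ValueError("ANY content is not handled in this sketch")
    if re.sub(r"\s", "", model) == "(#PCDATA)":
        return "TEXT?"  # text-only element, possibly empty
    body = re.sub(
        r"#PCDATA|[\w.-]+",
        lambda m: "TEXT" if m.group() == "#PCDATA" else rule_name(m.group()),
        model,
    )
    # A DTD sequence uses commas; in a grammar rule, juxtaposition suffices.
    return re.sub(r"\s+", " ", body.replace(",", " "))


def dtd_to_rules(dtd_text):
    rules = []
    for name, model in ELEMENT_DECL.findall(dtd_text):
        parts = [f"'<{name}>'", translate_model(model), f"'</{name}>'"]
        rules.append(f"{rule_name(name)} : " + " ".join(p for p in parts if p) + " ;")
    return rules


SAMPLE_DTD = """
<!ELEMENT note (to, body*)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT body (#PCDATA | em)*>
<!ELEMENT em (#PCDATA)>
"""

if __name__ == "__main__":
    for rule in dtd_to_rules(SAMPLE_DTD):
        print(rule)
```

Because the operators carry over almost unchanged, most of the work is renaming; the parts a DTD cannot express (data types, namespaces) are exactly where XSD goes further and where the translation becomes less direct.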