rpoyner-tri commented 2 years ago

This is a reduced-scope version of #12610. For URDF, we do not yet have a mechanism that can reasonably complain about completely unknown wrong input data. The current implementation relies on tinyxml2 to build a "DOM" tree, which the "parser" then effectively performs bespoke queries against. Since tinyxml2 does not even guarantee well-formedness, it is possible to have files full of stray garbage (#16516), near-miss formulations of known tags, etc.

The victory condition of this ticket is to produce a parser that can complain about unknown/malformed elements in a controllable, comprehensible, actionable way.

It may be a lot of work.

Possible paths include:

add a schema checking pass to the URDF parse; requires schema maintenance and possibly tricky diagnostic plumbing.
rewrite the existing tinyxml2 parser to visit every element/attribute and make some decision (implement-on-plant, diagnostic, ignore). ** requires disciplined coding and/or a switch to some visitor pattern (tinyxml2 SAX API??)
use some other library/technology that helps with these problems.

rpoyner-tri commented 2 years ago

See #16229 for an overview of parser improvements.

jwnimmer-tri commented 2 years ago

See also https://github.com/RobotLocomotion/drake/issues/16785#issuecomment-1098445866 for an implementation idea:

For URDF, separate from any other parsing, we could walk the entire tinyxml2 tree from the root on down and check every element name and attribute name against a position-agnostic allow-list (i.e., irrespective of its parent within the tree), producing an error for any completely-unknown name. That would catch typos where the name was wrong no matter where it might live in the tree, but it would not detect elements or attributes that were located in the wrong position within the tree. So, it is not very precise, but on the other hand it's only ~100 lines of code to implement.

Or on another tack, the libsdformat syntax checker and low-level DOM is actually quite good (https://github.com/ignitionrobotics/sdformat/tree/sdf12/sdf/1.9). With maybe a few days of work, we could encode (Drake's subset of) the URDF specification into that format, and then use libsdformat to do the syntax checking. Possibly even OpenRobotics would be interested to collaborate on that.

jwnimmer-tri commented 1 year ago

See https://github.com/RobotLocomotion/drake/issues/18796#issue-1586876256 for another example of an uncaught typo.

jwnimmer-tri commented 1 year ago

Or on another tack, the libsdformat syntax checker and low-level DOM is actually quite good. With maybe a few days of work, we could encode (Drake's subset of) the URDF specification into that format, and then use libsdformat to do the syntax checking.

I did a prototype of this last night (branch).

The sdf::Element API is fine, but the readXml function (and the helpers it calls, like readAttributes) are completely unsuitable for re-use. Instead, I needed to re-implement the "map tinyxml into Element tree" function by hand, which really reduces the value we're getting from using the library.

Another idea we've broached was using a Relax NG schema to check the XML, but then parse the same as we do now. I'd be curious to know if there are any currently-maintained RNG implementations in C++ that we could try.

Another idea would be to write a Serialize-NVP-based mapper from tinyxml nodes to C++ structs. That would have approximately the same benefits at the sdf::Element API (schema checking and string=>double parsing), but without tying ourselves to a library that comes with too much baggage.

rpoyner-tri commented 1 year ago

Ah, schema checkers. I had thoughts about this a year and a half ago, but they've mostly drained away. I will see what I can recover.

Thanks for doing the architecture spike on the sdformat library hypothesis.

rpoyner-tri commented 1 year ago

20193 has the beginnings of a Relax NG compact syntax (RNC) schema. It doesn't yet fully check the `drake:` custom tags, but it could pretty easily do so.

Things I'm happy about so far:

the compact syntax is easy to write and fairly easy to read, with a little study. Think EBNF, regular expressions, stuff like that.
there seem to be multiple command line tools that do a good job of parsing, validating, reporting errors, etc.
The long-in-the-tooth, but recently maintained RNV project:
- has an RNC validator written in ANSI C with only Expat as a dependency. It would be not-so-hard to integrate it into Drake as a library we could call to do validation,
- it runs wicked fast and has good error messages.
I think it would be pretty easy to write some developer-facing documentation to explain available tooling, best practices, etc.

Things that need more study:

I'll have to figure out the right way to write permissive schemas, that allow the kinds of user hijinks we want to allow, and still are strict about other things
How do we support warned-about ("I'm ignoring this bit") tags, in conjunction with a schema? Perhaps a somewhat permissive schema, and some other (coordinated) technique to warn about stuff in the tinyxml pass?

jwnimmer-tri commented 1 year ago

Sounds promising!

rpoyner-tri commented 1 year ago

A quick look through the RNV project source code has me doubting whether it can be easily consumed as a library.

It pervasively uses the old-timey C technique of file-static data -- effectively private global variables.
To be usable as a library, those static variables would have to be audited, and some modules might need state reset functionality added.
Auditing might be tedious, since the coding style is extremely terse. Variable names are compressed in the extreme. Explanatory comments are rare.
Minor, but not to be overlooked: a widely-used memory allocation wrapper calls exit(1) if allocation fails.

A possible alternative would be to invoke the rnv binary executable as a child process.

jwnimmer-tri commented 1 year ago

For runtime validation by our Parser it sounds like RNV might not be the best fit. I like the RNG compact file for the offline schema, though.

It appears that there are tools to convert RNG to a DTD. As I understand it, in general not all RNG schema are expressible as DTD but I think for the simple things we have in URDF it would be? We could keep the RNG in git, and have Bazel convert it to a DTD. I presume that there are a lot more C++-compatible libraries to check the DTD, and we could lean on one of those instead.

rpoyner-tri commented 1 year ago

RNC/RNG can also be converted to XSD, for which we could also certainly find validators in existing libraries. (XSD ==> RNG is apparently not something that is done, owing to the complexity of the XSD spec.)

rpoyner-tri commented 1 year ago

It may turn out that the translation tools (bazel build time only, as I read the above proposal) might be implemented in Java(?!?!). I suspect we don't care if we can get system/Brew packages for them, but it's a detail to watch out for.

jwnimmer-tri commented 1 year ago

To the extent any schema converters are not as portable as we might wish, we could do the standard trick of committing the generated XSD (or etc) to our git tree, and then having a linter test that checks that the generated file is up-to-date. That way, only developers are exposed to the conversion tool, instead of end users.

rpoyner-tri commented 1 year ago

Probably then some combo of jing (command-line validator), trang (schema format converter), and validation library code from libxml2.

Brew offers a jing-trang package. Ubuntu has jing and trang as packages.

jwnimmer-tri commented 1 year ago

Anything we use for linting would be OK to come from a package manager, if we like. Anything in libdrake.so we'll be rebuilding from source (vendored) no matter what.

rpoyner-tri commented 1 year ago

Reflecting on the vendored-source bit, I wonder which we'd rather consume: libxml++/libxml2 stack (maybe big, but probably somewhat well-behaved), or rnv (tiny, but needing a variety of patches).

RobotLocomotion / drake

URDF parser: emit diagnostics for typos, malformed documents, etc. #16997

20193 has the beginnings of a Relax NG compact syntax (RNC) schema. It doesn't yet fully check the `drake:` custom tags, but it could pretty easily do so.

RobotLocomotion / drake

URDF parser: emit diagnostics for typos, malformed documents, etc. #16997

20193 has the beginnings of a Relax NG compact syntax (RNC) schema. It doesn't yet fully check the drake: custom tags, but it could pretty easily do so.

20193 has the beginnings of a Relax NG compact syntax (RNC) schema. It doesn't yet fully check the `drake:` custom tags, but it could pretty easily do so.