apache / drill

Apache Drill is a distributed MPP query layer for self describing data
https://drill.apache.org/
Apache License 2.0

DRILL-8453: Add XSD Support to XML Reader (Part 1) #2824

Closed cgivre closed 1 year ago

cgivre commented 1 year ago


Description

This PR is part of a series that adds better support for reading XML data to Drill. One of the main challenges is that XML data carries no type information, so there is no way to infer data types from it, nor is there a way to detect arrays.
The only way to do this really well is to have a schema. Some XML files link to a schema definition (XSD) file. This PR adds the capability for Drill to map XSD schema files into Drill schemas.
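
To illustrate why a schema helps, here is a minimal sketch (not the code in this PR) that reads an XSD with the JDK's DOM parser and derives a type name and an array flag for each typed element. The class `XsdTypeSketch`, the method names, and the type-name mapping are hypothetical and chosen only for illustration. The real reader in this PR may work quite differently.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of Drill.
class XsdTypeSketch {
  static final String XSD_NS = "http://www.w3.org/2001/XMLSchema";

  // Assumed mapping from a few XSD built-in types to Drill-style type names.
  static String toTypeName(String xsdType) {
    int colon = xsdType.indexOf(':');
    String local = colon >= 0 ? xsdType.substring(colon + 1) : xsdType;
    switch (local) {
      case "int":
      case "integer":
        return "INT";
      case "long":
        return "BIGINT";
      case "double":
        return "FLOAT8";
      case "boolean":
        return "BIT";
      case "date":
        return "DATE";
      default:
        return "VARCHAR"; // fall back to text for anything unrecognized
    }
  }

  // Returns element name -> type name, with "[]" appended for repeated elements.
  static Map<String, String> describe(String xsd) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    Document doc = factory.newDocumentBuilder()
        .parse(new InputSource(new StringReader(xsd)));

    NodeList elements = doc.getElementsByTagNameNS(XSD_NS, "element");
    Map<String, String> columns = new LinkedHashMap<>();
    for (int i = 0; i < elements.getLength(); i++) {
      Element element = (Element) elements.item(i);
      String type = element.getAttribute("type");
      if (type.isEmpty()) {
        continue; // elements with an anonymous complex type are skipped here
      }
      // maxOccurs > 1 (or "unbounded") is what marks an element as an array.
      String maxOccurs = element.getAttribute("maxOccurs");
      boolean repeated = "unbounded".equals(maxOccurs)
          || (!maxOccurs.isEmpty() && Integer.parseInt(maxOccurs) > 1);
      columns.put(element.getAttribute("name"),
          toTypeName(type) + (repeated ? "[]" : ""));
    }
    return columns;
  }
}
```

Given an XSD that declares `id` as `xs:int` and `item` as `xs:string` with `maxOccurs="unbounded"`, `XsdTypeSketch.describe` would report `id` as `INT` and `item` as `VARCHAR[]`, which is information the XML data alone does not provide.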

The current plan is as follows:

  1. Part 1 (this PR) simply adds the reader but adds no new user-detectable functionality.
  2. Part 2 will include the actual integration with the XML reader.
  3. Part 3 will include the ability to read arrays in the actual XML reader.

Documentation

No user-facing changes.

Testing

Added new unit tests.

cgivre commented 1 year ago

@mbeckerle

mbeckerle commented 1 year ago

Sorry, bogged down. Will review soon.

mbeckerle commented 1 year ago

I'm ok with merging this. It's still a bit of a work-in-progress (hence the Part 1).

Some TODOs in here are mine. I do intend to get to them, but no reason to hold up this change set for that.

I highly recommend that you squash these 15 commits together into one coherent commit rather than commit all 15 as is.

cgivre commented 1 year ago

@jnturton Are we good to go?

cgivre commented 1 year ago

@mbeckerle We always squash commits for Drill PRs :-) I think the TODOs are ok here since this is part 1.

cgivre commented 1 year ago

@jnturton I fixed imports. @mbeckerle I added one exception which removed a TODO.