apache / drill

Apache Drill is a distributed MPP query layer for self describing data
https://drill.apache.org/
Apache License 2.0

DRILL-8474: Adding Daffodil to Drill as a contrib. #2909

Open mbeckerle opened 5 months ago

mbeckerle commented 5 months ago

DRILL-8474: Adding Daffodil to Drill as a contrib.

This PR replaces https://github.com/apache/drill/pull/2836, which is now closed. The old PR was closed to retain its history and comments, while its numerous debug-related commits are squashed together into this PR.

Description

Requires Daffodil version 3.7.0 or higher.

New format-daffodil module created

Still uses absolute paths for the schemaFileURI, which is cheating: it would not work in a true distributed Drill environment.

We have yet to work out how Drill can provide access to DFDL schemas in XML form so that their include/import references can be resolved.

The input data stream is, however, being accessed in the proper Drill manner. Gunzip happened automatically. Nice.

Note: Fix boxed Boolean vs. boolean problem. Don't use boxed primitives in Format config objects.
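The note does not spell out the exact failure. One well-known hazard with boxed primitives is that a `Boolean` field can be `null`, and auto-unboxing a `null` throws a `NullPointerException`. A minimal sketch of that hazard follows; the class and field names are hypothetical and are not the actual Drill format config classes, and whether this was the specific problem hit here is an assumption.

```java
// Hypothetical config holders, for illustration only (not Drill classes).
final class BoxedConfig {
  final Boolean validationMode; // may be null if the option is omitted

  BoxedConfig(Boolean validationMode) {
    this.validationMode = validationMode;
  }
}

final class PrimitiveConfig {
  final boolean validationMode; // can never be null; defaults to false

  PrimitiveConfig(boolean validationMode) {
    this.validationMode = validationMode;
  }
}

final class BoxedBooleanDemo {
  /** Returns true if auto-unboxing the given value throws a NullPointerException. */
  static boolean unboxThrows(Boolean value) {
    try {
      boolean unboxed = value; // auto-unboxing calls value.booleanValue()
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    BoxedConfig boxed = new BoxedConfig(null);
    PrimitiveConfig primitive = new PrimitiveConfig(false);
    System.out.println("boxed unbox throws: " + unboxThrows(boxed.validationMode));
    System.out.println("primitive value: " + primitive.validationMode);
  }
}
```

With a primitive `boolean` field the "unset" state simply does not exist, so code that reads the field cannot fail this way.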

Tests show this works for data as complex as nested repeating sub-records.

These DFDL types are supported:

Documentation

TBD: the feature is still incomplete. It will require substantial user documentation.

Testing

See tests under src/test in the new daffodil contrib module.

shfshihuafeng commented 5 months ago

This fails its tests due to a maven checkstyle failure. It's complaining about Drill:Exec:Vectors, which my code has no changes to.

Can someone advise on what is wrong here?

/home/runner/work/drill/drill/exec/vector/src/main/java/org/apache/drill/exec/record/metadata/MapBuilder.java:201:5: you should use '{} for if' construct

if (Objects.isNull(parent)) {
  throw new IllegalStateException("Call to resume() on MapBuilder with no parent.");
}

mbeckerle commented 5 months ago

Tests are now failing due to these two things in TestDaffodilReader.scala

  String schemaURIRoot = "file:///opt/drill/contrib/format-daffodil/src/test/resources/";

That's an absolute URI that is used to obtain access to the schema files in this statement:

  private String selectRow(String schema, String file) {
    return "SELECT * FROM table(dfs.`data/" + file + "` " + " (type => 'daffodil'," + " " +
        "validationMode => 'true', " + " schemaURI => '" + schemaURIRoot + "schema/" + schema +
        ".dfdl.xsd'," + " rootName => 'row'," + " rootNamespace => null " + "))";
  }

This is assembling a select statement, and puts this absolute schemaURI into the schemaURI part of the select.
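To make the shape of the generated query easier to see, here is a rough equivalent of the concatenation written with `String.format`. It is a sketch only: whitespace differs slightly from the original, and the schema name `simple` and data file `data01.dat.gz` are hypothetical placeholders rather than files known to be in the test resources.

```java
// Hypothetical sketch: same parameters as selectRow above, built with String.format.
final class DaffodilQuerySketch {
  // Absolute URI root as quoted in the comment above.
  static final String SCHEMA_URI_ROOT =
      "file:///opt/drill/contrib/format-daffodil/src/test/resources/";

  static String selectRow(String schema, String file) {
    String schemaURI = SCHEMA_URI_ROOT + "schema/" + schema + ".dfdl.xsd";
    return String.format(
        "SELECT * FROM table(dfs.`data/%s` (type => 'daffodil', validationMode => 'true', "
            + "schemaURI => '%s', rootName => 'row', rootNamespace => null))",
        file, schemaURI);
  }

  public static void main(String[] args) {
    // "simple" and "data01.dat.gz" are placeholder names for illustration.
    System.out.println(selectRow("simple", "data01.dat.gz"));
  }
}
```

Printing the result shows that the absolute `file:///opt/drill/...` URI ends up embedded in the SQL text itself, which is why the test only passes when the checkout lives at that exact location.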

What should I be doing to arrange for these schema URIs to be found?

The schemas are a large, complex set of files, not just a single file. Many files (potentially hundreds) must be found relative to the initial root schema file, because they include/import other schema files using relative paths.
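The relative lookup described here can be illustrated with plain `java.net.URI` resolution. This is a sketch of the general mechanism only, not of how Daffodil or Drill actually resolves schema locations; the file names `simple.dfdl.xsd`, `types/common.dfdl.xsd`, and `../shared/base.dfdl.xsd` are hypothetical.

```java
import java.net.URI;

// Hypothetical illustration of relative include/import resolution.
final class SchemaUriResolution {
  /** Resolves a relative schemaLocation against the URI of the schema that includes it. */
  static URI resolveInclude(URI includingSchema, String schemaLocation) {
    return includingSchema.resolve(schemaLocation);
  }

  public static void main(String[] args) {
    // Root schema under the absolute test-resources URI quoted above; file name is a placeholder.
    URI root = URI.create(
        "file:///opt/drill/contrib/format-daffodil/src/test/resources/schema/simple.dfdl.xsd");

    // A sibling-directory include and a parent-directory import, both relative to the root.
    System.out.println(resolveInclude(root, "types/common.dfdl.xsd").getPath());
    System.out.println(resolveInclude(root, "../shared/base.dfdl.xsd").getPath());
  }
}
```

Because every include/import is resolved against the URI of the file that contains it, whatever mechanism supplies the root schema also has to make the whole surrounding directory tree reachable under the same base URI.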

cgivre commented 5 months ago

Hi Mike, Are you free at all this week? My apologies... We're in the middle of putting an offer on a house and my life is very hectic at the moment. Best, -- C


mbeckerle commented 5 months ago

> Hi Mike, Are you free at all this week? My apologies... We're in the middle of putting an offer on a house and my life is very hectic at the moment. Best, -- C

Lots of availability. I'll send you separate email.