apache/drill

Apache Drill is a distributed MPP query layer for self-describing data
https://drill.apache.org/
Apache License 2.0

DRILL-8474: Add Daffodil Format Plugin #2836

Closed: mbeckerle closed this pull request 3 months ago

mbeckerle commented 9 months ago

Adding Daffodil to Drill as a 'contrib'

Requires Daffodil 3.7.0-SNAPSHOT which has metadata support we're using.

New format-daffodil module created

Still uses absolute paths for the schemaFileURI (which is cheating; it wouldn't work in a true distributed Drill environment).

We have yet to work out how to enable Drill to provide access to DFDL schemas in XML form so that include/import can be resolved.

The input data stream is, however, being accessed in the proper Drill manner. Gunzip happened automatically. Nice.

Note: Fix boxed Boolean vs. boolean problem. Don't use boxed primitives in Format config objects.
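One way to read that note, as a hedged sketch with a hypothetical validateData property (not the plugin's actual config field): accept the boxed Boolean only at the Jackson boundary and normalize it to a primitive immediately, so the rest of the plugin never sees a null.

```java
// Hedged illustration only: a hypothetical format-config property showing why
// boxed primitives are awkward here. A Boolean deserialized from JSON can be
// null when the user omits the property; storing a primitive with an explicit
// default removes that whole class of bugs.
import com.fasterxml.jackson.annotation.JsonProperty;

public class ExampleFormatConfig {
  private final boolean validateData;   // store the primitive, not Boolean

  public ExampleFormatConfig(@JsonProperty("validateData") Boolean validateData) {
    this.validateData = validateData != null && validateData;  // null means false
  }

  public boolean getValidateData() {
    return validateData;
  }
}
```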

Tests show Daffodil works for data as complex as having nested repeating sub-records.

These DFDL types are supported:

https://github.com/apache/drill/issues/2835

cgivre commented 9 months ago

@mbeckerle Looks like you're making good progress!

mbeckerle commented 7 months ago

This is pretty much working now, in terms of constructing drill metadata from DFDL schemas, and Daffodil delivering data to Drill.

There were dozens of commits to get here, so I squashed them as they were no longer helpful.

Obviously more tests are needed, but the ones there show nested subrecords working.

The issues like how schemas get distributed, and how Daffodil gets invoked in parallel by drill are still open.

mbeckerle commented 7 months ago

Rebased onto latest Drill master as of 2023-12-21 (force pushed one more time)

Note that this is never going to pass automated tests until the Daffodil release this depends on is official (currently it needs a locally built Daffodil 3.7.0-SNAPSHOT, though the main Daffodil branch has the changes integrated, so any 3.7.0-SNAPSHOT build will work).

cgivre commented 7 months ago

Rebased onto latest Drill master as of 2023-12-21 (force pushed one more time)

Note that this is never going to pass automated tests until the Daffodil release this depends on is official (currently it needs a locally built Daffodil 3.7.0-SNAPSHOT, though the main Daffodil branch has the changes integrated, so any 3.7.0-SNAPSHOT build will work).

@mbeckerle This is really great work! Thanks for your persistence on this. Do you have an ETA on the next Daffodil release?

mbeckerle commented 7 months ago

Rebased onto latest Drill master as of 2023-12-21 (force pushed one more time). Note that this is never going to pass automated tests until the Daffodil release this depends on is official (currently it needs a locally built Daffodil 3.7.0-SNAPSHOT, though the main Daffodil branch has the changes integrated, so any 3.7.0-SNAPSHOT build will work).

@mbeckerle This is really great work! Thanks for your persistence on this. Do you have an ETA on the next Daffodil release?

We could have a Daffodil release in Jan or Feb. There are some Daffodil API cleanups that need to be discussed that would provide better stability for this Drill integration ... we may want to wait for those and update this to use them.

cgivre commented 7 months ago

Rebased onto latest Drill master as of 2023-12-21 (force pushed one more time). Note that this is never going to pass automated tests until the Daffodil release this depends on is official (currently it needs a locally built Daffodil 3.7.0-SNAPSHOT, though the main Daffodil branch has the changes integrated, so any 3.7.0-SNAPSHOT build will work).

@mbeckerle This is really great work! Thanks for your persistence on this. Do you have an ETA on the next Daffodil release?

We could have a Daffodil release in Jan or Feb. There are some Daffodil API cleanups that need to be discussed that would provide better stability for this Drill integration ... we may want to wait for those and update this to use them.

@mbeckerle So is the next step really to figure out how to access the Daffodil files from a potentially distributed environment?

mbeckerle commented 6 months ago

@cgivre yes, the next architectural-level issue is how to get a compiled DFDL schema out to everyplace Drill will run a Daffodil parse. Every one of those JVMs needs to reload it.

I'll do the various cleanups and such. The one issue I don't know how to fix is the "typed setter" vs. (set-object) issue, so if you could steer me in the right direction on that it would help.

paul-rogers commented 6 months ago

Hi Mike,

Just jumping in with a random thought. Drill has accumulated a number of schema systems: Parquet metadata cache, HMS, Drill's own metastore, "provided schema", and now DFDL. All provide ways of defining data: be it Parquet, JSON, CSV or whatever. One can't help but wonder, should some future version try to reduce this variation somewhat? Maybe map all the variations to DFDL? Map DFDL to Drill's own mechanisms?

Drill uses two kinds of metadata: schema definitions and file metadata used for scan pruning. Schema information could be used at plan time (to provide column types), but certainly at scan time (to "discover" the defined schema.) File metadata is used primarily at plan time to work out how to distribute work.

A bit of background on scan pruning. Back in the day, it was common to have thousands or millions of files in Hadoop to scan: this was why tools like Drill were distributed: divide and conquer. And, of course, the fastest scan is to skip files that we know can't contain the information we want. File metadata captures this information outside of the files themselves. HMS was the standard solution in the Hadoop days. (Amazon Glue, for S3, is evidently based on HMS.)

For example, Drill's Parquet metadata cache, the Drill metastore and HMS all provide both schema and file metadata information. The schema information mainly helped with schema evolution: over time, different files have different sets of columns. File metadata provides information about the file, such as the data ranges stored in each file. For Parquet, we might track that '2023-01-Boston.parquet' has data from the office='Boston' range. (So, no use scanning the file for office='Austin'.) And so on.

With Hadoop HDFS, it was customary to use directory structure as a partial primary index: our file above would live in the /sales/2023/01 directory, for example, and logic chooses the proper set of directories to scan. In Drill, it is up to the user to add crufty conditionals on the path name. In Impala, and other HMS-aware tools, the user just says WHERE order_year = 2023 AND order_month = 1, and HMS tells the tool that the order_year and order_month columns translate to such-and-so directory paths. Would be nice if Drill could provide that feature as well, given the proper file metadata: in this case, the mapping of column names to path directories and file names.

Does DFDL provide only schema information? Does it support versioning so that we know that "old.csv" lacks the "version" column, while "new.csv" includes that column? Does it also include the kinds of file metadata mentioned above?

Or, perhaps DFDL is used in a different context in which the files have a fixed schema and are small in number? This would fit well the "desktop analytics" model that Charles and James suggested is where Drill is now most commonly used.

The answers might suggest whether DFDL can be the universal data description, or whether DFDL applies just to individual file schemas, and Drill would still need a second system to track schema evolution and file metadata for large deployments.

Further, if DFDL is kind of a stand-alone thing, with its own reader, then we end up with more complexity: the Drill JSON reader and the DFDL JSON reader. Same for CSV, etc. JSON is so complex that we'd find ourselves telling people that the quirks work one way with the native reader, another way with DFDL. Plus, the DFDL readers might not handle file splits the same way, or support the same set of formats that Drill's other readers support, and so on. It would be nice to separate the idea of schema description from reader implementation, so that DFDL can be used as a source of schema for any arbitrary reader: both at plan and scan times.

If DFDL uses its own readers, then we'd need DFDL reader representations in Calcite, which would pick up DFDL schemas so that the schemas are reliably serialized out to each node as part of the physical plan. This is possible, but it does send us down the two-readers-for-every-format path.

On the other hand, if DFDL mapped to Drill's existing schema description, then DFDL could be used with our existing readers and there would be just one schema description sent to readers: Drill's existing provided schema format that EVF can already consume. At present, just a few formats support provided schema in the Calcite layer: CSV for sure, maybe JSON?

Any thoughts on where this kind of thing might evolve with DFDL in the picture?

Thanks,


mbeckerle commented 6 months ago

Let me respond between the paragraphs....

On Tue, Jan 2, 2024 at 11:49 PM Paul Rogers wrote:

Hi Mike,

Just jumping in with a random thought. Drill has accumulated a number of schema systems: Parquet metadata cache, HMS, Drill's own metastore, "provided schema", and now DFDL. All provide ways of defining data: be it Parquet, JSON, CSV or whatever. One can't help but wonder, should some future version try to reduce this variation somewhat? Maybe map all the variations to DFDL? Map DFDL to Drill's own mechanisms?

Well, we can dream, can't we? :-)

I can contribute the ideas in https://daffodil.apache.org/dev/design-notes/Proposed-DFDL-Standard-Profile.md which is an effort to restrict the DFDL language so that schemas written in DFDL can work more smoothly with Drill, NiFi, Spark, Flink, Beam, etc. etc.

DFDL's data model is too restrictive to be "the model" for Drill since Drill wants to query even unstructured data like XML without schema. DFDL's data model is targeted only at structured data.

Drill's data model and APIs seem optimized for streaming block-buffered top-level rows of data (the EVF API does anyway). Top level row-sets are first-class citizens, as are the fields of said rows. Fields containing arrays of maps (possibly containing more arrays of maps, and so on deeply nested) are not handled uniformly with the same block-buffered "row-like" mechanisms. The APIs are similar, but not polymorphic. I suspect that the block-buffered data streaming in Drill only happens for top-level rows, because there is no test for whether or not you are allowed to create another array item like there is a test for creating another row in a row-set writer. There is no control inversion where an adapter must give back control to Drill in the middle of trying to write an array.
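A hedged sketch of the top-level EVF pattern described above, with a hypothetical RecordSource standing in for a parser (Drill EVF class and method names as commonly used in batch readers; treat it as an approximation): the isFull() check is the row-level control inversion, and there is no per-array-item equivalent.

```java
// Hedged sketch of the top-level EVF reader loop described above, simplified
// from what real Drill batch readers do. The point: isFull() hands control
// back to Drill between top-level rows, but there is no analogous
// "may I add another item?" check once a writer descends into a nested array.
import org.apache.drill.exec.physical.resultSet.RowSetLoader;

public class EvfRowLoopSketch {

  // Hypothetical record source, just to keep the sketch self-contained.
  public interface RecordSource {
    boolean hasNext();
    void writeNext(RowSetLoader rowWriter);  // fills columns, possibly nested maps/arrays
  }

  /** Returns false when input is exhausted, true when the current batch is full. */
  public boolean fillBatch(RowSetLoader rowWriter, RecordSource source) {
    while (!rowWriter.isFull()) {    // row-level back pressure: Drill decides when to stop
      if (!source.hasNext()) {
        return false;                // end of input
      }
      rowWriter.start();             // begin one top-level row
      source.writeNext(rowWriter);   // nested structures are written without such a check
      rowWriter.save();              // commit the row
    }
    return true;                     // batch full; Drill will ask for another batch
  }
}
```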

The current Drill/Daffodil interface I've created doesn't cope with header-body* files (e.g., PCAP, a format with a header record followed by repeating packet records), as it has no way of returning just the body records as top-level rows. So while there exists a DFDL schema for PCAP, you really do want to use a dedicated PCAP Drill adapter that hands back rows, not Daffodil, which will parse the entire PCAP file into one huge row containing a monster sub-array of packets, where each packet is a map within the array of maps. This is OK for now, as many files where DFDL is used are not like PCAP. They are just repeating records of one format with no special whole-file header. Eventually we will want to be able to supply a path to tell the Drill/Daffodil interface that you only want the packet array as the output rows. (This is the unimplemented Daffodil "onPath(...)" API feature. We haven't needed this yet for DFDL work in cybersecurity, but it was anticipated 10+ years back as essential for data integration.)

Drill uses two kinds of metadata: schema definitions and file metadata used for scan pruning. Schema information could be used at plan time (to provide column types), but certainly at scan time (to "discover" the defined schema.) File metadata is used primarily at plan time to work out how to distribute work.

DFDL has zero notion of file metadata. It doesn't know whether data even comes from a file or an open TCP socket. Daffodil/DFDL just sees a java.io.InputStream. The schema it uses for a given file is specified by the API call. Daffodil does nothing itself to try to find or identify any schema.

So we're "blank slate" on this issue with DFDL.

A bit of background on scan pruning. Back in the day, it was common to have thousands or millions of files in Hadoop to scan: this was why tools like Drill were distributed: divide and conquer. And, of course, the fastest scan is to skip files that we know can't contain the information we want. File metadata captures this information outside of the files themselves. HMS was the standard solution in the Hadoop days. (Amazon Glue, for S3, is evidently based on HMS.)

For example, Drill's Parquet metadata cache, the Drill metastore and HMS all provide both schema and file metadata information. The schema information mainly helped with schema evolution: over time, different files have different sets of columns. File metadata provides information about the file, such as the data ranges stored in each file. For Parquet, we might track that '2023-01-Boston.parquet' has data from the office='Boston' range. (So, no use scanning the file for office='Austin'.) And so on.

With Hadoop HDFS, it was customary to use directory structure as a partial primary index: our file above would live in the /sales/2023/01 directory, for example, and logic chooses the proper set of directories to scan. In Drill, it is up to the user to add crufty conditionals on the path name. In Impala, and other HMS-aware tools, the user just says WHERE order_year = 2023 AND order_month = 1, and HMS tells the tool that the order_year and order_month columns translate to such-and-so directory paths. Would be nice if Drill could provide that feature as well, given the proper file metadata: in this case, the mapping of column names to path directories and file names.

The above all makes perfect sense to me, and DFDL schemas are completely orthogonal to this. If a file naming convention tells Drill that it doesn't need to open and parse some data using Daffodil, great, then Drill will not invoke Daffodil to do so.

DFDL/Daffodil doesn't know nor care about this.

Does DFDL provide only schema information? Does it support versioning so that we know that "old.csv" lacks the "version" column, while "new.csv" includes that column? Does it also include the kinds of file metadata mentioned above?

DFDL only provides structural schema information.

Data formats do versioning in a wide variety of ways, so DFDL can't take any position on how this is done, but many DFDL schemas adapt to multiple versions of the data formats they describe based on the existence of different fields or values of those fields. This can only work for formats where there are data fields that identify the versions.

But nothing based on file metadata.

Or, perhaps DFDL is used in a different context in which the files have a fixed schema and are small in number? This would fit well the "desktop analytics" model that Charles and James suggested is where Drill is now most commonly used.

The cybersecurity use case is one of the prime motivators for DFDL work.

Often the cyber gateways are file movers: files arrive spontaneously in various locations and are moved across the cyber boundary. The use cases continue to grow in scale, and some people use Apache NiFi with DFDL for such large-scale file moving.

Unlike Drill, these use cases all parse and then re-serialize the data after extensive validation and rule-based filtering.

The same sort of file-metadata-based logic (e.g., rules like "all the files in this directory named X with extension '.dat' use schema S") applies in the cyber-gateway use case as well.

Apache Daffodil doesn't know anything about this cyber use case however, nor anything about data integration. Daffodil is actually a quite narrow library. Stays in its lane.

The answers might suggest whether DFDL can be the universal data description, or whether DFDL applies just to individual file schemas, and Drill would still need a second system to track schema evolution and file metadata for large deployments.

Yeah. Drill needs a separate system for this. Not at all a DFDL-specific issue. DFDL/Daffodil take no position on schema evolution.

However, to Daffodil devs, a DFDL schema is basically source code. We keep them in git. They have releases. We package them in jars and use managed dependency tools to grab them from repositories the same way java code jars are grabbed by maven.

One of my concerns about metadata repositories/registries is that they are not thought of as configuration management systems. But DFDL schemas are certainly large formal objects that require configuration management.

For example, the VMF schema we have is over 180K lines of DFDL "code", spread over hundreds of files. It is actually an assembly composed of specific versions of 4 different smaller DFDL schemas and the large corpus of VMF-specific schema files. There is documentation, analysis reports, etc. that go along with it.

So some sort of repository that makes specific schemas available to Drill makes sense, but cannot be confused with the configuration management system.

I quite literally just got a Maven Central/Sonatype account yesterday so that I can push some DFDL schemas up to Maven Central so they can be reused from there via jars.

Further, if DFDL is kind of a stand-alone thing, with its own reader, then we end up with more complexity: the Drill JSON reader and the DFDL JSON reader. Same for CSV, etc. JSON is so complex that we'd find ourselves telling people that the quirks work one way with the native reader, another way with DFDL. Plus, the DFDL readers might not handle file splits the same way,

Daffodil knows no concept of "file splits". It doesn't even know about files, actually. It's just an input byte stream, literally a java.io.InputStream.

or support the same set of formats that Drill's other readers support, and so on. It would be nice to separate the idea of schema description from reader implementation, so that DFDL can be used as a source of schema for any arbitrary reader: both at plan and scan times.

The DFDL/Drill integration converts DFDL-described data directly to Drill with no intermediate form like XML or JSON. One hop. E.g.,

drillScalaWriter.setInt(daffodilInfosetElement.getInt());

There is no notion of Daffodil "also" reading JSON. You wouldn't parse JSON with DFDL typically. You would use a JSON library and hopefully a JSON schema that describes the JSON. Ditto for XML, Google protocol buffers, Avro, etc.
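To illustrate the one-hop copy, a hedged sketch: Drill's EVF ScalarWriter typed setters on one side, and placeholder accessors standing in for Daffodil's metadata-aware infoset API on the other (the Daffodil-side names here are illustrative, not the actual 3.7.0 API).

```java
// Hedged sketch of the "one hop" conversion described above. The Drill side is
// EVF's ScalarWriter; the Daffodil side uses placeholder types and accessor
// names standing in for the real metadata-aware infoset API.
import org.apache.drill.exec.vector.accessor.ScalarWriter;

public class OneHopCopySketch {

  // Placeholder for the DFDL simple-type kind carried by Daffodil metadata.
  public enum DfdlKind { INT, LONG, DOUBLE, STRING }

  // Placeholder for a Daffodil infoset simple element with typed accessors.
  public interface SimpleElement {
    DfdlKind kind();
    int getInt();
    long getLong();
    double getDouble();
    String getString();
  }

  // One typed setter call per value; no intermediate XML/JSON representation.
  public static void copy(SimpleElement e, ScalarWriter w) {
    switch (e.kind()) {
      case INT:    w.setInt(e.getInt());       break;
      case LONG:   w.setLong(e.getLong());     break;
      case DOUBLE: w.setDouble(e.getDouble()); break;
      case STRING: w.setString(e.getString()); break;
    }
  }
}
```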

If DFDL uses its own readers, then we'd need DFDL reader representations in

DFDL is a specific reader; this notion of "its own readers" doesn't apply.

Calcite, which would pick up DFDL schemas so that the schemas are reliably serialized out to each node as part of the physical plan. This is possible, but it does send us down the two-readers-for-every-format path.

On the other hand, if DFDL mapped to Drill's existing schema description,

then DFDL could be used with our existing readers

I don't get "DFDL used with existing readers"... by "with" do you mean "alongside" or "incorporating"?

and there would be just one schema description sent to readers: Drill's existing provided schema format that EVF can already consume. At present, just a few formats support provided schema in the Calcite layer: CSV for sure, maybe JSON?

This is what we need. The Daffodil/Drill integration walks DFDL metadata and creates Drill metadata 100% in advance and this should, I think, automatically find its way to all the right places without anything else being needed beyond today's Drill behavior.

But besides Drill's metadata, the Daffodil execution at each node needs to load up the compiled DFDL schema. That object, which can be several megabytes, needs to find its way out to all the nodes that need it. I have no idea yet how we make that happen.


mbeckerle commented 6 months ago

This is ready for a next review. All the scalar types are now implemented with typed setter calls.

The prior review comments have all been addressed I believe.

Remaining things to do include:

  1. How to get the compiled DFDL schema object so it can be loaded by daffodil out at the distributed Drill nodes.
  2. Test of nilled values (and more tests generally to show deeply nested and repeating nested objects work.)
  3. Errors - revisit every place errors are detected or thrown to make sure these are being done the right way for DFDL schema compilation and runtime errors as well.

cgivre commented 6 months ago

@mbeckerle I had a thought about your TODO list. See inline.

This is ready for a next review. All the scalar types are now implemented with typed setter calls.

The prior review comments have all been addressed I believe.

Remaining things to do include:

  1. How to get the compiled DFDL schema object so it can be loaded by daffodil out at the distributed Drill nodes.

I was thinking about this and I remembered something that might be useful. Drill has support for User Defined Functions (UDF) which are written in Java. To add a UDF to Drill, you also have to write some Java classes in a particular way, and include the JARs. Much like the DFDL class files, the UDF JARs must be accessible to all nodes of a Drill cluster.

Additionally, Drill has the capability of adding UDFs dynamically. This feature was added here: https://github.com/apache/drill/pull/574. Anyway, I wonder if we could use a similar mechanism to load and store the DFDL files so that they are accessible to all Drill nodes. What do you think?

  1. Test of nilled values (and more tests generally to show deeply nested and repeating nested objects work.)
  2. Errors - revisit every place errors are detected or thrown to make sure these are being done the right way for DFDL schema compilation and runtime errors as well.

mbeckerle commented 6 months ago

@mbeckerle I had a thought about your TODO list. See inline.

This is ready for a next review. All the scalar types are now implemented with typed setter calls. The prior review comments have all been addressed I believe. Remaining things to do include:

  1. How to get the compiled DFDL schema object so it can be loaded by daffodil out at the distributed Drill nodes.

I was thinking about this and I remembered something that might be useful. Drill has support for User Defined Functions (UDF) which are written in Java. To add a UDF to Drill, you also have to write some Java classes in a particular way, and include the JARs. Much like the DFDL class files, the UDF JARs must be accessible to all nodes of a Drill cluster.

Additionally, Drill has the capability of adding UDFs dynamically. This feature was added here: #574. Anyway, I wonder if we could use a similar mechanism to load and store the DFDL files so that they are accessible to all Drill nodes. What do you think?

Excellent: so Drill has all the machinery; it's just a question of repackaging it so it's available for this usage pattern, which is a bit different from Drill's UDFs, but also very similar.

There are two user scenarios which we can call production and test.

  1. Production: binary compiled DFDL schema file + code jars for Daffodil's own UDFs and "layers" plugins. This should, ideally, cache the compiled schema and not reload it for every query (at every node), but keep the same loaded instance in memory in a persistent JVM image on each node. For large production DFDL schemas this is the only sensible mechanism as it can take minutes to compile large DFDL schemas.

  2. Test: on-the-fly centralized compilation of DFDL schema (from a combination of jars and files) to create and cache (to avoid recompiling) the binary compiled DFDL schema file. Then use that compiled binary file as in item 1. For small DFDL schemas this can be fast enough for production use. Ideally, if the DFDL schema is unchanged this would reuse the compiled binary file, but that's an optimization that may not matter much.

Kinds of objects involved are:

Code jars: Daffodil provides two extension features for DFDL users - DFDL UDFs and DFDL 'layers' (ex: plug-ins for uudecode, or gunzip algorithms used in part of the data format). Those are ordinary compiled class files in jars, so in all scenarios those jars are needed on the node class path if the DFDL schema uses them. Daffodil dynamically finds and loads these from the classpath in regular Java Service-Provider Interface (SPI) mechanisms.
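For readers unfamiliar with the SPI mechanism mentioned above, a minimal sketch of the lookup pattern (the extension interface here is hypothetical, not Daffodil's actual UDF/layer plugin API):

```java
// Minimal sketch of the Java Service-Provider Interface lookup described above.
// The extension interface is hypothetical; the mechanism (META-INF/services
// entries discovered from jars on the classpath) is the standard JDK one.
import java.util.ServiceLoader;

public class SpiLookupSketch {

  // Hypothetical plugin contract standing in for a Daffodil UDF/layer interface.
  public interface LayerExtension {
    String name();
  }

  public static void main(String[] args) {
    // ServiceLoader reads META-INF/services/<interface-name> from each jar on
    // the classpath and instantiates the implementations listed there.
    ServiceLoader<LayerExtension> loader = ServiceLoader.load(LayerExtension.class);
    for (LayerExtension ext : loader) {
      System.out.println("found extension: " + ext.name());
    }
  }
}
```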

Schema jars: Daffodil packages DFDL schema files (source files, i.e., mySchema.dfdl.xsd) into jar files to allow inter-schema dependencies to be managed using ordinary jar/java-style managed dependencies. Tools like sbt and maven can express the dependencies of one schema on another, grab and pull them together, etc. Daffodil has a resolver so when one schema file references another with include/import it searches the class path directories and jars for the files.

Schema jars are only needed centrally when compiling the schema to a binary file. All references to the jar files for inter-schema file references are compiled into the compiled binary file.

It is possible for one DFDL schema 'project' to define a DFDL schema, along with the code for a plugin like a Daffodil UDF or layer. In that case the one jar created is both a code jar and a schema jar. The schema jar aspects are used when the schema is compiled and ignored at Daffodil runtime. The code jar aspects are used at Daffodil run time and ignored at schema compilation time. So such a jar that is both code and schema jar needs to be on the class path in both places, but there's no interaction of the two things.

Binary Compiled Schema File: Centrally, DFDL schemas in files and/or jars are compiled to create a single binary object which can be reloaded in order to actually use the schema to parse/unparse data.
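A hedged sketch of that compile-centrally / reload-at-each-node flow, using Daffodil's Java API as of recent 3.x releases (file names are hypothetical and diagnostics handling is omitted):

```java
// Hedged sketch: compile a DFDL schema once, save the compiled binary, and
// reload it elsewhere. Uses org.apache.daffodil.japi; paths are hypothetical
// and error/diagnostic checking is omitted for brevity.
import java.io.File;
import java.io.FileOutputStream;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

import org.apache.daffodil.japi.Compiler;
import org.apache.daffodil.japi.Daffodil;
import org.apache.daffodil.japi.DataProcessor;
import org.apache.daffodil.japi.ProcessorFactory;

public class CompileAndReloadSketch {
  public static void main(String[] args) throws Exception {
    Compiler c = Daffodil.compiler();

    // Central step: compile the schema (can take minutes for large DFDL schemas).
    ProcessorFactory pf = c.compileFile(new File("mySchema.dfdl.xsd"));
    DataProcessor dp = pf.onPath("/");

    // Save the compiled form: the "binary compiled schema file" described above.
    try (FileOutputStream fos = new FileOutputStream("mySchema.bin")) {
      WritableByteChannel ch = Channels.newChannel(fos);
      dp.save(ch);
    }

    // Node-side step: reloading is fast, and the reloaded DataProcessor is
    // read-only, so one cached instance can be shared across parse threads.
    DataProcessor reloaded = c.reload(new File("mySchema.bin"));
    System.out.println("reloaded compiled schema: " + reloaded);
  }
}
```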

Daffodil Config File: This contains settings like what warnings to suppress when compiling and/or at runtime, tunables, such as how large to allow a regex match attempt, maximum parsed data size limit, etc. This also is needed both at schema compile and at runtime, as the same file contains parameters for both DFDL schema compile time and runtime.

cgivre commented 6 months ago

@mbeckerle With respect to style, I tried to reply to that comment, but the thread won't let me. In any event, Drill classes will typically start with the constructor, then have whatever methods are appropriate for the class. The logger creation usually happens before the constructor. I think all of your other classes followed this format, so the one or two that didn't kind of jumped out at me.

cgivre commented 6 months ago

@mbeckerle I had a thought about your TODO list. See inline.

This is ready for a next review. All the scalar types are now implemented with typed setter calls. The prior review comments have all been addressed I believe. Remaining things to do include:

  1. How to get the compiled DFDL schema object so it can be loaded by daffodil out at the distributed Drill nodes.

I was thinking about this and I remembered something that might be useful. Drill has support for User Defined Functions (UDF) which are written in Java. To add a UDF to Drill, you also have to write some Java classes in a particular way, and include the JARs. Much like the DFDL class files, the UDF JARs must be accessible to all nodes of a Drill cluster. Additionally, Drill has the capability of adding UDFs dynamically. This feature was added here: #574. Anyway, I wonder if we could use a similar mechanism to load and store the DFDL files so that they are accessible to all Drill nodes. What do you think?

Excellent: so Drill has all the machinery; it's just a question of repackaging it so it's available for this usage pattern, which is a bit different from Drill's UDFs, but also very similar.

There are two user scenarios which we can call production and test.

  1. Production: binary compiled DFDL schema file + code jars for Daffodil's own UDFs and "layers" plugins. This should, ideally, cache the compiled schema and not reload it for every query (at every node), but keep the same loaded instance in memory in a persistent JVM image on each node. For large production DFDL schemas this is the only sensible mechanism as it can take minutes to compile large DFDL schemas.
  2. Test: on-the-fly centralized compilation of DFDL schema (from a combination of jars and files) to create and cache (to avoid recompiling) the binary compiled DFDL schema file. Then use that compiled binary file as in item 1. For small DFDL schemas this can be fast enough for production use. Ideally, if the DFDL schema is unchanged this would reuse the compiled binary file, but that's an optimization that may not matter much.

Kinds of objects involved are:

  • Daffodil plugin code jars
  • DFDL schema jars
  • DFDL schema files (just not packaged into a jar)
  • Daffodil compiled schema binary file
  • Daffodil config file - parameters, tunables, and options needed at compile time and/or runtime

Code jars: Daffodil provides two extension features for DFDL users - DFDL UDFs and DFDL 'layers' (ex: plug-ins for uudecode, or gunzip algorithms used in part of the data format). Those are ordinary compiled class files in jars, so in all scenarios those jars are needed on the node class path if the DFDL schema uses them. Daffodil dynamically finds and loads these from the classpath in regular Java Service-Provider Interface (SPI) mechanisms.

Schema jars: Daffodil packages DFDL schema files (source files, i.e., mySchema.dfdl.xsd) into jar files to allow inter-schema dependencies to be managed using ordinary jar/java-style managed dependencies. Tools like sbt and maven can express the dependencies of one schema on another, grab and pull them together, etc. Daffodil has a resolver so when one schema file references another with include/import it searches the class path directories and jars for the files.

Schema jars are only needed centrally when compiling the schema to a binary file. All references to the jar files for inter-schema file references are compiled into the compiled binary file.

It is possible for one DFDL schema 'project' to define a DFDL schema, along with the code for a plugin like a Daffodil UDF or layer. In that case the one jar created is both a code jar and a schema jar. The schema jar aspects are used when the schema is compiled and ignored at Daffodil runtime. The code jar aspects are used at Daffodil run time and ignored at schema compilation time. So such a jar that is both code and schema jar needs to be on the class path in both places, but there's no interaction of the two things.

Binary Compiled Schema File: Centrally, DFDL schemas in files and/or jars are compiled to create a single binary object which can be reloaded in order to actually use the schema to parse/unparse data.

  • These binary files are tied to a specific version+build of Daffodil. (They are just a java object serialization of the runtime data structures used by Daffodil).
  • Once reloaded into a JVM to create a Daffodil DataProcessor object, that object is read-only so thread safe, and can be shared by parse calls happening on many threads.

Daffodil Config File: This contains settings like what warnings to suppress when compiling and/or at runtime, tunables, such as how large to allow a regex match attempt, maximum parsed data size limit, etc. This also is needed both at schema compile and at runtime, as the same file contains parameters for both DFDL schema compile time and runtime.

@mbeckerle Would you want to chat sometime next week and I can walk you through the UDF architecture? I don't know how relevant it would be, but you'd at least see how things are installed and so forth.

mbeckerle commented 6 months ago

@mbeckerle With respect to style, I tried to reply to that comment, but the thread won't let me. In any event, Drill classes will typically start with the constructor, then have whatever methods are appropriate for the class. The logger creation usually happens before the constructor. I think all of your other classes followed this format, so the one or two that didn't kind of jumped out at me.

@cgivre I believe the style issues are all fixed. The build did not report any codestyle issues.

cgivre commented 6 months ago

@mbeckerle With respect to style, I tried to reply to that comment, but the thread won't let me. In any event, Drill classes will typically start with the constructor, then have whatever methods are appropriate for the class. The logger creation usually happens before the constructor. I think all of your other classes followed this format, so the one or two that didn't kind of jumped out at me.

@cgivre I believe the style issues are all fixed. The build did not report any codestyle issues.

The issue I was referring to was more around the organization of a few classes. Usually we'll have the constructor (if present) at the top followed by any class methods. I think there was a class or two where the constructor was at the bottom or something like that. In any event, consider the issue resolved.

mbeckerle commented 6 months ago

@cgivre @paul-rogers is there an example of a Drill UDF that is not part of the drill repository tree?

I'd like to understand the mechanisms for distributing any jar files and dependencies of the UDF that Drill uses. I can't find any such in the quasi-UDFs that are in the Drill tree because, well, they are part of Drill, and so are their dependencies, so this problem doesn't exist for them.

cgivre commented 6 months ago

@cgivre @paul-rogers is there an example of a Drill UDF that is not part of the drill repository tree?

I'd like to understand the mechanisms for distributing any jar files and dependencies of the UDF that Drill uses. I can't find any such in the quasi-UDFs that are in the Drill tree because, well, they are part of Drill, and so are their dependencies, so this problem doesn't exist for them.

@mbeckerle Here's an example: https://github.com/datadistillr/drill-humanname-functions. I'm sorry we weren't able to connect last week.

mbeckerle commented 6 months ago

@cgivre @paul-rogers is there an example of a Drill UDF that is not part of the drill repository tree? I'd like to understand the mechanisms for distributing any jar files and dependencies of the UDF that Drill uses. I can't find any such in the quasi-UDFs that are in the Drill tree because, well, they are part of Drill, and so are their dependencies, so this problem doesn't exist for them.

@mbeckerle Here's an example: https://github.com/datadistillr/drill-humanname-functions. I'm sorry we weren't able to connect last week.

If I understand this correctly, if a jar is on the classpath and has drill-module.conf in its root dir, then drill will find it and read that HOCON file to get the package to add to drill.classpath.scanning.packages.

Drill then appears to scan jars for class files for those packages. Not sure what it is doing with the class files. I imagine it is repackaging them somehow so Drill can use them on the drill distributed nodes. But it isn't yet clear to me how this aspect works. Do these classes just get loaded on the distributed drill nodes? Or is the classpath augmented in some way on the drill nodes so that they see a jar that contains all these classes?

I have two questions:

(1) what about dependencies? The UDF may depend on libraries which depend on other libraries, etc.

(2) what about non-class files, e.g., things under src/main/resources of the project that go into the jar, but aren't "class" files? How do those things also get moved? How would code running in the drill node access these? The usual method is to call getResource(URL) with a URL that gives the path within a jar file to the resource in question.

Thanks for any info.

cgivre commented 6 months ago

@cgivre @paul-rogers is there an example of a Drill UDF that is not part of the drill repository tree? I'd like to understand the mechanisms for distributing any jar files and dependencies of the UDF that Drill uses. I can't find any such in the quasi-UDFs that are in the Drill tree because, well, they are part of Drill, and so are their dependencies, so this problem doesn't exist for them.

@mbeckerle Here's an example: https://github.com/datadistillr/drill-humanname-functions. I'm sorry we weren't able to connect last week.

If I understand this correctly, if a jar is on the classpath and has drill-module.conf in its root dir, then drill will find it and read that HOCON file to get the package to add to drill.classpath.scanning.packages.

I believe that is correct.

Drill then appears to scan jars for class files for those packages. Not sure what it is doing with the class files. I imagine it is repackaging them somehow so Drill can use them on the drill distributed nodes. But it isn't yet clear to me how this aspect works. Do these classes just get loaded on the distributed drill nodes? Or is the classpath augmented in some way on the drill nodes so that they see a jar that contains all these classes?

I have two questions:

(1) what about dependencies? The UDF may depend on libraries which depend on other libraries, etc.

So UDFs are a bit of a special case, but if they do have dependencies, you have to also include those JAR files in the UDF directory, or in Drill's 3rd party JAR folder. I'm not that good with maven, but I've often wondered about making a so-called fat-JAR which includes the dependencies as part of the UDF JAR file.

(2) what about non-class files, e.g., things under src/main/resources of the project that go into the jar, but aren't "class" files? How do those things also get moved? How would code running in the drill node access these? The usual method is to call getResource(URL) with a URL that gives the path within a jar file to the resource in question.

Take a look at this UDF. https://github.com/datadistillr/drill-geoip-functions This UDF has a few external resources including a CSV file and the MaxMind databases.
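For concreteness, a minimal sketch of the class-loader resource access pattern from the question above, with a hypothetical resource path:

```java
// Minimal sketch: reading a non-class file that was packaged under
// src/main/resources and so lives inside a jar on the classpath.
// The resource path is hypothetical.
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

public class ResourceAccessSketch {
  public static void main(String[] args) throws IOException {
    // Resolves to a jar:file:...!/schemas/example.dfdl.xsd style URL when the
    // resource is inside a jar; null if it is not on the classpath.
    URL url = ResourceAccessSketch.class.getResource("/schemas/example.dfdl.xsd");
    System.out.println("resolved to: " + url);

    try (InputStream in = ResourceAccessSketch.class.getResourceAsStream("/schemas/example.dfdl.xsd")) {
      System.out.println(in == null ? "not on classpath" : "readable, first byte: " + in.read());
    }
  }
}
```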

Thanks for any info.

mbeckerle commented 6 months ago

Ok, so the geo-ip UDF stuff has no special mechanisms or description about those resource files, so the generic code that "scans" must find them and drag them along automatically.

That's the behavior I want.

@cgivre What is "Drill's 3rd Party Jar folder"?

If a magic folder just gets dragged over to all nodes, and Drill uses a class loader that arranges for jars in that folder to be searched, then there is very little to do, since a DFDL schema can be just a set of jar files containing related resources, plus the classes for Daffodil's own UDFs and layers, which are Java code extensions of Daffodil itself.

mbeckerle commented 3 months ago

This now passes all the daffodil contrib tests using the published official Daffodil 3.7.0.

It does not yet run in any scalable fashion, but the metadata/data interfacing is complete.

I would like to squash this to a single commit before merging, and it needs to be tested rebased onto the latest Drill commit.

mbeckerle commented 3 months ago

Creating a new squashed PR so as to avoid loss of the comments on this PR.