apache / daffodil-vscode

Apache Daffodil™ Extension for Visual Studio Code
https://daffodil.apache.org/
Apache License 2.0

Included/Imported schema files not found #54

Closed: mbeckerle closed this issue 2 years ago

mbeckerle commented 2 years ago

Something is up with the class path.

We have many schemas where one schema depends on another. E.g., PCAP, which depends on ethernetIP.

When I try to use daffodil-vscode to debug PCAP, I specify the pcap.dfdl.xsd schema, then the icmp.cap data file, and then I get an error indicating daffodil is unable to resolve the inter-schema references.

mbeckerle commented 2 years ago

There are a few different aspects to this problem.

First, locally, schemas are preferably organized using the standard schema layout. This means the schema files will be spread out over src/main/resources and src/test/resources. E.g., the schema files under src/main/resources may not define any global elements at all, just types. The global elements may only be defined under src/test/resources.

This means the schema is structurally assuming that both src/main/resources and src/test/resources are on the class path when testing.
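
To make that concrete, here is a tiny sbt sketch (the task name is made up) that prints the test classpath Daffodil effectively sees under 'sbt test'; with the standard layout, the outputs for both src/test/resources and src/main/resources show up as entries:

```scala
// build.sbt sketch (illustrative task name): print the ordered test classpath.
// With the standard layout the copied test resources (.../test-classes) and
// main resources (.../classes) both appear as entries, test before main.
lazy val printTestClasspath = taskKey[Unit]("Print the ordered test classpath")

printTestClasspath := {
  (Test / fullClasspath).value.map(_.data).foreach(f => println(f.getAbsolutePath))
}
```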

The second aspect is about one schema depending on another. The whole transitive closure of the chain of inter-schema dependencies is captured in the build.sbt file, and managed dependencies are used when running 'sbt test' to obtain all the schemas upon which this one depends (transitively).
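
As an illustration only (the coordinates below are placeholders, not the real DFDLSchemas artifacts), such a dependency is declared like any other managed dependency; the exact scope and configuration mapping will vary with how the schema project is laid out:

```scala
// build.sbt sketch with placeholder coordinates: a PCAP-like schema project
// declares the schema it imports as a managed dependency; sbt resolves it and
// its transitive dependencies and puts those jars on the classpath.
libraryDependencies += "com.example.dfdl" % "ethernet-ip-schema" % "1.0.0" % "test"
```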

We need some way for the debugger to inherit this as well. Other IDEs either parse and process the sbt file themselves, or use sbt as a server of sorts so they can be fed the dependency information, or require a command such as 'sbt eclipse' to be issued, which writes out all the dependencies (transitively) in a file format suitable for the IDE (in this case Eclipse).

However we do it, we need some way to capture the inter-component dependencies without requiring users to do much work to set it up.
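
One possible shape for this, in the spirit of 'sbt eclipse', would be a small sbt task that writes the fully resolved test classpath to a file the VS Code extension could read. A rough sketch (task and file names are made up):

```scala
// Hypothetical sbt task: write the resolved test classpath (jars plus
// resource output directories) to a file the debugger front end could read.
lazy val exportDaffodilClasspath =
  taskKey[File]("Write the test classpath to target/daffodil-classpath.txt")

exportDaffodilClasspath := {
  val out     = target.value / "daffodil-classpath.txt"
  val entries = (Test / fullClasspath).value.map(_.data.getAbsolutePath)
  IO.write(out, entries.mkString(java.io.File.pathSeparator))
  out
}
```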

Just setting this up for using the Daffodil CLI is pretty miserable. You have to edit these massive classpath entries, then log out/log in to get them to take effect, etc., which is why I almost always invoke Daffodil via 'sbt test'.

mbeckerle commented 2 years ago

Further point: the order of things on the classpath matters very much. src/test/resources needs to be earlier on the classpath than src/main/resources.

Secondly, there are things on the classpath other than just schema jars or xsd files. There are jars containing user-defined function definitions, or jars containing validators, or jars containing pluggable layer transformers.

All those things are handled via sbt managed dependencies, so 'sbt test' finds them all for you based on their being declared in the build.sbt and available from some binary server or the local m2 or ivy2 cache.
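
For illustration (made-up coordinates), those extra jars ride along via the same mechanism:

```scala
// build.sbt sketch (placeholder coordinates): UDF, validator, and layer jars
// are just additional managed dependencies, resolved from a binary repository
// or the local ~/.m2 / ~/.ivy2 cache and placed on the same classpath.
resolvers += Resolver.mavenLocal
libraryDependencies ++= Seq(
  "com.example.dfdl" % "my-udf-functions"     % "1.0.0",
  "com.example.dfdl" % "my-custom-validator"  % "1.0.0",
  "com.example.dfdl" % "my-layer-transformer" % "1.0.0"
)
```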

mbeckerle commented 2 years ago

According to https://stackoverflow.com/questions/2942536/is-the-java-classpath-final-after-jvm-startup it's not possible to change the class path dynamically after JVM startup except by implementing a class loader.
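
To be concrete, the minimal version of "implementing a class loader" is just layering a URLClassLoader over an existing one (sketch only; presumably the hard part is getting Daffodil's schema, UDF, and layer lookups to actually go through it):

```scala
// Sketch of the class-loader route: a URLClassLoader layered over an existing
// loader makes extra schema/UDF jars visible at runtime without touching the
// classpath the JVM was started with.
import java.io.File
import java.net.URLClassLoader

def schemaLoader(jars: Seq[File], parent: ClassLoader): ClassLoader =
  new URLClassLoader(jars.map(_.toURI.toURL).toArray, parent)
```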

Assuming we don't want to implement a class loader (I have no idea how hard that is), this would mean the vscode front end would want to spin up a distinct server for each DFDL schema being debugged, so as to establish the project class path properly for it based on all its dependencies.
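
Roughly what I have in mind, as a sketch only (names made up): each debug session launches its own backend JVM with a classpath assembled from that schema's resolved dependencies.

```scala
// Hypothetical sketch: one backend JVM per schema being debugged, each with
// its own classpath built from that schema's dependencies.
import scala.sys.process._

def launchDebugBackend(classpath: Seq[String], mainClass: String): Process = {
  val cp = classpath.mkString(java.io.File.pathSeparator)
  Seq("java", "-cp", cp, mainClass).run() // fresh JVM, fresh classpath
}
```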

jw3 commented 2 years ago

@mbeckerle - revisiting this after seeing issue #57; do you think this issue remains a blocker for the 1.0.0 milestone?

mbeckerle commented 2 years ago

The ethernetIP schema at DFDLSchemas on GitHub is the one I normally point people at when they are trying to learn DFDL. Many engineers have familiarity with IP packets and such, so it's a really useful example.

It is divided into three files as a best practice: the base format is in one file, the primary schema in another, and the ipAddress component is in a third file.

Debugging ethernetIP really does need to work without having to combine files or move files around. I suggest using that as the example for development as I think we've gotten as far as we can with the single-file jpeg DFDL schema.

arosien commented 2 years ago

Sorry to be out of touch; I've been on furlough, and we also just had a new member of the family arrive.

I didn't initially understand the connection between the classpath and schema import resolution, but then I found the explicit link via sbt between the PCAP schema (in one GitHub repo) and the dependent schema (in another GitHub repo).

As a user, I'm not sure the name of the environment variable makes sense because it references the classpath, which isn't really a schema-related concept. Perhaps rename this to some kind of "import path"?

mbeckerle commented 2 years ago

I created https://issues.apache.org/jira/browse/DAFFODIL-2616 for the missing documentation of how this include/import, jar packaging, and classpath stuff works.

The reason we use the classpath explicitly is that the "normal" packaging of DFDL schemas is to put them into jar files for distribution, just like any other kind of software component. Then managed dependencies will grab them (from Maven Central, or your favorite binary server) based on maven/sbt dependencies.

These jar files not only contain the schema itself, but can also contain Scala or Java code for user-defined functions the schema requires, and, as of Daffodil 3.2.1, Scala code (well, class files) for layer transformers. So they really are classpath jars in the Java/Scala JVM sense.
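
As a rough sketch (all names made up), the build for such a schema jar is just an ordinary sbt project; 'sbt package' bundles the .dfdl.xsd resources together with any compiled UDF or layer classes into one artifact that other schema projects can depend on:

```scala
// Illustrative build.sbt for packaging a DFDL schema as a plain jar. The
// schema files live under src/main/resources; any UDF or layer code lives
// under src/main/scala or src/main/java and lands in the same artifact.
organization := "com.example.dfdl"
name         := "my-format-schema"
version      := "0.1.0"
scalaVersion := "2.12.15" // placeholder
```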

We do need real end-user documentation on how the Daffodil resolver implements the import/include schemaLocation attribute.