OpenEnergyPlatform / omi

Repository for the Open Metadata Integration (OMI). For the metadata definition, see the metadata repo:
https://github.com/OpenEnergyPlatform/metadata
GNU Affero General Public License v3.0

Use jsonschema specification for validation #26

Closed. MGlauer closed this issue 1 year ago.

MGlauer commented 4 years ago

Closes #20

4lm commented 4 years ago

@MGlauer,

I see problems with this .gitmodules approach. For me, the library delivery mechanism is PyPI packages, not hard-coupled (1, 2) files in folders. Because the folder structures are referenced as fixed strings and are not versioned (they can always change), this approach is IMO not good. For example, the name of the repo/modules (folders) may change in the future. IMO a repo is not an archive; we should be able to refactor our code without worrying about external linking.

Why not use the package "metadata" from PyPI?

PS: IMO if one would like to hard-link a fixed version of a file, one could use a tagged version, like: https://raw.githubusercontent.com/OpenEnergyPlatform/metadata/v1.0.1/metadata/v140/schema.json
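
For illustration, a minimal sketch (not part of this PR) of what consuming such a pinned raw URL could look like: fetch the tagged schema.json and validate a metadata document against it with the jsonschema package. The URL is the tagged one from above; `metadata_dict` is a hypothetical placeholder.

```python
import requests
from jsonschema import validate

# Tagged raw URL from the comment above - pins the schema to one release.
SCHEMA_URL = (
    "https://raw.githubusercontent.com/OpenEnergyPlatform/"
    "metadata/v1.0.1/metadata/v140/schema.json"
)

schema = requests.get(SCHEMA_URL, timeout=10).json()
metadata_dict = {}  # placeholder: replace with the metadata document to check
validate(instance=metadata_dict, schema=schema)  # raises ValidationError if invalid
```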

MGlauer commented 4 years ago

The answer is rather simple:

The schema should (imo) not be a Python package. It is a specification that should be independent of a specific programming language. Using it as a Python module might work for us - but what about other modellers who want to use the metadata standard? How would they import it into their projects, which may not be based on Python? Building an infrastructure that is convenient for us but not for others is not a good approach for an open project - especially if we want to establish it as a standard.

But I agree that submodules are not a proper solution because of their general wackiness; I did it this way for a specific reason: tools like OMI have to use different versions of the schema at the same time, as they support several metadata versions. And keeping several versions, as done in the metadata repo, goes against the spirit of version control. It should be possible to specify a certain tag or branch for a submodule. But it seems this is only supported by the newest versions of git, which are not available on many systems that are not Arch :/

Doing the same with the files, as you propose, would mean that developers have to download multiple files and manage them manually the whole time.

We may need a completely different approach here...

4lm commented 4 years ago

> We may need a completely different approach here...

Mhh, now I'm undecided, because there are IMO good arguments on both sides. In the following I try to reflect on this; maybe we find more good arguments and a way forward ...

> The schema should (imo) not be a Python package. It is a specification that should be independent of a specific programming language.

I agree, a schema should not solely be a Python package and should work on its own, but if there is a language-specific package for a schema, why not use it? It has the advantage that it gives you a local copy of the schemas without the need to handle them manually, gives you a standard way to deserialize the JSON schema files into a Python object, and doesn't break your code if there are changes to the schema or to its file and folder structure. On the other hand, this approach has the disadvantage that you regularly have to update the Python package so as not to get out of sync (but it gives you full control over your own code base and doesn't break it without notice).

> Using it as a Python module might work for us - but what about other modellers who want to use the metadata standard? How would they import it into their projects, which may not be based on Python? Building an infrastructure that is convenient for us but not for others is not a good approach for an open project - especially if we want to establish it as a standard.

The metadata repo can be used in both ways: pure URLs handled manually (which have to be checked) or convenient language packages (right now only Python; others could contribute other language implementations). IMO the conflict lies not between pure JSON usage and a specific language implementation for convenient use. It lies in the problem that GitHub is not an archive: code/project names/paths, and with them URLs, might change. Maybe we should reconsider putting the schemas as artefacts of this repo on Zenodo. This would give us updatable resources at stable URLs with version history, checksums and DOIs (which IMO the scientific community would love for citation). This would IMO also help "to establish it as a standard".

> Tools like OMI have to use different versions of the schema at the same time, as they support several metadata versions.

That's IMO already solved by having v130/schema.json and v140/schema.json. The problem right now is more that we are not completely done with specifying them, so we have file changes within a specific version (v140, v130), and IMO even if we think we are done specifying something, we never will be. If we patch a version we should also bump the semver version, for example from v130 to v131. Maybe for this, folder names (-> URLs) would have to change from, for example, v130 to v1.3.0 to be fully semver compliant; the 17th patch of v1.3.0 would then be v1.3.17. Or maybe even better, we should only have major and minor in the path name, for example v1.3, and only specify the full semver version in the JSON, because otherwise we have too many updates (URLs) to communicate. This gives us and the user complete control over versioning, either on the resource endpoint via URL or in a language package. This is independent of the question Zenodo yes/no.
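
To make that proposal concrete, a small hypothetical sketch (the v<major>.<minor> folder layout does not exist yet; the base URL and paths are assumptions for illustration):

```python
BASE_URL = "https://raw.githubusercontent.com/OpenEnergyPlatform/metadata"

def schema_url(major: int, minor: int, ref: str = "master") -> str:
    """Return the (proposed) URL of the schema for a major.minor version.

    Patch releases would change the content behind this URL, not the URL
    itself; the full semver string would live inside the JSON.
    """
    return f"{BASE_URL}/{ref}/metadata/v{major}.{minor}/schema.json"

# schema_url(1, 3) -> ".../master/metadata/v1.3/schema.json"
```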

> And keeping several versions, as done in the metadata repo, goes against the spirit of version control.

I don't agree; the schemas themselves are not program code, they are standards, and if we want to be able to support different versions of a standard in our apps, we should have a different resource locator (folder -> URL endpoint) for every major and minor version. Not for the patches; in semver those are the "version when you make backwards compatible" changes. We just have to use semver correctly - that's the hard part IMO ;)

> But I agree that submodules are not a proper solution because of their general wackiness [...] It should be possible to specify a certain tag or branch for a submodule. But it seems this is only supported by the newest versions of git, which are not available on many systems that are not Arch :/

You are right, the git approach seems wacky; personally I wouldn't use it. I would always keep local copies of the schemas I want to use in my code base and check in my tests whether there is a new version at the origin URL. That's, for example, what the Python code base does with the metaschema our schemas depend on.
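
As a sketch of such a test (assumptions: the vendored copy lives under src/omi/dialects/oep/spec/ and the origin is the tagged raw URL mentioned earlier; both paths are illustrative, not the actual repo layout):

```python
import json
from pathlib import Path

import requests

ORIGIN_URL = (
    "https://raw.githubusercontent.com/OpenEnergyPlatform/"
    "metadata/v1.0.1/metadata/v140/schema.json"
)
LOCAL_COPY = Path("src/omi/dialects/oep/spec/v140/schema.json")  # assumed path

def test_local_schema_is_in_sync_with_origin():
    # Compare the vendored schema with the published one and fail loudly
    # if they have drifted apart.
    remote = requests.get(ORIGIN_URL, timeout=10).json()
    local = json.loads(LOCAL_COPY.read_text())
    assert local == remote, "vendored schema.json is out of sync with origin"
```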

> Doing the same with the files, as you propose, would mean that developers have to download multiple files and manage them manually the whole time.

Yes, integrating the schemas you use into your own code base and handling them manually is IMO a good and preferable thing! You want full control, not linked resources that might suddenly and without notice break your code or render your data invalid on schema changes. That's why there is, for all Python users, the Python package, which handles this for you, but it could be optimized by only presenting the major and minor versions as constants:

METADATA_V130_SCHEMA -> METADATA_V1_3_SCHEMA

This way your code base can gradually grow, can support multiple major/minor versions, and doesn't break. If you want to do this without the Python package, fine, but then IMO you would have to do all of this yourself in omi. Why not use the Python package, which already does this and, after the above-mentioned changes, would make changes to your code base unnecessary for patch versions? Your code base would then only grow to support new versions and stay backwards compatible with old ones. Don't get me wrong: a pure JSON version reachable via URLs (that don't change) has to work (Zenodo to the rescue?). But if there is a language implementation that does all the steps you would have to do anyway, why not use it?
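
For comparison, roughly what using the package would look like (the import path and the exact form of the constant are assumptions based on the naming above, not verified against the published package):

```python
from jsonschema import validate

# Assumed import path and constant name, based on the naming above; the
# constant is assumed to hold the parsed schema as a Python dict.
from metadata import METADATA_V130_SCHEMA

metadata_dict = {}  # placeholder: replace with the metadata document to check
validate(instance=metadata_dict, schema=METADATA_V130_SCHEMA)
```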

What do you think? How shall we proceed? Maybe @christian-rli has an opinion on this as well?

Edit: Maybe @Bachibouzouk also has an opinion on this?

Bachibouzouk commented 4 years ago

> Doing the same with the files, as you propose, would mean that developers have to download multiple files and manage them manually the whole time.

Meta git suggestion: use and edit the first post of this discussion to quickly inform the reader of the current state :)

4lm commented 4 years ago

@MGlauer another question: the checks are failing, Travis says a .git file or folder is missing:

lists of files in version control and sdist do not match!
missing from VCS:
  src/omi/dialects/oep/spec/.git
ERROR: InvocationError for command /home/travis/build/OpenEnergyPlatform/omi/.tox/check/bin/check-manifest . (exited with code 1)

Any idea?

MGlauer commented 4 years ago

Yes, this seems to be related to the submodule - one more reason against it.

But as this PR seems rather problematic and does not yield any actual functionality, I would like to postpone it for the upcoming release.

4lm commented 4 years ago

> But as this PR seems rather problematic and does not yield any actual functionality, I would like to postpone it for the upcoming release.

OK, roger that!

jh-RLI commented 2 years ago

I think we should provide an API endpoint that serves the schema.json. This would provide a stable OEP-related URL and would enable other programming languages to access the schema.json.
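
A minimal sketch of such an endpoint (illustrative only, using Flask rather than the actual OEP stack; the route and the schema directory are assumptions):

```python
import json
from pathlib import Path

from flask import Flask, abort, jsonify

app = Flask(__name__)
SCHEMA_DIR = Path("metadata")  # assumed location of the schema folders (v140, ...)

@app.route("/api/metadata/schema/<version>/")
def serve_schema(version):
    # e.g. /api/metadata/schema/v140/ -> metadata/v140/schema.json
    if ".." in version:
        abort(404)  # small guard against path traversal
    path = SCHEMA_DIR / version / "schema.json"
    if not path.is_file():
        abort(404)
    return jsonify(json.loads(path.read_text()))
```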

Also, something related to the Python package: I noticed that the JSON files are currently not included in the PyPI package.
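
For the packaging issue, a sketch (assuming a setuptools-based build and the src/ layout seen in the CI log above) of how the JSON files could be shipped with the sdist/wheel:

```python
# setup.py (excerpt)
from setuptools import find_packages, setup

setup(
    name="omi",
    package_dir={"": "src"},
    packages=find_packages("src"),
    # Include the schema JSON files in the built package; the subpaths are an
    # assumption based on src/omi/dialects/oep/spec from the CI output above.
    package_data={"omi.dialects.oep": ["spec/*.json", "spec/*/*.json"]},
)
```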

jh-RLI commented 1 year ago

This is reimplemented in #63