enso-org / enso

Enso Analytics is a self-service data prep and analysis platform designed for data teams.
https://ensoanalytics.com
Apache License 2.0
7.39k stars, 323 forks

Add metadata version #11479

Open 4e6 opened 2 weeks ago

4e6 commented 2 weeks ago

Extracted from https://github.com/enso-org/enso/pull/11390#issuecomment-2453808096

Issue

Add a version to the metadata section of the file. Possible implementations:

Separate section

The version relates to the whole metadata section, so it can change the way the rest of the metadata is parsed.

#### METADATA ####
{"version":1}
[]
{}

Encoded in METADATA string

Same as the previous option, but with a different encoding.

#### METADATA v1 ####
[]
{}
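A minimal sketch of how a reader could recognize this header, assuming the exact spelling shown above. The function name `header_version` and the choice to report an unversioned header as version 0 are assumptions made for illustration, not part of the proposal.

```python
import re
from typing import Optional

# Matches "#### METADATA ####" (legacy) and "#### METADATA v<N> ####" (versioned).
HEADER_RE = re.compile(r"^#### METADATA(?: v(\d+))? ####$")


def header_version(line: str) -> Optional[int]:
    """Return the header version, 0 for a legacy header, None if not a header."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    # Assumption: a header without a version is treated as version 0.
    return int(match.group(1)) if match.group(1) else 0
```

With this encoding the reader has to know the version before it can decide how to parse the lines that follow, so the header check comes first.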

Extend the metadata

Semantically relates to the last metadata line but does not require changing the parser.

#### METADATA ####
[]
{"version":1,"ide":{}}
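A rough sketch of why this option needs no parser change: the section keeps its current shape, and a reader only looks for an extra key in the last JSON line. The helper name `read_metadata_version` and the default of 0 for files without a version are assumptions for illustration.

```python
import json

HEADER = "#### METADATA ####"


def read_metadata_version(source: str) -> int:
    """Return the "version" field of the last metadata line, or 0 if absent."""
    _code, found, section = source.partition("\n" + HEADER + "\n")
    if not found:
        return 0  # no metadata section at all
    lines = [line for line in section.splitlines() if line.strip()]
    if not lines:
        return 0
    # Assumption: the last metadata line is a JSON object, as in the example above.
    return json.loads(lines[-1]).get("version", 0)
```

Existing files whose last line is `{}` simply report the default, so old and new files can be read by the same code.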

Make the METADATA section a comment

Make the metadata section an Enso comment since we're changing the parser anyway.

#### METADATA ####
  {"version":1}
  []
  {}

kazcw commented 2 weeks ago

External metadata

There are a few reasons to have the metadata inside the file as we do now:

  1. Atomicity
  2. Transparency
  3. Portability

I think that each of these reasons could be (or already has been) addressed without needing the metadata to be part of the file.

Moving the metadata out of the file would enable large efficiency improvements--for example, it removes the need for the format to be text-safe.

Atomicity

Atomicity is less of a concern now that the metadata format is resilient--if the metadata and the source file end up slightly out of sync (e.g. due to a sudden process exit), this should cause little or no disruption to metadata usability.

Transparency

The metadata format has never been very human-readable; we can probably address this use case better with improved tooling.

Portability

Portability can be achieved without keeping all the data in the file; we just need unique file IDs:

# ENSO file-id: 44e510 #

In the metadata database, we would look up metadata primarily by file-id. Each file-id would have one "origin" FS path; if we find a file-id at a different path, we read the data according to the claimed file-id, then we assign a new ID for the new path--this way copies would share data when initially read, but evolve independently.
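The lookup rule above can be sketched roughly as follows. `MetadataStore`, its in-memory dictionaries, and the way new IDs are generated are all hypothetical stand-ins for the metadata database, chosen only to illustrate the copy behavior.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class MetadataStore:
    """Hypothetical in-memory stand-in for the metadata database."""

    origin: dict = field(default_factory=dict)    # file-id -> origin FS path
    metadata: dict = field(default_factory=dict)  # file-id -> metadata

    def register(self, path: str, data: dict) -> str:
        """Store `data` under a fresh file-id whose origin is `path`."""
        file_id = f"{len(self.origin):06x}"  # assumption: any unique ID scheme works
        self.origin[file_id] = path
        self.metadata[file_id] = data
        return file_id

    def open(self, claimed_id: str, path: str):
        """Return (file_id, metadata) for the file at `path` claiming `claimed_id`."""
        data = self.metadata.get(claimed_id, {})
        if self.origin.get(claimed_id) == path:
            return claimed_id, data
        # Different path: read the claimed data, then continue under a new ID
        # so that the copy and the original evolve independently.
        copied = copy.deepcopy(data)
        return self.register(path, copied), copied
```

After `open` returns a new ID, the file-id comment in the copied file would need to be rewritten to match; that step is omitted here.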

This approach would allow metadata to "follow" moved or copied files as well as it does now, within a local filesystem. It wouldn't work when sending a file between computers, but the IDE cannot operate on one file in isolation anyway; users already need to import/export a project, in which case we could include metadata in the project file.

Side-note: Comment type

If we want metadata to be ignored by the parser without the parser needing to recognize it specifically, a doc comment (starting with ##) is the wrong kind of comment. During translation we do some work to assemble a doc comment into an abstracted text string, and then we place it in the IR; in the future we are likely to introduce a warning for unused documentation. Plain comments are more "ignored"--the parser represents them exactly, and does nothing else with them.

JaroslavTulach commented 2 weeks ago

External metadata

I'd rather move forward in smaller steps, e.g. versioning and making the section a comment to begin with, and any other changes later.

_If we want metadata to be ignored by the parser without the parser needing to recognize it specifically, a doc comment (starting with ##) is the wrong kind of comment._

OK, so what do you suggest? Is:

#*** META-DATA 2.0 ***#
  [json1...]
  [json2...]

better?

farmaazon commented 2 weeks ago

My two cents:

  1. Transparency - the metadata is currently not readable, but we have talked about making it better, for example by identifying nodes by name and putting it in YAML format.
  2. Portability - the proposed solution still breaks if someone sends just a file to a friend, without a project.
  3. Also, there is a problem with version control - the file metadata should also be versioned, and keeping everything in a single file simplifies any action (moving, checkout, etc.).

Of course, we can solve those problems, but personally I don't see any efficiency improvements large enough to justify the effort.