SixArm / usv

Unicode Separated Values (USV) data markup for units, records, groups, files, streaming, and more.
Always make the last separators mandatory #2

Open cipriancraciun opened 2 years ago

cipriancraciun commented 2 years ago

First of all, given there is no clear specification, I interpret the current USV as described in issue #1.

Thus, my suggestion is to make the unit / record / group / file separators mandatory at the end of each such block.

The reasons are:

And, if those are not convincing enough, here is a practical reason: it's simpler to write the formatter, because one can just print the last separator without checking if this was indeed the last item in its block:

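for f in files:
  for g in f.groups:
    for r in g.records:
      for u in r.units:
        print(u.value, end="")  # unit content, no newline added
        print(US, end="")       # terminate every unit, including the last
      print(RS, end="")         # terminate every record
    print(GS, end="")           # terminate every group
  print(FS, end="")             # terminate every file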

(I'll leave it to others to think about the implementation where the last separator is not mandatory.) :)
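For comparison, here is a minimal runnable sketch of both styles, restricted to units and records for brevity. It assumes the separators are the Unicode control-picture symbols U+241F and U+241E; the function names `format_terminated` and `format_joined` are hypothetical and not part of any USV tooling.

```python
# Assumption: USV separators are the Unicode control-picture symbols.
US = "\u241F"  # unit separator symbol
RS = "\u241E"  # record separator symbol


def format_terminated(records):
    """Terminator style: every unit and every record is followed by its separator."""
    out = []
    for record in records:
        for unit in record:
            out.append(unit)
            out.append(US)
        out.append(RS)
    return "".join(out)


def format_joined(records):
    """In-between style: separators appear only between items."""
    return RS.join(US.join(record) for record in records)


records = [["a", "b"], ["c"]]
terminated = format_terminated(records)
joined = format_joined(records)
```

One consequence worth noting: in the joined style, an empty table (`[]`) and a table holding a single empty unit (`[[""]]`) both serialize to the empty string, while the terminated style keeps them distinct.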

joelparkerhenderson commented 2 years ago

Yes your writeup is excellent. In practice, I see two additional issues that are related to your points. What I'd like to do is keep this issue open and use it for discussion because I'm 100% aiming to standardize and do a BNF and similar, and this repo is helping to find corner cases and to find advice.

  1. CSV and TSV files often end with a newline, which makes these formats easier to edit with a typical line-oriented editor, and also easier to commit to repositories that require every text file to have a final newline, or that use line-oriented merge tools that flag a missing final newline. In practice, a USV format user will often encounter a final newline to deal with, or delete, because of line-oriented Unix tools. An open question is how much of a pain point it would be to enforce zero trailing newline in a typical developer's editor.

  2. The primary use cases that I've seen so far in the past few years I've been working with USV are for units and records (a.k.a. columns and rows), not groups and files (a.k.a. tables and schemas). So I believe it's highly desirable in practice to have USV work with the unit separator and the record separator, without any group separator or file separator. An open question is the tradeoff between developer ergonomics in a typical editor versus a simpler parser.

What are your thoughts about these?

cipriancraciun commented 2 years ago

Your point (1) (about newlines) I think is more related to issue #3. (I'll reply to it there.)

> The primary use cases that I've seen so far in the past few years I've been working with USV are for units and records (a.k.a. columns and rows), not groups and files (a.k.a. tables and schemas). So I believe it's highly desirable in practice to have USV work with the unit separator and the record separator, without any group separator or file separator. An open question is the tradeoff between developer ergonomics in a typical editor versus a simpler parser.

I also believe that any "tabular format" will deal in 99.999% of the cases only with one table per file. Thus, in terms of specifications, there are two major choices to be made (which are most of the time conflicting):


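That being said, I think in the case of USV these are the most sensible choices:

  1. don't support groups and files at all;
  2. make the separators mandatory as this issue initially suggested;
  3. go back to the drawing board and see if there isn't a better way to support groups / files;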

I'll try to tackle the third choice a bit (i.e. going back to the drawing board with groups / files).

My assumption is that groups (and files) were meant to support multiple tables in the same spreadsheet, and multiple spreadsheets respectively.

However, currently USV misses one important feature of these, namely how to identify which group / file is which? I.e. table / spreadsheets titles.

So perhaps one could rework how groups / files work by introducing some missing features, and perhaps by dropping the symmetry with units / records.

For example (and this is not something I've thoroughly thought about) how about this new syntax:

USV := file + | group + | records
file := FS <file name> US <file description> RS ( ( group ) * | records )
group := GS <group name> US <group description> RS records
records := ( record ( RS record ) * ) ?
record := ( unit ( US unit ) * ) ?

Namely, files and groups are introduced by FS / GS, meanwhile records / units are joined (or in my #2 proposal terminated) by RS / US. Moreover a USV can contain either multiple files, multiple groups, or just records in an unnamed file; then a file can contain multiple groups, or just records in an unnamed group. The US and RS are reused by files and groups to denote the name and description.

It's not as nice as the initial specification, but it does support (without ambiguity) the case of just records, just groups, files with just records, files with groups.
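To make the three entry points of that grammar concrete, here is a rough parser sketch. It assumes the separators are the Unicode control-picture symbols U+241C through U+241F, and the names `parse_usv`, `parse_groups`, `parse_header`, and `parse_records` are hypothetical. It also makes one choice the grammar leaves open: an empty body is read as zero records rather than as one record holding one empty unit. It does not validate stray FS / GS characters inside record bodies.

```python
# Assumption: separators are the Unicode control-picture symbols.
US, RS, GS, FS = "\u241F", "\u241E", "\u241D", "\u241C"


def parse_records(text):
    """records := ( record ( RS record ) * ) ?  (empty text is read as no records)"""
    if text == "":
        return []
    return [record.split(US) for record in text.split(RS)]


def parse_header(chunk):
    """Split '<name> US <description> RS <body>' into its three parts."""
    head, sep, body = chunk.partition(RS)
    if not sep:
        raise ValueError("header must be followed by RS")
    name, sep, description = head.partition(US)
    if not sep:
        raise ValueError("header must contain US between name and description")
    return name, description, body


def parse_groups(text):
    """Parse one or more 'GS <header> records' blocks; text must start with GS."""
    groups = []
    for chunk in text.split(GS)[1:]:
        name, description, body = parse_header(chunk)
        groups.append({"name": name, "description": description,
                       "records": parse_records(body)})
    return groups


def parse_usv(text):
    """USV := file + | group + | records"""
    if text.startswith(FS):
        files = []
        for chunk in text.split(FS)[1:]:
            name, description, body = parse_header(chunk)
            entry = {"name": name, "description": description}
            if body.startswith(GS):
                entry["groups"] = parse_groups(body)
            else:
                entry["records"] = parse_records(body)
            files.append(entry)
        return {"files": files}
    if text.startswith(GS):
        return {"groups": parse_groups(text)}
    return {"records": parse_records(text)}


plain = parse_usv("a" + US + "b" + RS + "c")
grouped = parse_usv(GS + "t1" + US + "first table" + RS + "a" + US + "b"
                    + GS + "t2" + US + "" + RS)
filed = parse_usv(FS + "f" + US + "d" + RS + "x" + US + "y")
```

Because files and groups are introduced by a leading FS / GS, the first character is enough to pick the branch, which is what removes the ambiguity between "just records" and "groups of records".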

Also this second proposal does suffer from the same truncation issue as described in #2, thus perhaps a group terminator and file terminator might be useful, as in:

file := FS <file name> US <file description> RS ( ( group ) * | records ) FS
group := GS <group name> US <group description> RS records GS

I.e. two adjacent files would be joined by FS FS as would two adjacent groups by GS GS.

joelparkerhenderson commented 2 years ago

Lots of info below... I'm hoping I'm responding to each of your points because I very much appreciate your insights.

> in security there is the rule of "canonical representation", thus especially if one were to sign an USV file, there should be a canonical representation (enforced by the parser)

100% agree.

> many parsers would most likely be lenient and just ignore a RS that is immediately followed by a GS or FS

This must be a hard error i.e. the entire parse must be invalid.

TODO: add this to the docs.

> at the moment the empty string is a valid USV; how should it be interpreted?

This must have a spec.

The complement also must have a spec e.g. given a blank spreadsheet, what must the USV export be?

TODO: spec this.

> given that separators are not mandatory, any file (that is an UTF-8 valid one) that doesn't contain separators is a valid USV file with a single file/group/record/unit;

You're correct this is an issue.

How do these issues interact with similar data exchange formats?

I believe you're homing in on a tension between these options:

> detecting a truncated file

How about delegating this to a checksum that's out of scope of USV?
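As a rough illustration of what "out of scope" could look like, the sketch below computes a digest alongside the USV payload and checks it on the receiving side. SHA-256 and the helper names `usv_digest` and `verify` are assumptions chosen for the example; the thread does not name an algorithm or a transport for the checksum.

```python
import hashlib


def usv_digest(data: bytes) -> str:
    """Hex digest of the raw USV bytes (SHA-256 is an assumption, not part of USV)."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected: str) -> bool:
    """Return True only if the received bytes match the digest sent alongside them."""
    return usv_digest(data) == expected


payload = "a\u241Fb\u241E".encode("utf-8")
digest = usv_digest(payload)        # shipped out of band, e.g. in a sidecar file
intact = verify(payload, digest)
truncated = verify(payload[:-1], digest)
```

The format itself stays unchanged here: truncation is caught by the digest mismatch, not by anything in the USV grammar.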

Detecting unexpected file truncation, or other kinds of unexpected corruption, is a big scope increase (IMHO) for a simple format.

> at the moment a single value-1␟value-2 is a valid USV

Yes, and real-world cases have come up somewhat often where the content is solely units, never records.

In practice, the big ones so far have involved logging:

Worth mentioning: the real-world cases somewhat often use different dimensions, meaning each record has a different number of units. In other words, the data isn't an X,Y grid. A typical example is walking file systems, where directories (which are treated as USV records) can have different numbers of entries (which are treated as USV units).
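A small sketch of such non-grid data, using a made-up in-memory directory listing rather than a real file system walk. It assumes the separators are the Unicode control-picture symbols U+241F and U+241E and uses in-between separators, as the current USV does.

```python
# Assumption: separators are the Unicode control-picture symbols.
US = "\u241F"  # unit separator symbol
RS = "\u241E"  # record separator symbol

# Hypothetical listing: each record is a directory, each unit is an entry.
listing = [
    ["README.md", "LICENSE"],
    ["main.py"],
    ["a.txt", "b.txt", "c.txt"],
]

encoded = RS.join(US.join(entries) for entries in listing)
decoded = [record.split(US) for record in encoded.split(RS)]
widths = [len(record) for record in decoded]

# An empty directory does not survive the round trip in this style:
empty_dir_roundtrip = US.join([]).split(US)
```

Records of different lengths round-trip without trouble, but an empty directory comes back as a record holding one empty unit, which is the same ambiguity this issue raises about the empty string.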

> That being said, I think in the case of USV these are the most sensible choices:
>
>   1. don't support groups and files at all;
>   2. make the separators mandatory as this issue initially suggested;
>   3. go back to the drawing board and see if there isn't a better way to support groups / files;

I agree with your choices.

1 is not viable because groups are a must-have in practice, in order to be able to export a typical database set of schemas, or a typical Excel spreadsheet set of folios. The real world use case is import/export of all the data, which is then slurped into another system that knows enough about the data structure. For import/export where the other system doesn't know enough about the data, we use a typical Postgres database dump (including metadata, table layouts, etc.), or a zip file of Excel .xls files (including metadata, macros, etc.).

2 I want to think more about this

3 Likewise

> here is a practical reason: it's simpler to write the formatter, because one can just print the last separator without checking if this was indeed the last item in its block:

I would describe that style of loop as using content "terminators" or "trailing separators", rather than content "splitters" a.k.a. "in-between separators".

This feels akin to C style null terminated strings.

My intuition is there are large advantages to this approach, such as for streaming data: a stream source can output a unit and its terminator without needing to be aware of whether there's a next unit coming. What would you do to trigger the start-of-file or start-of-group or start-of-record or start-of-unit?
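A minimal sketch of the consuming side of such a stream, limited to units and records. It assumes the terminator style proposed in this issue and the Unicode control-picture symbols U+241F and U+241E; the function name `stream_records` is hypothetical. In this sketch a record implicitly starts right after the previous terminator, which is only one possible answer to the start-of-record question and not something the thread has settled.

```python
# Assumption: separators are the Unicode control-picture symbols.
US = "\u241F"  # unit terminator
RS = "\u241E"  # record terminator


def stream_records(chunks):
    """Yield each record (a list of units) as soon as its RS terminator arrives."""
    unit = []    # characters of the unit currently being read
    record = []  # completed units of the record currently being read
    for chunk in chunks:
        for ch in chunk:
            if ch == US:
                record.append("".join(unit))
                unit = []
            elif ch == RS:
                if unit:
                    raise ValueError("unit not terminated before RS")
                yield record
                record = []
            else:
                unit.append(ch)
    if unit or record:
        raise ValueError("stream ended mid-record (possible truncation)")


# Chunk boundaries are arbitrary and may fall in the middle of a unit.
complete = list(stream_records(["a" + US + "b", US + RS + "c" + US + RS]))

try:
    list(stream_records(["a" + US + "b"]))
    truncation_detected = False
except ValueError:
    truncation_detected = True
```

Because every unit and record ends with its terminator, the reader can emit a record without lookahead, and a stream that stops before a terminator is reported as an error instead of being silently accepted.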

OTOH, it's a totally different approach than CSV, TSV, ASV, all of which use in-between separators.

> USV misses one important feature of these, namely how to identify which group / file is which? I.e. table / spreadsheets titles.

Yes. In practice this hasn't been an issue because the reader and writer both pre-agree on the overall data structure. In other words, USV hasn't yet aimed to reconstitute table names, nor even table column headers. For example, USV doesn't specify that a record's first row is the column names. Whenever we've needed to reconstitute the data structure, we've switched from USV to more-powerful formats (e.g. Postgres dump, Excel zips, etc. as above).