Update APE-6 for subtype support

taldcroft commented 3 years ago

This updates APE-6 to add support for extended data types, in particular multidimensional fixed-dimension columns, variable-length arrays, and object columns. This is done primarily by adding a subtype data keyword and specifying the convention for using this keyword for new supported extended types.

The discussion in https://github.com/astropy/astropy/issues/11368 was a large driver for this update including details of the new specification.

Writing standards documents is not my forte, so comments are most welcome!

cc: @mbtaylor @mhvk @hamogu @dhomeier

hamogu commented 3 years ago

Looks good to me.

mbtaylor commented 3 years ago

@taldcroft, this basically looks good, but I have couple of suggestions for change.

First, that use of subtype is not restricted to string columns. I agree that most uses, including the array and object ones enumerated here, will be for string columns, but (a) I don't see any cost to relaxing this restriction, and (b) I can imagine possibilities where non-string subtypes might be desirable, e.g. datatype: float64, subtype: MJD; I don't know if python/numpy has specific datatypes for time values, but some languages or parsing environments might do. Usages like this don't have to be written into the APE now or ever, but if either ECSV v1.x or individual users decide something like that would be a good idea in future, it won't conflict with the existing specification.

So where the subtype keyword is introduced I would remove or replace the sentence "If subtype is defined the datatype must be set to string", and in the Subtype data section replace the sentence:

If table readers do not recognize the subtype then the column should be returned as a string.

with something like:

If table readers do not recognise or support the subtype then they may ignore it and use the datatype only.

A few other knock-on changes elsewhere in the text will be required for consistency.

Second, for the "object" subtype, I'd suggest renaming it to "json". This is more descriptive, but also more accurate: in your example file in this section, one of the cells has the value true, which is JSON, but is not a JSON object (which would have to be enclosed in curly brackets).

taldcroft commented 3 years ago

Relaxing the restriction on requiring string

Sounds fine and I agree with allowing flexibility for future enhancements.

Second, for the "object" subtype, I'd suggest renaming it to "json".

OK. This requires a small change to the merged ECSV PR (https://github.com/astropy/astropy/pull/11662).

mhvk commented 3 years ago

Now had a look. I like the addition very much. I also like the changes proposed by @mbtaylor - and am happy to quickly review the necessary follow-up to the astropy implementation.

taldcroft commented 3 years ago

@mbtaylor @mhvk - I think have addressed the comments and that this is ready for final review. It's worth viewing the rendered file since I picked up some problems I had not previously noticed.

taldcroft commented 3 years ago

@mbtaylor - my hope was that the "As a point of clarification" sentence would make things clear. I'm certainly coming from a Python or Java perspective where everything (scalars included) is an object. IMO the JSON spec choose poorly in labeling a dict/mapping as "object", so my preference is to stick with the existing text and more common usage of "object".

We can always come back to this choice of wording if it ends up causing confusion. It won't change any implementation details.

Thanks for the sharp eye I'll fix those. I did put each of the examples into the Python parser, but it turns out that since object is a valid numpy dtype having "object" there also works for Python.

mbtaylor commented 3 years ago

Fine, I'm happy with the existing text then. All OK by me.

mbtaylor commented 3 years ago

One other formatting error I spotted: in the second paragraph of the "Header Details" section, there is a long run of monospaced text after "start with the two characters"

taldcroft commented 3 years ago

From @mbtaylor:

"I have been creating some ECSV 1.0 test data, for example the attached file. You might disagree with some of the details, in particular blank values (adjacent delimiters in the file with no text between them) for some data types - that usage does seem to be supported by the astropy reader for at least some data types, though it's not discussed explicitly in APE6, and I'd argue it would be a good idea to allow blank (masked?) values for any column (though I might have said something different in the past). But in any case it would probably be a good idea for us to agree what is and is not legal, and change things if required, so my output handler is not writing files that the astropy input handler can't read, and ideally vice versa."

Good point about blank values as a marker for missing values. I didn't realize that this is not mentioned in APE-6, so that is definitely something to fix in this PR.

Right now astropy supports blank values for all scalar column types. Here is the astropy Table representation of just the scalar type data in test file from @mbtaylor:

In [4]: t.pprint_all()
i_index s_byte s_short s_int s_long s_float s_double    s_string    s_boolean
------- ------ ------- ----- ------ ------- -------- -------------- ---------
      0      0       0     0      0     0.0      0.0           zero     False
      1     --       1     1      1     1.0      1.0            one      True
      2      2      --     2      2     2.0      2.0            two     False
      3      3       3    --      3     3.0      3.0          three      True
      4      4       4     4     --     4.0      4.0           four     False
      5      5       5     5      5     nan      5.0           five      True
      6      6       6     6      6     6.0      nan            six     False
      7      7       7     7      7     7.0      7.0             --      True
      8      8       8     8      8     8.0      8.0 ' "\""' ; '&<>        --
      9      9       9     9      9     nan      nan             --      True
     10    -10     -10   -10    -10   -10.0    -10.0             10     False
     11     --     -11   -11    -11   -11.0    -11.0       10 + one      True
     12    -12      --   -12    -12   -12.0    -12.0       10 + two     False
     13    -13     -13    --    -13   -13.0    -13.0     10 + three      True
     14    -14     -14   -14     --   -14.0    -14.0      10 + four     False
     15    -15     -15   -15    -15     nan    -15.0      10 + five      True

For the subtype columns, IIRC we had discussed this <somewhere> but I can't seem to find it. Basically I said that for any subtype column, a blank value is not allowed to signify missing data. In particular for fixed or variable array data, missing values should be represented as null. So instead of using a blank field, use [null,null]. The logic behind this is that being "missing" really applies to just a scalar element.

Again, none of this is reflected in this PR, so I'll add text accordingly if this seems OK with you.

taldcroft commented 3 years ago

@mbtaylor - thinking some more, I do see the improved overall consistency in allowing a blank field to indicate all values are missing even for subtype columns. I'll confess that I'm partly worried about the astropy implementation, even though that should not be impacting how the spec is defined. I'm not sure at this point how difficult it would be to support.

Is your java version working now with this? What does it do for an empty field in a variable array? Give a zero-length array?

mbtaylor commented 3 years ago

@taldcroft yes, we did discuss this before at https://github.com/astropy/astropy/pull/11569#issuecomment-824199421, and I said I didn't have strong views about whether blank values should be permitted for array-valued columns.

However, on reflection I think blanks should be permitted here, as you say for consistency and also because I think people are going to want to be able to write data like that, for instance a Gaia column containing a flux time series where some rows have time series data and others don't.

Yes my java implementation handles this naturally. How the data is stored ... it depends, but as far as the API goes, the content of each cell is a java Object, and that can be returned as either an array object or a null value (the data for the whole column is not available in the API as a free-standlng data structure).

If it turns out to be hard or impossible to represent those blank values faithfully within the Astropy implementation I'd say that's OK, just provide the best representation of a blank value that's feasible, maybe a zero-length array or an array of all nan's or whatever. My representation of the ECSV data model is not going to be perfect either, e.g. I can't represent blank array elements of subtype int*[...] array-valued cells, so they'll just have to be zeros.

taldcroft commented 3 years ago

@mbtaylor - It turned out to be just a few lines of code to support blank values for masking data (plus tests of course, https://github.com/astropy/astropy/pull/11720), so I'll update the APE-6 PR accordingly. I did end up on a detour with https://github.com/numpy/numpy/issues/18981. Unfortunately in Python if you mask a variable length array with a blank, numpy does something funky to make it appear as an unmasked zero-length array instead of a masked value. We can live with that and it doesn't impact APE-6.

taldcroft commented 3 years ago

@mbtaylor - I pushed three new commits that you can review. The new missing values section has some slight waffling about whether a blank entry counts as a missing value. I was always bothered by not being able to represent a valid zero-length string, so for astropy invented an alternate way using a secondary column of the masks. This is nominally part of the (undocumented) astropy-2.0 schema. Anyway, suggestions welcome.

Hopefully this is converging. I think the astropy "bug fix" that implements all the missing value support is basically done. I put in support for missing values in variable length arrays this morning.

taldcroft commented 3 years ago

BTW, I was not able to understand that issue with the long literal line in the Header Details section. I stared at the source for a long time and AFAIK it was valid RST. For some reason it wasn't rendering right, so I took an end run and changed the words.

mbtaylor commented 3 years ago

I only have two remaining tiny niggles:

The Subtype data section refers to "three types of extended data that can be represented: fixed-dimension array data, variable-length array data, and object data". But the first two subsections are titled "fixed-length" and "variable-length" not "fixed-dimension" and "variable-length" array data. Make this terminology consistent by rewording text or subtitles?
The first sentence in the "Variable-length array data" section says "the data cell are arrays", should be "the data cells are arrays".

Otherwise, all looks good to go!

taldcroft commented 3 years ago

@mbtaylor - looking at this fresh I didn't like the "fixed-dimension" description. I changed to "multidimensional", which I hope is clear enough and definitely more common usage.

mbtaylor commented 3 years ago

I'm not sure that replacing "fixed-length" with "multidimensional" is an improvement, since "variable-length array" subtypes can be multidimensional too (and "multidimensional array" subtypes can be 1-dimensional).

But the terminology is consistent, and the explanatory text and examples are clear, so if you prefer it this way I'm happy to accept it.

taldcroft commented 3 years ago

@mbtaylor - thanks. In that case let's stick with what is there. Since @mhvk has approved and nobody else has commented since I posted on astropy-dev, I think we can go to the next step and I will request final approval from the rest of the Coordination Committee.

Thanks for the great input which has been immensely helpful. I'm pretty happy with where this has ended up now as a result.

eteq commented 3 years ago

The Coordination Committee discussed this today and have approved this APE. In the last commit I updated the info accordingly, and will now merge

mbtaylor commented 3 years ago

Excellent. Thanks a lot @taldcroft et al. for getting this done, I agree it's worked out well. One question: ECSV 0.9 had a DOI, will ECSV 1.0 get one, and where should I go looking for it?

taldcroft commented 3 years ago

@mbtaylor - the updated DOI will be added as part of the standard APE acceptance process, hopefully soon.

astropy / astropy-APEs

Update APE-6 for subtype support #68