Use Case 8: Discussion

brianthomas commented 9 years ago

I sat down to extract a requirement or two from this use case today and I am having difficulty doing so.

Is the argument in the use case that the API (for all libraries of the format) always needs to be able to respond with some clear indication of what features are supported by the library in use? Please clarify and/or correct me here. Sorry to be so thick about this.

mbtaylor commented 9 years ago

Sorry, the language is a bit tortured partly because I was thinking in terms of requirements in the first place but trying to phrase it as a use case, and also because there is a set of related concerns I was trying to address. In any case it's not what you said; I'm not talking here about characteristics of "the API", rather about the implementation work required by people writing their own, not necessarily general purpose, data access code.

The basic requirement could be phrased something like "It shall be possible to write natural and efficient code to access data stored in the format from any [reasonable] programming language". I'm tempted to add that it should be possible, and not unduly difficult, to write this code from scratch, i.e. without having to rely on a black box implementation that your code has to sit on top of.

What I'm really trying to get at here is that I'd like a format that resembles FITS in that it's possible, and preferably easy, for a human to read a document describing the format and then go off and write some code from scratch that accesses it. I am nervous about a format like HDF5 (or HDS) that really requires you to use a single official HDF5 library (a specification of the HDF5 file format internals does exist, but it's clearly not envisaged that people should attempt to roll their own data access libraries). Why is there a problem relying on the efficient and well-supported library that the HDF5 people have thoughtfully provided? Some languages (Java, JavaScript) do not talk very happily to C (it's not impossible, but there are good reasons to avoid it if you can), or you might be running on an OS for which the library is not available, or the library might become unsupported.

So what I'd really like is "The format serialisation shall be simple enough that you can write access code from scratch".

The other thing I was trying to get at is that if you're trying to write data access code in this way, and you only want to do a simple job (e.g. read numeric data from a 2-d array) it should be possible to do that without the code having to do a lot of work unrelated to that job (for instance parse WCS metadata from a related or unrelated data element in the same file). Maybe nobody would design a format without that characteristic anyway and it doesn't need saying.
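To make the "simple job" concrete, here is a minimal sketch of what from-scratch access looks like for FITS, using only the Python standard library. It reads the primary 2-d array and looks at nothing but `BITPIX` and the `NAXIS` keywords; every other header card (WCS or otherwise) is stored as an uninterpreted string and never examined. The function names and the tiny in-memory demo file are invented for illustration, and the sketch deliberately ignores `BZERO`/`BSCALE` scaling and extensions, so it is not a complete reader.

```python
import io
import struct

BLOCK = 2880  # FITS files are organised in 2880-byte blocks
CARD = 80     # each header card is 80 ASCII characters

# struct type codes for the BITPIX values defined by the FITS standard
BITPIX_CODES = {8: "B", 16: "h", 32: "i", 64: "q", -32: "f", -64: "d"}


def read_primary_header(stream):
    """Return {keyword: raw value string} for the primary header."""
    header = {}
    while True:
        block = stream.read(BLOCK)
        if len(block) < BLOCK:
            raise ValueError("truncated FITS header")
        for i in range(0, BLOCK, CARD):
            card = block[i:i + CARD].decode("ascii")
            keyword = card[:8].strip()
            if keyword == "END":
                return header
            if card[8:10] == "= ":
                # Drop any trailing "/ comment"; adequate for numeric values.
                header[keyword] = card[10:].split("/")[0].strip()


def read_2d_array(stream):
    """Read the primary 2-d array as a list of rows, ignoring other metadata."""
    header = read_primary_header(stream)
    if int(header["NAXIS"]) != 2:
        raise ValueError("expected a 2-d primary array")
    width, height = int(header["NAXIS1"]), int(header["NAXIS2"])
    code = BITPIX_CODES[int(header["BITPIX"])]
    fmt = ">" + str(width) + code  # FITS data is big-endian; NAXIS1 varies fastest
    row_size = struct.calcsize(fmt)
    return [list(struct.unpack(fmt, stream.read(row_size))) for _ in range(height)]


def make_card(keyword, value):
    """Build one 80-character header card (demo helper, fixed-format values only)."""
    return f"{keyword:<8}= {value:>20}".ljust(CARD).encode("ascii")


def make_demo_file():
    """Build a tiny in-memory FITS file: a 3x2 array of 16-bit integers."""
    cards = [
        make_card("SIMPLE", "T"),
        make_card("BITPIX", 16),
        make_card("NAXIS", 2),
        make_card("NAXIS1", 3),
        make_card("NAXIS2", 2),
        make_card("CRVAL1", "1.5"),  # WCS-style card the reader never interprets
        b"END".ljust(CARD),
    ]
    header = b"".join(cards).ljust(BLOCK)  # header blocks are space-padded
    data = struct.pack(">6h", 1, 2, 3, 4, 5, 6).ljust(BLOCK, b"\0")
    return io.BytesIO(header + data)
```

Calling `read_2d_array(make_demo_file())` returns the two rows `[1, 2, 3]` and `[4, 5, 6]`. The point is the size of the job: the reader needs the block size, the card layout and a handful of keywords, and nothing else in the file gets in its way.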

Sorry the explanation is even more involved than the original use case. If you think that makes enough sense to turn it into one or more requirements please do, otherwise I'm happy to discuss it more.

brianthomas commented 9 years ago

@mbtaylor : OK I think I get the gist of what you are saying now. The statement:

"The format serialisation shall be simple enough that you can write access code from scratch"

Might be a requirement, but it needs to be more quantitative. It's hard to quantify what "..simple enough.." means. I'm sure it differs between individuals depending on their talent. Perhaps I am (again) reading into this, but maybe what you are getting at is that the serialization needs to be "open" (e.g. the HDF5 approach is opaque), and therefore the specification, documentation and ownership should be "open source" (let's not argue licenses just yet). A requirement along these lines might be:

"No proprietary code shall be required in order to serialize the data. All aspects of the serialization shall be documented and open source licensed"

What do you think?

As for:

The other thing I was trying to get at is that if you're trying to write data access code in this way, and you only want to do a simple job (e.g. read numeric data from a 2-d array) it should be possible to do that without the code having to do a lot of work unrelated to that job (for instance parse WCS metadata from a related or unrelated data element in the same file).

I tend to think the solution here is that the parts of the overall data model need to be adequately namespaced. In this fashion the parser may be able to recognize the bits of the data model it knows about, and those it doesn't, freely choosing to either skip over the unknown parts, throw warnings or barf depending on how the implementer of the parser feels about this.
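As a rough sketch of the namespacing idea, the following uses XML namespaces purely for illustration (nothing in this thread fixes the serialization as XML). The namespace URIs, element names and the `on_unknown` policy argument are all hypothetical. The parser dispatches on the namespace of each top-level element, handles the ones it has a handler for, and skips, warns about, or rejects the rest according to the implementer's chosen policy.

```python
import warnings
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, invented for this sketch.
ARRAY_NS = "http://example.org/ns/array"
WCS_NS = "http://example.org/ns/wcs"

DOC = f"""<file xmlns:a="{ARRAY_NS}" xmlns:w="{WCS_NS}">
  <a:array shape="2 3">1 2 3 4 5 6</a:array>
  <w:transform type="tan">placeholder</w:transform>
</file>"""


def parse_array(elem):
    """Handler for the one part of the data model this parser understands."""
    rows, cols = (int(n) for n in elem.get("shape").split())
    values = [int(v) for v in elem.text.split()]
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]


# Map each known namespace to the function that parses its elements.
HANDLERS = {ARRAY_NS: parse_array}


def namespace_of(elem):
    """ElementTree stores qualified tags as '{uri}local'; return the uri part."""
    return elem.tag[1:].split("}")[0] if elem.tag.startswith("{") else ""


def parse_known(xml_text, on_unknown="skip"):
    """Parse recognised elements; treat the rest according to on_unknown."""
    results = []
    for child in ET.fromstring(xml_text):
        handler = HANDLERS.get(namespace_of(child))
        if handler is not None:
            results.append(handler(child))
        elif on_unknown == "warn":
            warnings.warn(f"skipping unrecognised element {child.tag}")
        elif on_unknown == "error":
            raise ValueError(f"unrecognised element {child.tag}")
        # on_unknown == "skip": silently ignore the element
    return results
```

With the default policy, `parse_known(DOC)` returns the array and never looks inside the WCS element; with `on_unknown="error"` the same document is rejected.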

mbtaylor commented 9 years ago

"No proprietary code shall be required in order to serialize the data. All aspects of the serialization shall be documented and open source licensed"

Documentation is necessary, but in practice not sufficient. The HDF5 file format is comprehensively documented at the byte level but the format is just very complicated (it has many options), so it would be a huge effort to write a new implementation. When an HDF Group engineer attended a FITS BOF by videolink a couple of years back I raised the question of whether it would be feasible to implement a pure Java HDF5 reader and he basically said, don't try that. I think the basic reason is that since the developers have always assumed that people will use the standard library, the cost of adding many options for data type, data organisation, compression format, etc. is low, but if it comes to reimplementation (e.g. the relevant support libraries may not exist in the language of choice) it can be much higher.

Really what I'm trying to do is to flag this up as something that people should think about rather than posit a cast-iron requirement - it may be that the benefits of HDF5 or some other serialization system are so great that this consideration has to go by the board. From that point of view the fact that it's not quantitative doesn't matter so much, but I appreciate for the process that you're trying to follow here it doesn't fit well.

Is it any better to say "There shall be no serious impediments to writing from scratch code to access the serialization"? Or "It shall be feasible to write..."?

mdboom commented 9 years ago

I think it's probably worth separating:

"The format serialisation shall be simple enough that you can write access code from scratch"

from

"No proprietary code shall be required in order to serialize the data. All aspects of the serialization shall be documented and open source licensed"

The first is about the technical effort involved in a reimplementation, the second is about the legality of a reimplementation. HDF5, for example, doesn't prohibit the latter, but it makes the former difficult. And I think it's easier to draw a hard line on the legal issues (and use the word MUST there) than on the technical ones.

embray commented 9 years ago

@mdboom beat me to pretty much the same point. I'll add that "from scratch" is very vague terminology, and seems to preclude the use of existing technology in writing a serialization. For example if XML were used in the format, writing a good XML parser/writer is obviously a non-trivial effort, but of course these already exist for most widely used languages and can/should be leveraged in an implementation. (In the case of XML one can hand-roll one's own code to output XML using write() calls, but this sort of attitude should be strongly discouraged. FITS appears simple enough that many people over the years have felt they could hand-write their own FITS files, leading to slews of invalid FITS in the wild.)
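A small illustration of why hand-rolled output goes wrong, sketched with Python's standard `xml.etree.ElementTree`. The `keyword` element and its `name` attribute are hypothetical, chosen only for the example. The hand-written version forgets to escape markup characters, so a perfectly ordinary value produces a file that no conforming parser will accept, while the library version escapes it and round-trips.

```python
import xml.etree.ElementTree as ET


def handwritten(name, value):
    """Naive string formatting, as with bare write() calls: no escaping."""
    return f'<keyword name="{name}">{value}</keyword>'


def with_library(name, value):
    """Let the XML library build and serialise the element."""
    elem = ET.Element("keyword", {"name": name})
    elem.text = value
    return ET.tostring(elem, encoding="unicode")


def is_well_formed(text):
    """True if the text parses as XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


value = "S/N < 5 & flagged"
bad = handwritten("COMMENT", value)
good = with_library("COMMENT", value)
```

Here `is_well_formed(bad)` is false because of the raw `<` and `&`, while `good` parses and its text comes back identical to `value`.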

embray commented 9 years ago

@mbtaylor

The other thing I was trying to get at is that if you're trying to write data access code in this way, and you only want to do a simple job (e.g. read numeric data from a 2-d array) it should be possible to do that without the code having to do a lot of work unrelated to that job (for instance parse WCS metadata from a related or unrelated data element in the same file).

That part I agree on. I think it should be simple enough to access at least parts of the data without having to write a complete implementation that can handle any conceivable data that can go into a file (WCS, etc.). I think the ability to do this with FITS is a strength (albeit one that is undercut by the lack of good data model specifications; i.e. instructions for how to interpret all data in the file if one wanted to).

brianthomas commented 9 years ago

Not sure where we are with this now. Was there any agreement about requirements we can extract here? Taking up the 2nd first, a requirement built around partial data access appears to be supported by all. Perhaps that requirement can look like:

The serialization of an instance of the data format shall be written so that a partial parser can be written to pick out only those portions which it finds relevant.

That wording however seems to be a bit "use casey". Alternatively, I might prefer something like:

The serialization of an instance of the data format shall support declaration of semantic content via namespace mechanism.

Does that handle the "easily write a parser from scratch" need then? As for the first issue concerning licensing, is there any agreement that this would be a good requirement (I lean towards yes, but am willing to be convinced otherwise), i.e. essentially these two requirements:

"No proprietary code shall be required in order to serialize the data."

and

"All aspects of the serialization shall be documented and open source licensed"

What did I miss?

mbtaylor commented 9 years ago

I agree that this is a good requirement:

"All aspects of the serialization shall be documented and open source licensed"

However the additional formulation in terms of "proprietary code" doesn't really cover my concern - what bothers me is not whether the library provided by the HDF group is proprietary or not (actually I don't know whether it is), it's that you are in practice required to use that particular implementation, so if there's some reason that it's not suitable (e.g. platform compatibility) then you've got a problem.

I agree with @embray that my earlier "from scratch" formulation is not very useful terminology.

So how about:

"Data access shall not require use of any particular item of implementation code"

For the other part (partial parsing) I'm not sure I can come up with anything more succinct than @brianthomas's, so I'm happy with either or both of those.


brianthomas commented 9 years ago

I've tried to capture

"All aspects of the serialization shall be documented and open source licensed"

in a set of related requirements I'm currently calling "Open Access Policy", e.g.

https://github.com/astrodataformat/requirements/wiki/Requirement-14

Please take a look and comment. I think we could easily descend into the weeds on all related requirements in terms of the legalese to make this really work; I'd hope we can just gloss the main points.

brianthomas commented 9 years ago

And this bit also appears as sub-requirement 14.6

"Data access shall not require use of any particular item of implementation code"

brianthomas commented 9 years ago

I've added requirement 15 for namespacing functionality

https://github.com/astrodataformat/requirements/wiki/Requirement-15

and the partial parsing in Requirement 16:

https://github.com/astrodataformat/requirements/wiki/Requirement-16

Please take a look and comment

mbtaylor commented 9 years ago

Thanks, I think that covers it.

One point though: 14.3 says "The official API of the format shall be open source licensed.", which seems to presuppose that there is an Official API. An official API might be a good idea, but one can imagine formats (e.g. FITS) without any such thing. So I'd suggest rewording as "Any official API...". Existence of an official API could be formalised as a separate Requirement (that people might or might not agree with) if you want to encourage consideration of that question.

brianthomas commented 9 years ago

@mbtaylor Yeah, good point. I will make that change. I think we probably should have a separate requirement for "official API" or the like, and possibly a Use Case to go with it (Use Case 8 doesn't seem to call that out per se).

As to whether it's a good idea or not to have an official API, I am unsure. Possibly so. It would seem at first thought that the API would probably have to be tied to a particular language, and if that is the case it starts to seem a less good idea (which languages are 'official'? How long would we be supporting the API for a given language, and so on).

astrodataformat / usecases

Use Case 8: Discussion #9