astrodataformat / usecases

Paper 2 use cases and requirements

Use Case 6: discussion #7

Open msdemlei opened 9 years ago

msdemlei commented 9 years ago

Unless I misunderstand this use case, it proposes to allow embedding some sort of executable code into the data format.

If that is true, I believe the use case should be dropped, for at least the following reasons:

(1) Security concerns: Even if you "sandbox" whatever code is executing (which, of course, makes it more likely that the format's execution facilities in the end will be too slow or restricted to be generally useful), it's still going to be hard to control what apparently innocuous files actually do (see Adobe's pain with Javascript in PDF).

(2) Ease of implementation: If we allow something like this, all conforming implementations will have to include an interpreter for whatever code this turns out to be. This will typically be a major effort (or at least dependency) that's going to hurt adoption (not to mention security concerns again). On the other hand, I've always wanted to write a FORTH machine...

(3) Complexity considerations: As file formats are always at the "edges" of computer systems, it's great if they are "verifiable" in some sense (e.g., checking validity with a context free grammar). This feature is deep, deep within the land of Turing complete languages with all the related problems ("will this image halt?"). That's a fairly fundamental flaw for something that sounds like a fairly exotic application that would probably better be solved by local convention (a pipeline manual might state: "look for the chunk labeled 'foo-execute', check for foo's signature via the foo-signature chunk, and then just do it").

embray commented 9 years ago

I agree--if nothing else this use case needs clarification and/or narrowing. There is a narrow sense in which this might be useful: for WCS, or possibly other data reduction uses, it would be possible to embed simple instructions for sequences of transformations and arithmetic functions to perform on some data in the file. But as currently written this use case reads to me like stored procedures, as in a database, and that I think we want to avoid.

brianthomas commented 9 years ago

No, the intention here was not to embed any executable or compiled code; it's basically what Erik wrote above. The idea was that some restricted set of notation/instructions would be adopted in the standard so that some parts of the data could be algorithmically described (and generated). Libraries, regardless of actual implementation language, would have to support parsing and executing the instruction set.
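
As a rough sketch of what such a restricted instruction set might look like (all names here are hypothetical, not part of any proposed spec), the file would carry only declarative steps drawn from a fixed whitelist of operations, and the reading library would evaluate them:

```python
# Minimal sketch (hypothetical, not from any spec) of a restricted,
# non-executable instruction set: a fixed whitelist of operations that a
# library evaluates over array data, rather than embedded program code.
import numpy as np

# Whitelisted elementary operations; anything outside this table is rejected.
OPS = {
    "add":   np.add,
    "mul":   np.multiply,
    "scale": lambda x, c: x * c,
    "log10": np.log10,
}

def evaluate(instructions, data):
    """Apply a stored sequence of (opname, *args) steps to `data`."""
    result = np.asarray(data, dtype=float)
    for step in instructions:
        name, *args = step
        if name not in OPS:
            raise ValueError(f"operation {name!r} not in the allowed set")
        result = OPS[name](result, *args)
    return result

# Example: a file might carry this recipe as plain metadata, e.g. a simple
# bias/gain correction described declaratively instead of as executable code.
recipe = [("add", -100.0), ("scale", 1.5), ("log10",)]
print(evaluate(recipe, [200.0, 300.0, 1100.0]))
```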

embray commented 9 years ago

This one does need to be handled with care though. If we allow mathematical transformations on image data, why not allow, say, virtual tables created from joins of other tables, or other such database-like operations? I don't think we should have such a requirement, but why do we privilege one type of embedded data transformation over another? I'm not sure how to write this use case in such a way that addresses that slippery slope.

msdemlei commented 9 years ago

Hi,

On Mon, Jan 12, 2015 at 02:20:47PM -0800, Erik Bray wrote:

This one does need to be handled with care though. If we allow mathematical transformations on image data, why not allow, say, virtual tables created from joins of other tables, or other such database-like operations? I don't think we should have such a

I'd maintain it's a question of the type of machine required to execute the embedded specifications, and I'd say we should be "below Turing" in some sense.

Now, for use cases like the specification of generalised transforms, common mathematical expressions would evidently be required, and such expressions are, in themselves, probably not computable by pushdown automata -- but I'd not be worried about these; accepting "normal math" as elementary operations doesn't look dangerous to me.

Loops and function definitions are an entirely different beast. The difference essentially is that it's easy to reason about what the expressions do, whereas with loops and recursion it's at least hard and in general impossible. Conditionals are a bit in between, but something like SQL's CASE should be useful for many important expressions (splines, say) while probably not poisoning the language with computability problems.
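
To make the distinction concrete (this is just an illustration, not a proposed syntax): a CASE-like piecewise expression selects between fixed sub-expressions, so evaluating it always terminates in a bounded number of steps, unlike a construct with loops or recursion:

```python
# Illustrative only: a piecewise ("CASE"-style) expression of the kind
# described above, e.g. two polynomial pieces of a crude spline:
#   CASE WHEN x < 1 THEN x**2 ELSE 2*x - 1 END
# No loops or recursion are needed, so evaluation is trivially guaranteed
# to terminate.
import numpy as np

x = np.linspace(0.0, 2.0, 9)
y = np.where(x < 1.0, x**2, 2.0 * x - 1.0)
print(y)
```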

The bottom line is that if we come up with a spec on this, we should find some computability experts and ask them for their opinions...

Cheers,

     Markus

nxg commented 9 years ago

This approach, as Brian glosses it, is similar to the approach used by the AST library. That library provides general WCS support not by listing a number of algorithms and parameters, in the style of FITS-WCS, but by implementing a collection of general transformations which can be composed to provide complicated transformations on the data (and several of which are precomposed to provide the standard WCS mappings). That manifestly works in that case, and it's easy to see how it might work for a more general case of data transformations.

The transformations are specified within NDF files in (if I recall correctly; it's been a while) a not terribly readable form. One could imagine a little language which articulated them in a more naturally editable form.
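
The composition idea can be sketched in a few lines (this is not the AST library's actual API, just an illustration of the principle): a handful of general mappings that can be chained to build up a complicated transformation, instead of enumerating named algorithms.

```python
# Sketch (not AST's real API) of composable mappings: elementary
# transformations chained into a compound pixel -> world mapping.
import numpy as np

class Shift:
    def __init__(self, offset):
        self.offset = np.asarray(offset, dtype=float)
    def __call__(self, coords):
        return coords + self.offset

class Scale:
    def __init__(self, factor):
        self.factor = float(factor)
    def __call__(self, coords):
        return coords * self.factor

class Compose:
    """Apply a sequence of mappings, analogous to a compound mapping."""
    def __init__(self, *mappings):
        self.mappings = mappings
    def __call__(self, coords):
        for m in self.mappings:
            coords = m(coords)
        return coords

# A pixel -> world-like mapping built from elementary pieces.
pix2world = Compose(Shift([-512.0, -512.0]), Scale(0.05))
print(pix2world(np.array([512.0, 612.0])))
```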

brianthomas commented 9 years ago

I've tried to re-write this use case a little based on the discussion here. I've dropped the idea of generation of theory datasets from the use case and focused it more on tabular and image transformation/value generation. Please take a look and give feedback. I expect we'll still need to iterate.

mdboom commented 9 years ago

In my view, use case #6 still reads more like a feature in search of a use case than a use case. It would be helpful to understand the reasons why such a feature would be important, and why it must be part of a storage file format.

I understand why descriptions of coordinate transformations are essential: they allow mapping between logical and physical coordinates without the problems that come with resampling the data. It could be done with a fixed lookup table (and HST had a history of that in some cases), but being able to tweak the knobs of the transformation has proven very useful.
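
For instance (not from the thread, just an illustration of the lookup-table-versus-knobs point), the same pixel-to-wavelength mapping can be frozen as a table of values or carried as a small parameterised description that can be refined later without resampling:

```python
# Illustration only: fixed lookup table vs. a parameterised ("tweakable")
# description of the same pixel -> wavelength mapping.
import numpy as np

pixels = np.arange(5)

# Fixed lookup table: every value is frozen in the file.
lookup = np.array([5000.0, 5002.1, 5004.2, 5006.3, 5008.4])

# Parameterised form: two knobs that can be adjusted after a better
# calibration, without touching the underlying data.
zero_point, dispersion = 5000.0, 2.1
parametric = zero_point + dispersion * pixels

print(lookup)
print(parametric)
```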

I'm not as sold on the reasons why algorithmically-generated data must be specified in the file format, rather than as an adjunct tool or extension for that purpose. Particularly given that the file format will support the storage of structured metadata, one could store a procedure in the file that could be understood by some domain-specific tool in the future. I don't think the file format should require anything like this, as it adds significantly to the implementation burden and has the potential to create many more security holes where otherwise there would be few.

brianthomas commented 9 years ago

There appears to be some confusion here still. Another attempt to explain this: allowing simple mathematical formulae to describe the data is a good thing, if only from the standpoint of compression of large datasets. It also promotes long-term understanding of the data, since you can see succinctly what the underlying formula (and perhaps scientific principle, as applicable) is behind that portion of the dataset.
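
As a hypothetical illustration of the compression argument: a regularly sampled axis can be stored as a few numbers plus a stated formula rather than as an explicit array, and regenerated exactly when the file is read.

```python
# Hypothetical example of formula-as-compression: the file would carry a
# tiny description ("wavelength[i] = start + i * step") instead of a
# million explicitly stored values.
import numpy as np

description = {"start": 4000.0, "step": 0.25, "n": 1_000_000}

# Regenerate the full axis on read.
wavelengths = description["start"] + np.arange(description["n"]) * description["step"]
print(wavelengths[:3], wavelengths[-1])
```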

I'm not as sold on the reasons why algorithmically-generated data must be specified in the file format, rather than as an adjunct tool or extension for that purpose.

I'd be all for more complex generation of data sitting in an (optional, outside of the spec) plugin which has compiled code. Where the line is between "simple mathematical formulae" and "complex generation" is another matter.