Store function metadata in a machine readable format

asmeurer commented 4 years ago

It would be useful for the test suite to have the function metadata stored in a machine readable format. Currently I am parsing the function signatures from the spec files using some regular expressions, and I will probably end up parsing some other information such as types as well. This works fine for now, but it would be cleaner if this data were stored in a machine readable format, say in JSON, and the relevant parts of the spec documents generated from that automatically.
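
For context, the regex-based approach described above might look roughly like the sketch below. This is not the test suite's actual parser; the pattern and the `parse_signature` helper are hypothetical, and the code assumes one signature per line in the form `add(x1, x2, /)`.

```python
import re

# Hypothetical pattern: a function name followed by a parenthesized
# parameter list, e.g. "add(x1, x2, /)".
SIGNATURE_RE = re.compile(r"^(?P<name>[A-Za-z_]\w*)\((?P<params>.*)\)$")


def parse_signature(line):
    """Split a signature string into a name and a list of parameter tokens."""
    match = SIGNATURE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a function signature: {line!r}")
    # Naive comma split: this breaks on defaults that contain commas,
    # which is one reason a structured format would be cleaner.
    params = [p.strip() for p in match.group("params").split(",") if p.strip()]
    return match.group("name"), params


print(parse_signature("add(x1, x2, /)"))
```

The naive comma split is exactly the kind of fragility that storing the data as JSON would avoid.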

To be sure, not everything in the spec needs to be in JSON, just the parts that will need to be extracted for other things as well, such as the test suite. There should still be a lot of plain English descriptions of behavior.

This is likely too much work for version 1 given that we already have things inline in the Markdown, but it's something to consider for future iterations.

saulshanabrook commented 4 years ago

That makes sense to me. I wanted to highlight one of the existing JSON formats I am using for python-record-api.

Minimal example generated from this file: https://github.com/data-apis/python-record-api/blob/master/data/api/sample-usage.json

The format is specified and documented as pydantic models, which make it easy to serialize Python objects to JSON and deserialize them back: https://github.com/data-apis/python-record-api/blob/006faf0bba9cd4cb55fbacc13d2bbda365f5bf0b/record_api/apis.py#L69

For the "leaf nodes" of actual types I also built some pydantic models for different kinds of types: https://github.com/data-apis/python-record-api/blob/006faf0bba9cd4cb55fbacc13d2bbda365f5bf0b/record_api/type_analysis.py#L74. Ordinary Python instances are saved by type name alone, while generic types (lists, tuples, etc.) and literal types (strings) get special handling.

asmeurer commented 4 years ago

I don't want to get bogged down in a metaconversation on the "right" way to specify types for array functions. Any specification is fine, as long as it is machine readable. We could consider the JSON as an internal document and not part of the actual spec (i.e., the schema could change between minor spec versions). Some sorts of things that I could imagine wanting to parse here for the tests are:

- The function name and signature. I'm happy for this to just be something like add(x1, x2, /), though if we want to split out the parameters that's fine too.
- The top-level type of each argument (array, floating point scalar, boolean, etc.)
- For those that are arrays:
  - valid dtypes
  - valid shapes
  - broadcastability requirements with other input parameters
  - valid domain of inputs (if not all input values are allowed, e.g., sqrt behavior in the spec is only defined for nonnegative inputs)
- Same for the return type
- Example inputs and outputs
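
To make the list above concrete, here is a minimal sketch of what one such record could look like and how a test suite might read it. Every field name (`kind`, `dtypes`, `broadcasts_with`, and so on) is an assumption made for illustration, not a proposed schema, and the values are placeholders.

```python
import json

# Hypothetical record for a single function; field names are illustrative only.
ADD_METADATA = json.loads("""
{
  "name": "add",
  "signature": "add(x1, x2, /)",
  "parameters": [
    {"name": "x1", "kind": "array", "dtypes": "numeric"},
    {"name": "x2", "kind": "array", "dtypes": "numeric",
     "broadcasts_with": ["x1"]}
  ],
  "returns": {"kind": "array", "dtype": "promoted"},
  "examples": []
}
""")


def array_parameters(record):
    """Return the names of the parameters whose top-level type is an array."""
    return [p["name"] for p in record["parameters"] if p["kind"] == "array"]


def broadcast_pairs(record):
    """Return (parameter, other) pairs that must be broadcast-compatible."""
    return [
        (p["name"], other)
        for p in record["parameters"]
        for other in p.get("broadcasts_with", [])
    ]


print(array_parameters(ADD_METADATA))
print(broadcast_pairs(ADD_METADATA))
```

A test generator could loop over records like this one and emit one dtype test and one broadcasting test per function, falling back to hand-written tests wherever the record cannot express the constraint.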

If you already have some thoughts on the right way to specify these sorts of things, that's great, and we should use it. But I don't want to wait on a meta decision on how to specify types. My main motivation here is to make it so I can generate as much of the test suite automatically from the spec as possible, so that it's easier to keep them in sync.

asmeurer commented 4 years ago

It's also fine if we can't represent some corner cases, at least to begin with. For example, we might not be able to represent valid shapes for something like matmul (it isn't in the spec yet but I think it might be added), but it's fine if I have to hard-code that as long as the shape information works for the majority of other functions.

leofang commented 3 years ago

I am revisiting this issue because I have run into a similar need. In addition to updating docstrings (#180), we also need this metadata to populate, say, the TOC of a doc page. Currently in CuPy I am using .. automodule:: to let Sphinx parse all functions under the array_api namespace. This works, but it's not ideal, as I can't control the order in which the functions appear (Sphinx sorts them alphabetically). If the metadata were provided, I could group them on demand based on the nature of the API (creation, statistics, linalg, etc.).
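
As a rough sketch of the grouping this would enable, assume each metadata entry carried a `category` field. That field, the sample entries, and the `group_by_category` helper are all hypothetical.

```python
# Hypothetical metadata entries; the "category" field is an assumption.
FUNCTIONS = [
    {"name": "zeros", "category": "creation"},
    {"name": "mean", "category": "statistics"},
    {"name": "arange", "category": "creation"},
    {"name": "matmul", "category": "linalg"},
]


def group_by_category(functions):
    """Group function names by category, keeping the order in the metadata."""
    groups = {}
    for entry in functions:
        groups.setdefault(entry["category"], []).append(entry["name"])
    return groups


for category, names in group_by_category(FUNCTIONS).items():
    print(category, names)
```

Because Python dictionaries preserve insertion order, both the categories and the functions within each one appear in the order the metadata lists them rather than alphabetically.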

asmeurer commented 3 years ago

I should mention that in the test suite I am parsing parts of the spec and populating some function stubs: https://github.com/data-apis/array-api-tests/tree/master/array_api_tests/function_stubs. Feel free to reuse these for your implementation, or use them to extract a manual list of functions. The dictionaries at the top of test_type_promotion.py may also be useful if you plan to restrict input dtypes like the NumPy implementation does. To be clear, implementations do not need to be minimal like this: we did so for the NumPy one because it is a reference implementation, but dtype restrictions are not required by the spec.

asmeurer commented 1 year ago

Things that it would be useful to have structured data for:

- Input and output range (see for example asin)
- Special cases
- Input dtypes
- Output dtype (e.g., "promoted" or "boolean")
- Input/output shapes (especially for linear algebra functions)
- Output data for functions that return named tuples
- For each of the above, whether it is required or only suggested
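
One possible way to carry the "required or only suggested" flag alongside each item is sketched below using standard-library dataclasses. The `Property` class, its field names, and the `example_fn` record are hypothetical placeholders rather than anything taken from the spec.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Property:
    kind: str       # e.g. "input_dtype", "output_dtype", "special_case"
    value: str      # free-form description of the constraint
    required: bool  # True if required, False if only suggested


# Hypothetical record for a made-up function; the values are placeholders.
EXAMPLE = {
    "name": "example_fn",
    "properties": [
        Property("input_dtype", "floating-point", required=False),
        Property("output_dtype", "promoted", required=True),
    ],
}


def required_properties(record):
    """Return only the properties that an implementation must satisfy."""
    return [p for p in record["properties"] if p.required]


def to_json(record):
    """Serialize a record, including its dataclass properties, to JSON."""
    payload = {
        "name": record["name"],
        "properties": [asdict(p) for p in record["properties"]],
    }
    return json.dumps(payload, indent=2)


print(to_json(EXAMPLE))
```

A test suite could then fail on violated required properties and merely warn on suggested ones.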

We already have effectively structured data for the signatures and type annotations.

Like I said, there should also be room for plain-text notes, as there will always be things that don't fit into the existing schemes, and we also want the ability to add things like motivations and implementation notes.

data-apis / array-api

Store function metadata in a machine readable format #49