fractal-analytics-platform / fractal-tasks-core

Main tasks for the Fractal analytics platform
https://fractal-analytics-platform.github.io/fractal-tasks-core/
BSD 3-Clause "New" or "Revised" License

Drop `TaskArguments` models, in favor of `validate_arguments` #377

Closed tcompa closed 1 year ago

tcompa commented 1 year ago

This issue is not yet well defined, but we will have to do something in this direction.

jluethi commented 1 year ago

Let's discuss this.

Are we talking about moving argument defaults from the Python function to the Pydantic models? In that case, the Python functions would lose a lot of value with respect to being usable standalone, and I wouldn't be a fan.

How do other libraries using pydantic handle that?

tcompa commented 1 year ago

TL;DR

Further review of this issue showed us a slightly different, and promising, way forward (all details below, as part of a verbose log); we are now trying to move from a basic example to something that can actually work. If it does work, then:

  • We fully remove the redundancy issue concerning types and defaults
  • We would need to find a workaround to include extra information in the JSON Schemas (e.g. argument descriptions), but that looks like a reasonable trade-off.


Are we talking about moving argument defaults from the Python function to the Pydantic models? In that case, the Python functions would lose a lot of value with respect to being usable standalone, and I wouldn't be a fan.

There's more than "just" the defaults (although I agree that removing the defaults from the function signature would be the most relevant change). The information that we can encode for each argument includes:


How do other libraries using pydantic handle that?

I'm not aware of someone using it the same way as we (plan to) do. This is partly because enforcing type hints at runtime is not a common practice in Python, and partly because we also want to use Pydantic for additional purposes (i.e. generating JSON Schemas).

For the moment:

  1. We use a Pydantic model to preprocess the arguments of a task function (this is similar to "enforcing type hints at runtime"):
    • We take an existing set of arguments (typically from a JSON file like args.json) and use it to create a TaskArguments instance args_instance.
    • This performs basic validation: if a required argument is missing, it raises an error.
    • This performs variable type casting: if an argument x="123" is a string but an integer is needed, it is converted to 123 (with an error raised if the casting fails).
    • This assigns default values (if present) to unset attributes.
    • If the instance was created without errors, we switch back to a Python dictionary (by using the `.dict()` method of `args_instance`), and use this dictionary as the set of arguments for the task function.
  2. We plan to also use the same TaskArguments Pydantic model to generate a schema compliant with the JSON Schema specification, through the schema() method of Pydantic v1 (renamed to model_json_schema in Pydantic v2).
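The preprocessing flow in point 1 can be sketched as follows. This is a minimal illustration with a hypothetical arguments model and made-up argument names (`input_path`, `level`), not the actual `TaskArguments` model used in fractal-tasks-core:

```python
from pydantic import BaseModel, ValidationError

# Hypothetical arguments model, for illustration only
class TaskArguments(BaseModel):
    input_path: str  # required: a missing value raises a ValidationError
    level: int = 0   # optional, with a default

# Arguments as they might arrive from an args.json file
raw_args = {"input_path": "/some/plate.zarr", "level": "2"}

args_instance = TaskArguments(**raw_args)  # validation + type casting
validated_args = args_instance.dict()      # back to a plain dictionary
# validated_args == {"input_path": "/some/plate.zarr", "level": 2}

# A missing required argument fails validation
try:
    TaskArguments(level=1)
except ValidationError as exc:
    print("validation failed:", exc.errors()[0]["loc"])
```

Note how the string `"2"` is cast to the integer `2`, the default for `level` would be filled in if it were unset, and a missing required argument is reported as an error.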

This two-fold purpose leads to the redundancy issue: where is each piece of information declared? In the function signature and/or in the Pydantic model?

If we stick with point 1, then in principle we could move to an even better solution, where we simply never declare a Pydantic model. This would be via the `@validate_arguments` decorator (note that it is technically a beta feature, but it has been there for 5 minor releases already):

The validate_arguments decorator allows the arguments passed to a function to be parsed and validated using the function's annotations before the function is called. While under the hood this uses the same approach of model creation and initialization; it provides an extremely easy way to apply validation to your code with minimal boilerplate.

This would immediately close this issue: all information would strictly belong to the function call signature, since there wouldn't even exist a Pydantic model. What is missing, however, is that we still need to generate the JSON Schema, and we'd still need a Pydantic model for that purpose.


In fact, I just realized that this model can still be extracted from a function decorated with Pydantic's validate_arguments, although this relies on a not-so-well-documented feature -- see example

from devtools import debug
from typing import Optional
from pydantic import validate_arguments

def fun(
    x: int,
    y: Optional[str],
    z: dict = {"a": "b"},  # mutable default, kept so that it shows up in the schema
):
    pass

debug(validate_arguments(fun).model.schema())

which outputs something similar to what we need

{
  "title": "Fun",
  "type": "object",
  "properties": {
    "x": {
      "title": "X",
      "type": "integer"
    },
    "v__duplicate_kwargs": {
      "title": "V  Duplicate Kwargs",
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "y": {
      "title": "Y",
      "type": "string"
    },
    "z": {
      "title": "Z",
      "default": {
        "a": "b"
      },
      "type": "object"
    },
    "args": {
      "title": "Args",
      "type": "array",
      "items": {}
    },
    "kwargs": {
      "title": "Kwargs",
      "type": "object"
    }
  },
  "required": [
    "x",
    "y"
  ],
  "additionalProperties": false
}

Note that apart from understanding and then removing some spurious properties (v__duplicate_kwargs, args, kwargs), this would be quite close to our target. It's true that we wouldn't be able to include e.g. descriptions (or any of the other pydantic.Field attributes), but we can likely find other places to define them. There would be an additional cost for including these extra details, but we would get the basic features (both type-hint enforcement and schema generation) defined in a single place.
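The removal of those spurious properties could be a simple post-processing step on the generated schema. The following is a sketch (the function name and approach are assumptions, not project code), applied here to a trimmed-down copy of the schema shown above:

```python
# Internal properties injected by validate_arguments that we don't want to expose
SPURIOUS_PROPERTIES = ("v__duplicate_kwargs", "args", "kwargs")

def strip_spurious_properties(schema: dict) -> dict:
    """Return a copy of the schema without pydantic's internal properties."""
    cleaned = dict(schema)
    cleaned["properties"] = {
        name: spec
        for name, spec in schema.get("properties", {}).items()
        if name not in SPURIOUS_PROPERTIES
    }
    return cleaned

# Trimmed-down version of the schema generated above
raw_schema = {
    "title": "Fun",
    "type": "object",
    "properties": {
        "x": {"title": "X", "type": "integer"},
        "v__duplicate_kwargs": {"title": "V  Duplicate Kwargs", "type": "array"},
        "args": {"title": "Args", "type": "array", "items": {}},
        "kwargs": {"title": "Kwargs", "type": "object"},
    },
    "required": ["x"],
    "additionalProperties": False,
}
cleaned_schema = strip_spurious_properties(raw_schema)
# cleaned_schema["properties"] keeps only "x"
```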

A comment about Pydantic version

The example above uses Pydantic v1, which is in a bug-fixes-only state, so it will keep working. Pydantic v2 will refactor this feature, but without removing it. Thus I'd say it's something we can safely rely upon.
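Looking ahead, a small compatibility shim could keep code working across both major versions. This is a sketch, not project code; it relies on the fact that Pydantic v2 reworks this feature as `validate_call`:

```python
import pydantic

# Pick the decorator available in the installed major version
if pydantic.VERSION.startswith("2"):
    from pydantic import validate_call as validate_decorator  # Pydantic v2
else:
    from pydantic import validate_arguments as validate_decorator  # Pydantic v1

@validate_decorator
def add_one(x: int) -> int:
    return x + 1

result = add_one("41")  # the string argument is cast to an integer
# result == 42
```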

tcompa commented 1 year ago

TL;DR

Further review of this issue showed us a slightly different, and promising, way forward (all details below, as part of a verbose log); we are now trying to move from a basic example to something that can actually work. If it does work, then:

  • We fully remove the redundancy issue concerning types and defaults
  • We would need to find a workaround to include extra information in the JSON Schemas (e.g. argument descriptions), but that looks like a reasonable trade-off.

As of 87a38b5, we have a first implementation of this new approach - which seems to work.

What is left (apart from further tests) is

tcompa commented 1 year ago

Re-opening, since this is not yet complete with #369 (see to-do list in previous comment)

tcompa commented 1 year ago

(closing since what was left is not part of #386)