DataJunction / dj

A metrics platform.
http://datajunction.io
MIT License
32 stars 14 forks source link

Remove pydantic dependency from python client #900

Open shangyian opened 5 months ago

shangyian commented 5 months ago

The DJ server should continue to use Pydantic for API input / response validation, but the python client does need to depend on pydantic and can instead use something from the standard library like dataclasses. This change is motivated by the fact that pydantic could conflict with other libraries in codebases that want to use this client, and the dependency is largely unnecessary for the client interface anyway.

anhqle commented 2 months ago

I can take this one. Is the motivation that the server will validate input anw, so the client can shirk this responsibility?

I see pydantic being used in 3 places:

  1. Do you mean replacing the pydantic models here with just dataclasses?

  2. There's some validator code to parse catalog when it's a dict: code. Remove or refactor? When is catalog a dict?

  3. ClientEntity(BaseModel). I suppose this can just be a regular class? That means users don't get early error during instance initialization, and only later when submit the API request, right? Probably fine since I expect most uses of the client is to submit the API request, not to construct/manipulate Node objects, anw.

shangyian commented 2 months ago

Thanks @anhqle!

Yeah, since the server will always validate input, I don't think we need this additional layer in the client.

Do you mean replacing the pydantic models here with just dataclasses?

Yep, we should be able to just use dataclasses. Or regular classes where appropriate.

There's some validator code to parse catalog when it's a dict: code. Remove or refactor? When is catalog a dict?

The list of available catalogs are stored in a database table, and oftentimes when it's returned by the DJ API, it will be a dict representation of the table row. The parsing that happens here is because most users don't care about the additional server-side info on the catalog that's provided in the dict; they just want the catalog name.

ClientEntity(BaseModel). I suppose this can just be a regular class? That means users don't get early error during instance initialization, and only later when submit the API request, right? Probably fine since I expect most uses of the client is to submit the API request, not to construct/manipulate Node objects, anw.

Yeah, let's convert this to a regular class. If we want an earlier layer of validation, we'll need to be very on top of updating the client to make sure that the data models are up-to-date with the latest server versions, so I tend to think it'll be easier to maintain if we keep most of the validation server-side.

anhqle commented 2 months ago

I took a crack at this. Other than the data validation, the current code makes use of several features from pydantic:

  1. export to dict. dataclass can do this leveraging dataclass.asdict(dict_factory)
  2. ingest from json. This is hard to do since the models can be nested, e.g. Source.__init__(columns: List[Column]). I'm not sure how to this.
shangyian commented 2 months ago

@anhqle good point, I forgot that we ingest from json.

@agorajek do you think it's actually useful from the user perspective that we have objects for some of the responses/nodes like Source or do you think we can get away with having the client just return dictionaries?

Then we could remove things like:

        source_node = Source(
            **response.json(),
            dj_client=self,
        )
anhqle commented 1 month ago

If the original motivation of removing pydantic 1.0 as a dependency so that the client can be imported by other libraries, one option is actually to upgrade pydantic from 1.0 to 2.x

shangyian commented 1 month ago

@anhqle I think part of the issue is that if the other library is using a version of pydantic that differs from this one, it'll have issues. So even if we upgrade (which makes sense), we'll still have potential compatibility issues. In the long run it's probably safer just to either remove or implement the json functionality within the lib, so that we're not dependent on pydantic.