epiverse-trace / datatagr

Tagging, validating, and safeguarding data to help harden data pipelines.
https://epiverse-trace.github.io/datatagr/

datatagr as the foundation for S3 classes inheriting from data.frames #3

Open Bisaloo opened 5 months ago

Bisaloo commented 5 months ago

We've had a couple of interesting discussions recently on S3 classes with:

In both cases, the S3 classes defined inherit from data.frames. This is convenient and desirable because it gives users a sense of familiarity, given how common data.frames are in the R ecosystem.

However, one major drawback of S3 is that there is no way to officially declare a class. This means users could end up with an invalid S3 object (as defined by the original packages) because they accidentally dropped required columns. It is important to have a mechanism that alerts users that the object is no longer valid as soon as it happens. Delaying the warning or error until a specific operation on the object requires the missing columns can only frustrate users: "when did my object stop being valid exactly?"

The tagging system introduced by the linelist R package provides a good solution to this issue. Tagged columns can be made required, and users are warned as soon as such a column is dropped. This is robust to all data wrangling operations and column renaming.
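
To make the failure mode concrete, here is a minimal base R sketch of the "warn as soon as it happens" idea. It does not use the linelist or datatagr API: `new_cases()`, `is_valid_cases()`, the `cases` class, and the column names are all hypothetical and only serve as an illustration.

```r
# Hypothetical constructor: a data.frame subclass that requires some columns.
new_cases <- function(x, required = c("id", "date_onset")) {
  stopifnot(is.data.frame(x), all(required %in% names(x)))
  attr(x, "required") <- required
  class(x) <- c("cases", class(x))
  x
}

# Hypothetical validator: TRUE if every required column is still present.
is_valid_cases <- function(x) {
  required <- attr(x, "required")
  !is.null(required) && all(required %in% names(x))
}

# Subsetting method that warns immediately when a required column is lost,
# instead of letting a later operation fail far away from the cause.
`[.cases` <- function(x, ...) {
  required <- attr(x, "required")
  out <- NextMethod()
  if (is.data.frame(out)) {
    # Subsetting may drop custom attributes, so restore the requirement.
    attr(out, "required") <- required
    lost <- setdiff(required, names(out))
    if (length(lost) > 0) {
      warning(
        "Required column(s) lost: ", paste(lost, collapse = ", "),
        call. = FALSE
      )
    }
  }
  out
}

df <- data.frame(
  id = 1:3,
  date_onset = as.Date("2024-01-01") + 0:2,
  age = c(30, 41, 25)
)
cases <- new_cases(df)

# Dropping `date_onset` triggers the warning at the point of the mistake.
smaller <- cases[, c("id", "age")]
is_valid_cases(smaller)
```

This sketch only tracks columns by name, so it would not survive renaming a required column, and it only covers `[`. Those are the gaps the tagging approach described above is meant to close.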

linelist itself is not the ideal solution because it focuses on a specific type of data (line list data), which may not match perfectly with the data needed in vaccineff / scoringutils / downstream packages.

The present datatagr R package, as a generalisation of linelist to generic data.frames, therefore provides the ideal foundation layer for packages that want to build S3 classes inheriting from data.frames with a safe validation system.

nikosbosse commented 5 months ago

This is really cool! Will look into it.

Bisaloo commented 5 months ago

To clarify: datatagr is still in the early stages of development, but I wanted to create this issue already because this new role of the datatagr package may inform its development (which @chartgerink is leading).

You can see some preliminary discussion at https://github.com/orgs/epiverse-trace/discussions/221, and look into linelist to see a more specific version of what datatagr may be.