bigbio / sdrf-pipelines

A repository to convert SDRF proteomics files into pipelines config files
Apache License 2.0

Proposal for refactoring & higher maintainability #159

Status: Open · opened by di-hardt 6 months ago

di-hardt commented 6 months ago

Hey,

I noticed the SDRF validator is a bit difficult to maintain and to understand in its current state, as it is one long Python script. To make it more maintainable and extensible for future changes and integrations, I would like to propose a refactoring and, at the same time, the integration of a data validation framework such as Pydantic.

Pydantic basically adds validators to class attributes via so-called Field objects. It already supports basic validators such as decimal and length constraints, and there is the option to implement custom ones, e.g. checking the validity of ontology terms.

There are multiple options for how to implement this. My suggestion is to create one Pydantic model for each SDRF template record type (e.g. cell line, human, plant, ...), where a record corresponds to one line in the SDRF. A sketch of such a record model follows.
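To make the idea concrete, here is a minimal sketch of such a record model, assuming Pydantic v2; `CellLineRecord`, the column set, and the `KNOWN_CELL_LINES` lookup are hypothetical illustrations, not the actual SDRF schema:

```python
from pydantic import BaseModel, Field, field_validator

# Placeholder for a real ontology lookup (e.g. querying an ontology service).
KNOWN_CELL_LINES = {"HeLa", "HEK293"}

class CellLineRecord(BaseModel):
    source_name: str = Field(min_length=1)
    organism: str = Field(min_length=1)
    cell_line: str

    @field_validator("cell_line")
    @classmethod
    def cell_line_must_be_known(cls, value: str) -> str:
        # Custom validator, e.g. checking the validity of ontology terms.
        if value not in KNOWN_CELL_LINES:
            raise ValueError(f"unknown cell line ontology term: {value!r}")
        return value
```

Constructing `CellLineRecord(source_name="sample 1", organism="Homo sapiens", cell_line="HeLa")` succeeds, while an unknown cell line raises a `ValidationError` that names the offending attribute.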

Pydantic already supports JSON, which would fit well with the plans for a JSON representation described in https://github.com/bigbio/proteomics-sample-metadata/issues/696. I have also already looked into automatically parsing CSV/TSV files; a sketch follows.
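As a rough sketch of what that could look like, reusing the hypothetical `CellLineRecord` from above and assuming the TSV headers have already been mapped onto the model's field names:

```python
import csv

def parse_sdrf_tsv(path: str) -> list[CellLineRecord]:
    # One model instance per SDRF line; an invalid cell raises a
    # ValidationError that points at the offending attribute.
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [CellLineRecord.model_validate(row) for row in reader]

# The JSON round-trip comes for free with Pydantic v2:
record = CellLineRecord(source_name="s1", organism="Homo sapiens", cell_line="HeLa")
as_json = record.model_dump_json()
restored = CellLineRecord.model_validate_json(as_json)
```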

A structure like this can also be tested easily, as each validator can be tested by itself as well as in combination with others in one of the record types (a pytest sketch follows).
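For instance, a unit test for the hypothetical model above could look like this (pytest assumed):

```python
import pytest
from pydantic import ValidationError

# CellLineRecord is the hypothetical model from the first sketch above.

def test_unknown_cell_line_is_rejected():
    with pytest.raises(ValidationError) as excinfo:
        CellLineRecord(source_name="s1", organism="Homo sapiens",
                       cell_line="not-a-cell-line")
    assert "unknown cell line" in str(excinfo.value)

def test_valid_record_passes():
    record = CellLineRecord(source_name="s1", organism="Homo sapiens",
                            cell_line="HeLa")
    assert record.cell_line == "HeLa"
```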

A possible structure could look like this:

|
|-- records
|   |-- cell_line_record.py
|   |-- ...
|-- validators
|   |-- ontology_term_validator.py
|   |-- ...
|-- cli
|-- tests
|-- ...

One additional advantage would be the increased expressiveness of error messages, as the validators produce an exact error message per attribute, which can then be easily addressed by the SDRF creator or, in the case of a new implementation, by the developer.

The modularity would also allow other projects to include only parts of the validation process; e.g. sdrf_convert needs only the parts of the validation covering variables used by the targeted software (Comet, DIA-NN, ...). A sketch of this partial reuse follows.
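For illustration, a converter could import just the validators it needs. The module path follows the proposed layout above; `validate_ontology_term` and the exact column names are hypothetical:

```python
from validators.ontology_term_validator import validate_ontology_term

def check_comet_relevant_columns(row: dict[str, str]) -> list[str]:
    # Validate only the columns the targeted search engine actually consumes.
    errors = []
    for column in ("comment[cleavage agent details]", "comment[instrument]"):
        if column in row and not validate_ontology_term(row[column]):
            errors.append(f"invalid ontology term in {column}: {row[column]!r}")
    return errors
```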

ypriverol commented 6 months ago

Hi @di-hardt, thanks for your comments. I will first (if you are OK with it) move this issue to the sdrf-pipelines repository, because the issue is more related to the Python validator than to the specification.

ypriverol commented 6 months ago

Hi @di-hardt: Firstly, great to have you on board and thinking about contributing to the SDRF validator and converter (sdrf-pipelines). I will continue the discussion in this issue to organize the ideas; then we can split it into smaller issues once we have made decisions. I will reply inline.

> Hey,

> I noticed the SDRF validator is a bit difficult to maintain and to understand in its current state, as it is one long Python script. To make it more maintainable and extensible for future changes and integrations, I would like to propose a refactoring and, at the same time, the integration of a data validation framework such as Pydantic.

Currently, the tool does two main tasks: validating and converting SDRF into different tools' configuration files. In parse_sdrf, both functions are exposed. For each supported tool we have a different package: MaxQuant, OpenMS, MSstats. I agree the code needs a refactor, both of the conversion into dedicated converter classes and of the validator.

The validator was originally built using the pandas_schema library, which makes it possible to define a schema over a pandas DataFrame and perform the validation (a minimal example of that style is sketched below). I am more than happy to move it to Pydantic. However, before moving, I would like to make sure the refactoring is smooth; for example, the pandas_schema approach allows us to easily load the SDRF (TSV) into pandas and pass it to the validators defined in the schema. It would also be nice for other contributors in the project to weigh in on whether it is better to drop pandas_schema in favor of Pydantic.
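For context, this is roughly what the pandas_schema style looks like; the column names and rules here are illustrative, not the actual SDRF schema:

```python
import pandas as pd
from pandas_schema import Column, Schema
from pandas_schema.validation import InListValidation, MatchesPatternValidation

# Define per-column rules over a pandas DataFrame.
schema = Schema([
    Column("source name", [MatchesPatternValidation(r"\S+")]),
    Column("characteristics[organism]",
           [InListValidation(["Homo sapiens", "Mus musculus"])]),
])

sdrf = pd.read_csv("example.sdrf.tsv", sep="\t")
for error in schema.validate(sdrf):  # one entry per failing cell
    print(error)
```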

> Pydantic basically adds validators to class attributes via so-called Field objects. It already supports basic validators such as decimal and length constraints, and there is the option to implement custom ones, e.g. checking the validity of ontology terms.

> There are multiple options for how to implement this. My suggestion is to create one Pydantic model for each SDRF template record type (e.g. cell line, human, plant, ...), where a record corresponds to one line in the SDRF.

This is similar to what we have now with pandas_schema.

> Pydantic already supports JSON, which would fit well with the plans for a JSON representation described in bigbio/proteomics-sample-metadata#696. I have also already looked into automatically parsing CSV/TSV files.

Great.

> A structure like this can also be tested easily, as each validator can be tested by itself as well as in combination with others in one of the record types.

> A possible structure could look like this:
>
> |
> |-- records
> |   |-- cell_line_record.py
> |   |-- ...
> |-- validators
> |   |-- ontology_term_validator.py
> |   |-- ...
> |-- cli
> |-- tests
> |-- ...

Looks OK.

> One additional advantage would be the increased expressiveness of error messages, as the validators produce an exact error message per attribute, which can then be easily addressed by the SDRF creator or, in the case of a new implementation, by the developer.

> The modularity would also allow other projects to include only parts of the validation process; e.g. sdrf_convert needs only the parts of the validation covering variables used by the targeted software (Comet, DIA-NN, ...).

Agreed. Let's plan this. This is a big refactoring that needs to be planned step by step, because the sdrf-pipelines validator is in active development with other contributors and is also in production (in use) in PRIDE, quantms, and the proteomics-metadata validator, with almost 300 downloads per month.

I propose the following:

Hope this comment makes sense. Waiting for your feedback, @di-hardt.

di-hardt commented 5 months ago

Hey @ypriverol,

Thanks for the detailed comments. Makes sense to me; I'm on it.

jpfeuffer commented 5 months ago

I totally support this endeavour. I would love to be able to more easily adapt validators based on downstream applications. For example, validating an SDRF for conversion to a certain downstream tool like MaxQuant or OpenMS may require a set of additional rules for the values in columns, since some may not be supported by that tool. If we could "inherit" the common SDRF validation rules and then add stricter rules for certain tasks, this would be awesome. I am pretty sure such a thing would be feasible with Pydantic.
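A rough sketch of how that inheritance could look with Pydantic v2; `MaxQuantRecord` and the restriction shown are hypothetical, and `CellLineRecord` is the sketch from the first comment:

```python
from pydantic import field_validator

class MaxQuantRecord(CellLineRecord):
    # Inherits the common SDRF validation rules from CellLineRecord and
    # layers a stricter, tool-specific constraint on top.
    @field_validator("organism")
    @classmethod
    def organism_supported_by_tool(cls, value: str) -> str:
        supported = {"Homo sapiens", "Mus musculus"}  # placeholder list
        if value not in supported:
            raise ValueError(f"{value!r} is not supported by this converter")
        return value
```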

ypriverol commented 2 months ago

@di-hardt Have you started on this? We have some time now to start refactoring some of the code, and we think this may be a good first use case.

di-hardt commented 2 months ago

Yes! I started it in the ELIXIR Proteomic Community GitHub organization, as we forked the project a while ago. The new implementation lives in the branch feature-pydantic-based-validation, in the subfolder pydantic_based. By moving functionality piece by piece from the original project into the Pydantic version, I make sure nothing from the original contributions is lost along the way.

I also explain the structure and everything else in the README.

After my vacation I want to finish the sdrf-default use case.