The majority of annotations fail validation against appropriate organism template

levitsky commented 4 years ago

I tried going through the annotated datasets with the sdrf-pipelines validator. On some of them it breaks with a traceback, and on others it shows error messages when run with the appropriate organism template (mostly human).

I think the following needs to be done:

- [x] ~~Implement error codes for easier validation in scripts (0 for no errors, non-zero for schema violations and format violations).~~ (was added but not used)
- [x] Fix exceptions in SDRF validator (I can submit an issue or PR there).
- [x] Fix CI on this project to properly validate all data sets. Right now validation doesn't work at all, it just silently fails with zero exit code which is registered as success. We need to make it
  - [x] work, and
  - [x] use organism templates (still needs improvement).
- [ ] Add mandatory columns in existing annotations, fix other errors.
- [x] Update the validator for new requirements.
- [x] Another issue which may become problematic as we discuss changing the required columns is this: column sets for each "template" are hard-coded in the validator, while this repository has illustrative template files. Should we maybe tie them together somehow, to ensure that they don't diverge in the future?
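
For the exit-code and CI items, a minimal sketch of the idea: collect errors per file and return a non-zero code if any file has errors, so CI cannot register a failed run as success. The function name `exit_code` and the shape of `results` are assumptions for illustration, not the actual interface of sdrf-pipelines or of the script in this repository.

```python
import sys


def exit_code(results):
    """Return 0 only if every file validated without errors.

    `results` is a hypothetical mapping: file name -> list of error messages.
    """
    failed = {name: errs for name, errs in results.items() if errs}
    # Report every problem on stderr so it shows up in the CI log.
    for name, errs in failed.items():
        for err in errs:
            print(f"{name}: {err}", file=sys.stderr)
    return 1 if failed else 0
```

A CI entry point would end with `sys.exit(exit_code(results))`, so that any reported error fails the job.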

ypriverol commented 4 years ago

@levitsky I have quite a lot on my plate now. If you can do PRs, that would be great. Thanks a lot.

ypriverol commented 4 years ago

@levitsky I think it's time to decide the mandatory columns for Human samples. I see that ethnicity is giving errors in most of the datasets. Probably, we need to make two decisions:

- Make ethnicity optional.
- Make the pipeline pass but give the user the recommendation that ethnicity is RECOMMENDED.

ypriverol commented 4 years ago

@levitsky I think the default validation should run against the default template https://github.com/bigbio/proteomics-metadata-standard/blob/master/templates/sdrf-default.tsv I think it is better because it is really the minimum metadata. What do you think?

levitsky commented 4 years ago

I agree that it is a good time to decide, but I would suggest keeping the topics separated. Adding a warning in the validator when RECOMMENDED columns are missing is a good idea, and perhaps this should be tracked separately. It would be great if pandas schema supported warnings; if not, then perhaps a different kind of validation error can be defined and tracked separately.
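
A minimal sketch of the error/warning split, written in plain Python rather than through pandas schema. The function `check_columns` and the column lists passed to it are hypothetical and only illustrate the idea: missing mandatory columns fail validation, missing RECOMMENDED columns are reported but do not.

```python
def check_columns(header, mandatory, recommended):
    """Split missing columns into errors (mandatory) and warnings (recommended).

    Column names are compared case-insensitively after trimming whitespace.
    """
    present = {col.strip().lower() for col in header}
    errors = [
        f"missing mandatory column: {col}"
        for col in mandatory
        if col.lower() not in present
    ]
    warnings = [
        f"missing recommended column: {col}"
        for col in recommended
        if col.lower() not in present
    ]
    return errors, warnings
```

Only a non-empty `errors` list would make the pipeline fail; `warnings` would just be printed for the curator.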

Discussion of ethnicity and developmental stage is in #220. As for ethnicity, it apparently needs to be substituted with ancestry group; if you also want to make it recommended rather than mandatory, it's fine with me, but I don't see a problem either way.

Another open discussion on mandatory columns is #218. I think all arguments have been made and you can just decide how you feel on MS2 analyzer.

After closing these two issues we can update the validator and move forward with fixing existing annotations.

ypriverol commented 4 years ago

Yes, my point here is that we should validate against the SDRF default template, not the Human template. We can add another GitHub Action in the future that detects whether the organism is Human and then checks the Human template. But failing against Human when it is not a human dataset should not be the case. I will go for issue #218.

levitsky commented 4 years ago

> @levitsky I think the default validation should run against the default template https://github.com/bigbio/proteomics-metadata-standard/blob/master/templates/sdrf-default.tsv I think it is better because it is really the minimum metadata. What do you think?

I think that the validation should check for existence of all columns that are defined as mandatory, first for any annotation, then the organism, and including the mandatory mass spec columns. That's what mandatory means.

The other question is how we want to use these template files to demonstrate the minimal requirements. My idea was to add a second validation script specifically for templates, which would check that the templates contain all mandatory columns and only them.
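
A minimal sketch of such a template check, assuming a template is a tab-separated file whose first row holds the column names. The helper names and the mandatory column list passed in are hypothetical; the real list would come from wherever the validator defines it.

```python
import csv


def read_header(handle):
    """Return the column names from the first row of a tab-separated file."""
    return next(csv.reader(handle, delimiter="\t"))


def check_template(header, mandatory):
    """Return (missing, extra) column names for a template header.

    A template passes only if both lists are empty, i.e. it contains
    all mandatory columns and only them.
    """
    cols = {col.strip().lower() for col in header}
    wanted = {col.lower() for col in mandatory}
    return sorted(wanted - cols), sorted(cols - wanted)
```

Running this over every file in the templates directory would flag a template as soon as it drifts from the validator's requirements.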

> Yes, my point here is that we should validate against the SDRF default template, not the Human template. We can add another GitHub Action in the future that detects whether the organism is Human and then checks the Human template. But failing against Human when it is not a human dataset should not be the case.

The SDRF is not checked against the human template if it is not human now (or at least it should not be). The organism is extracted from the file itself. That logic is implemented in validate-all.py. I am planning to extend it to other organisms, but for now the logic is:

1. check with default template;
2. detect organism;
   - if human, check against human template;
   - (other template selection logic to be added here);
3. check mass spec columns.

Whichever stage fails first is the one that defines the error message.
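
The staged logic above can be sketched roughly as follows. This is not the code in validate-all.py: the function name, its arguments, and the column sets are assumptions for illustration, and the organism is passed in directly instead of being extracted from the file.

```python
def validate_staged(header, organism, default_cols, organism_cols, ms_cols):
    """Run the stages in order and report the first one that fails.

    `organism_cols` maps a lower-case organism name to its extra required
    columns; organisms without an entry simply skip that stage.
    Returns (stage_name, missing_columns), or (None, []) if all stages pass.
    """
    present = {col.strip().lower() for col in header}
    stages = [("default", default_cols)]

    # Template selection: add an organism stage only if a template is known.
    key = organism.strip().lower()
    if key in organism_cols:
        stages.append((key, organism_cols[key]))

    stages.append(("mass spectrometry", ms_cols))

    for name, required in stages:
        missing = [col for col in required if col.lower() not in present]
        if missing:
            # The first failing stage defines the error message.
            return name, missing
    return None, []
```

Support for another organism would then be one more entry in `organism_cols`, with no change to the control flow.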

ypriverol commented 4 years ago

I get it now. I need to review all the failing submissions with the other curators (@qinchunyuan). This is high-priority.

levitsky commented 4 years ago

I updated the list of tasks. I can probably update the requirements in the validator and work on organism template selection logic in the script. Also I can later add validation for template files to solve the last task, as I suggested above.

bigbio / proteomics-sample-metadata

The majority of annotations fail validation against appropriate organism template #222