Closed levitsky closed 4 years ago
@levitsky I have quite a lot on my plate right now. If you can open PRs, that would be great. Thanks a lot.
@levitsky I think it's time to decide the mandatory columns for Human samples. I see that ethnicity is causing errors in most of the datasets. Probably, we need to make two decisions:
Make the pipeline pass, but warn the user that ethnicity is RECOMMENDED.
@levitsky I think the default validation should run against the default template https://github.com/bigbio/proteomics-metadata-standard/blob/master/templates/sdrf-default.tsv I think this is better because it really is the minimum metadata. What do you think?
I agree that it is a good time to decide but I would suggest keeping the topics separated.
Adding a warning in the validator when RECOMMENDED columns are missing is a good idea, and perhaps this should be tracked separately. It would be great if pandas schema supported warnings; if not, perhaps a different kind of validation error can be defined and tracked separately.
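To illustrate the idea of reporting missing RECOMMENDED columns separately from missing mandatory ones, here is a minimal sketch in plain Python. It does not use the pandas schema or sdrf-pipelines API; the function name `check_columns` and both column sets are assumptions made up for the example, not the actual requirements.

```python
import csv
import io

# Hypothetical column sets for illustration only; the real lists are
# defined by the specification, not by this sketch.
MANDATORY = {"source name", "characteristics[organism]"}
RECOMMENDED = {"characteristics[ethnicity]"}


def check_columns(sdrf_text):
    """Return (errors, warnings) for the header of a tab-separated SDRF."""
    reader = csv.reader(io.StringIO(sdrf_text), delimiter="\t")
    header = {name.strip().lower() for name in next(reader)}
    # Missing mandatory columns are errors and should fail the validation.
    errors = [f"missing mandatory column: {c}" for c in sorted(MANDATORY - header)]
    # Missing recommended columns are only warnings; validation still passes.
    warnings = [f"missing recommended column: {c}" for c in sorted(RECOMMENDED - header)]
    return errors, warnings


sample = "source name\tcharacteristics[organism]\nsample 1\tHomo sapiens\n"
errors, warnings = check_columns(sample)
print("errors:", errors)
print("warnings:", warnings)
```

With this split, a CI job could fail only when `errors` is non-empty and still print the warnings for the curator.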
Discussion of ethnicity and developmental stage is in #220. As for ethnicity, it apparently needs to be substituted with ancestry group; if you also want to make it recommended rather than mandatory, it's fine with me, but I don't see a problem either way.
Another open discussion on mandatory columns is #218. I think all arguments have been made and you can just decide how you feel about the MS2 analyzer.
After closing these two issues we can update the validator and move forward with fixing existing annotations.
Yes, my point here is that we should validate against the SDRF default template, not the Human template. In the future we can add another GitHub Action that detects whether the organism is Human and then checks the Human template. But failing against Human when it is not a human dataset should not be the case. I will go for issue #218.
> @levitsky I think the default validation should run against the default template https://github.com/bigbio/proteomics-metadata-standard/blob/master/templates/sdrf-default.tsv I think this is better because it really is the minimum metadata. What do you think?
I think that the validation should check for the existence of all columns that are defined as mandatory: first those for any annotation, then those for the organism, and also the mandatory mass spec columns. That's what mandatory means.
The other question is how we want to use these template files to demonstrate the minimal requirements. My idea was to add a second validation script specifically for templates, which would check that the templates contain all mandatory columns and only them.
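As a rough sketch of what such a template check could look like, the function below compares a template header with a list of mandatory columns and reports both missing and extra columns. The name `check_template` and the example column lists are hypothetical and only serve the illustration.

```python
def check_template(template_columns, mandatory_columns):
    """Compare a template header with the mandatory column list.

    Returns (missing, extra): mandatory columns absent from the template,
    and template columns that are not mandatory.
    """
    template = [c.strip().lower() for c in template_columns]
    mandatory = [c.strip().lower() for c in mandatory_columns]
    missing = [c for c in mandatory if c not in template]
    extra = [c for c in template if c not in mandatory]
    return missing, extra


# Hypothetical example: the template lacks one mandatory column
# and carries one column that is not mandatory.
template_header = ["source name", "characteristics[organism]", "characteristics[ethnicity]"]
mandatory = ["source name", "characteristics[organism]", "assay name"]

missing, extra = check_template(template_header, mandatory)
print("missing:", missing)
print("extra:", extra)
```

A template would pass only when both lists are empty, which matches "all mandatory columns and only them".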
> Yes, my point here is that we should validate against the SDRF default template, not the Human template. In the future we can add another GitHub Action that detects whether the organism is Human and then checks the Human template. But failing against Human when it is not a human dataset should not be the case.
Right now, the SDRF is not checked against the human template if it is not human (or at least it should not be). The organism is extracted from the file itself. That logic is implemented in validate-all.py. I am planning to extend it to other organisms, but for now the logic is:
Whichever stage fails first is the one that defines the error message.
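The "first failing stage defines the error message" behaviour can be sketched as follows. This is not the code from validate-all.py: the stage order, the column lists, and the names `validate`, `DEFAULT_COLUMNS`, `ORGANISM_COLUMNS` and `MASS_SPEC_COLUMNS` are all assumptions chosen for the example. The only parts taken from the discussion are that the organism is read from the file itself and that validation stops at the first stage that fails.

```python
import csv
import io

# Hypothetical column lists for illustration only.
DEFAULT_COLUMNS = ["source name", "characteristics[organism]"]
ORGANISM_COLUMNS = {"homo sapiens": ["characteristics[ethnicity]"]}
MASS_SPEC_COLUMNS = ["comment[instrument]"]


def validate(sdrf_text):
    """Return the message of the first failing stage, or None if all pass."""
    rows = list(csv.reader(io.StringIO(sdrf_text), delimiter="\t"))
    header = [name.strip().lower() for name in rows[0]]

    stages = [("default", DEFAULT_COLUMNS)]
    # The organism is taken from the file itself, so an organism-specific
    # stage is added only for organisms that actually occur in it.
    if "characteristics[organism]" in header:
        idx = header.index("characteristics[organism]")
        organisms = {row[idx].strip().lower() for row in rows[1:] if len(row) > idx}
        for organism in sorted(organisms):
            if organism in ORGANISM_COLUMNS:
                stages.append((organism, ORGANISM_COLUMNS[organism]))
    stages.append(("mass spectrometry", MASS_SPEC_COLUMNS))

    # Stop at the first stage with missing columns.
    for stage, required in stages:
        missing = [c for c in required if c not in header]
        if missing:
            return f"{stage}: missing {', '.join(missing)}"
    return None


mouse = "source name\tcharacteristics[organism]\tcomment[instrument]\ns1\tMus musculus\tx\n"
human = "source name\tcharacteristics[organism]\tcomment[instrument]\ns1\tHomo sapiens\tx\n"
print(validate(mouse))
print(validate(human))
```

In this sketch a mouse dataset is never checked against the human column list, while a human dataset without the ethnicity column fails at the organism stage.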
I got it now. I need to review all the failing submissions with the other curators (@qinchunyuan). This is high priority.
I updated the list of tasks. I can probably update the requirements in the validator and work on the organism template selection logic in the script. I can also add validation for template files later to solve the last task, as I suggested above.
I tried going through the annotated datasets with the sdrf-pipelines validator. On some of them, it breaks with a traceback, and on others it shows error messages if run with an appropriate organism template (mostly human).
I think the following needs to be done:
Implement error codes for easier validation in scripts (0 for no errors, non-zero for schema violations and format violations). (Was added but not used.)
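A minimal sketch of the exit-code idea, assuming the validator collects schema and format violations in two separate lists. The specific non-zero values and the names `exit_code`, `EXIT_SCHEMA` and `EXIT_FORMAT` are assumptions; the discussion only requires 0 for no errors and non-zero otherwise.

```python
EXIT_OK = 0
EXIT_SCHEMA = 1  # hypothetical value for schema violations
EXIT_FORMAT = 2  # hypothetical value for format violations


def exit_code(schema_errors, format_errors):
    """Map collected validation errors to a process exit code."""
    # Format problems are reported first here; that priority is an
    # assumption, since a malformed file may not be checkable at all.
    if format_errors:
        return EXIT_FORMAT
    if schema_errors:
        return EXIT_SCHEMA
    return EXIT_OK


# In a command-line script the result would be passed to sys.exit(),
# so that calling scripts and CI jobs can test the return status.
print(exit_code([], []))
print(exit_code(["missing mandatory column"], []))
```

A calling script can then loop over all annotated files and count the failures from the return status alone, without parsing the validator output.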