bigbio / proteomics-sample-metadata

The Proteomics sample metadata: Standard for experimental design annotation in proteomics datasets
GNU General Public License v2.0

The majority of annotations fail validation against appropriate organism template #222

Closed levitsky closed 4 years ago

levitsky commented 4 years ago

I tried going through the annotated datasets with sdrf-pipelines validator. On some of them, it breaks with a traceback, and on others it shows error messages if run with an appropriate organism template (mostly human).

I think the following needs to be done:

ypriverol commented 4 years ago

@levitsky I have quite a lot on my plate now. If you can open PRs, that would be great. Thanks a lot.

ypriverol commented 4 years ago

@levitsky I think it's time to decide the mandatory columns for Human samples. I see that ethnicity is giving errors in most of the datasets. Probably, we need to make two decisions:

Make the pipeline pass but give the recommendation to the user that ethnicity is RECOMMENDED.

ypriverol commented 4 years ago

@levitsky I think the default validation should run against the default template https://github.com/bigbio/proteomics-metadata-standard/blob/master/templates/sdrf-default.tsv I think it is better because it is really the minimum metadata. What do you think?

levitsky commented 4 years ago

I agree that it is a good time to decide, but I would suggest keeping the topics separated. Adding a warning in the validator when RECOMMENDED columns are missing is a good idea, and perhaps this should be tracked separately. It would be great if pandas_schema supported warnings; if not, perhaps a different kind of validation error can be defined and reported separately from the hard failures.
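Roughly what I mean by keeping the two kinds of findings apart, as a sketch in plain Python rather than pandas_schema. The column lists, the `Finding` class, and the function names are placeholders I made up for illustration, not the agreed specification or the validator's API:

```python
from dataclasses import dataclass

# Placeholder column sets for illustration only; the real lists are whatever
# the specification ends up defining.
MANDATORY = ["source name", "characteristics[organism]", "assay name"]
RECOMMENDED = ["characteristics[ethnicity]"]


@dataclass
class Finding:
    level: str    # "ERROR" or "WARNING"
    column: str
    message: str


def check_columns(header):
    """Report missing columns; mandatory ones as errors, recommended ones as warnings."""
    present = {name.strip().lower() for name in header}
    findings = []
    for col in MANDATORY:
        if col not in present:
            findings.append(Finding("ERROR", col, f"mandatory column '{col}' is missing"))
    for col in RECOMMENDED:
        if col not in present:
            findings.append(Finding("WARNING", col, f"recommended column '{col}' is missing"))
    return findings


def has_errors(findings):
    """Only ERROR findings should make the validation run fail."""
    return any(f.level == "ERROR" for f in findings)
```

With something like this, the CI job would print every finding but exit non-zero only when `has_errors` returns true.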

Discussion of ethnicity and developmental stage is in #220. As for ethnicity, it apparently needs to be substituted with ancestry group; if you also want to make it recommended rather than mandatory, it's fine with me, but I don't see a problem either way.

Another open discussion on mandatory columns is #218. I think all arguments have been made and you can just decide how you feel on MS2 analyzer.

After closing these two issues we can update the validator and move forward with fixing existing annotations.

ypriverol commented 4 years ago

Yes, my point here is that we should validate against the SDRF default template, not the Human template. In the future we can add another GitHub Action that detects whether the organism is Human and then checks against the Human template. But failing against the Human template when it is not a human dataset should not be the case. I will go for issue #218.

levitsky commented 4 years ago

> @levitsky I think the default validation should run against the default template https://github.com/bigbio/proteomics-metadata-standard/blob/master/templates/sdrf-default.tsv I think it is better because it is really the minimum metadata. What do you think?

I think that the validation should check for existence of all columns that are defined as mandatory, first for any annotation, then the organism, and including the mandatory mass spec columns. That's what mandatory means.

The other question is how we want to use these template files to demonstrate the minimal requirements. My idea was to add a second validation script specifically for templates, which would check that the templates contain all mandatory columns and only them.
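A sketch of what such a template check could look like, assuming the template is a tab-separated file whose first row holds the column names. The function names and the `mandatory` list passed in are hypothetical, not part of the existing validator:

```python
import csv
import io


def read_header(tsv_text):
    """Return the lower-cased column names from the first row of a TSV document."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [name.strip().lower() for name in next(reader, [])]


def check_template(header, mandatory):
    """Compare a template header with the mandatory columns.

    Returns (missing, extra): mandatory columns absent from the template, and
    template columns that are not mandatory. Both lists are empty when the
    template contains all mandatory columns and only them.
    """
    columns = set(header)
    required = {name.lower() for name in mandatory}
    missing = sorted(required - columns)
    extra = sorted(columns - required)
    return missing, extra
```

A template would pass only when both returned lists are empty; anything in `extra` means the template shows more than the minimal requirements.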

> Yes, my point here is that we should validate against the SDRF default template, not the Human template. In the future we can add another GitHub Action that detects whether the organism is Human and then checks against the Human template. But failing against the Human template when it is not a human dataset should not be the case.

Right now the SDRF is not checked against the human template if it is not human (or at least it should not be). The organism is extracted from the file itself. That logic is implemented in validate-all.py. I am planning to extend it to other organisms, but for now the logic is:

Whichever stage fails first is the one that defines the error message.
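To illustrate the "first failing stage defines the message" behaviour, here is a sketch. It is not the contents of validate-all.py: the stage order follows the one I suggested above (any annotation, then organism, then mass spec), and every column set and function name is a placeholder:

```python
import csv
import io

# Placeholder requirements, for illustration only.
DEFAULT_COLUMNS = ["source name", "characteristics[organism]", "assay name"]
ORGANISM_COLUMNS = {"homo sapiens": ["characteristics[ethnicity]"]}
MASS_SPEC_COLUMNS = ["comment[instrument]"]

ORGANISM_FIELD = "characteristics[organism]"


def parse(text):
    """Split TSV text into a lower-cased header and the data rows."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return [], []
    return [h.strip().lower() for h in rows[0]], rows[1:]


def organisms(header, rows):
    """Collect the organisms named in the file itself."""
    if ORGANISM_FIELD not in header:
        return set()
    idx = header.index(ORGANISM_FIELD)
    return {r[idx].strip().lower() for r in rows if len(r) > idx and r[idx].strip()}


def validate(text):
    """Run the stages in order; return the first failure message, or None."""
    header, rows = parse(text)
    stages = [("default", DEFAULT_COLUMNS)]
    # An organism stage is added only for organisms actually found in the file.
    for org in sorted(organisms(header, rows)):
        if org in ORGANISM_COLUMNS:
            stages.append((org, ORGANISM_COLUMNS[org]))
    stages.append(("mass spectrometry", MASS_SPEC_COLUMNS))
    for name, required in stages:
        absent = [c for c in required if c not in header]
        if absent:
            return f"{name} stage failed: missing {', '.join(absent)}"
    return None
```

In this sketch a mouse dataset never reaches the human stage, because the stage list is built from the organisms found in the file.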

ypriverol commented 4 years ago

I got it now. I need to review all the submissions that are failing with the other curators (@qinchunyuan). This is high-priority.

levitsky commented 4 years ago

I updated the list of tasks. I can probably update the requirements in the validator and work on organism template selection logic in the script. Also I can later add validation for template files to solve the last task, as I suggested above.