bigbio / proteomics-sample-metadata

The Proteomics sample metadata: Standard for experimental design annotation in proteomics datasets
GNU General Public License v2.0

48 controlled vocabulary issues revealed #628

Closed wraff closed 2 years ago

wraff commented 2 years ago

Dear all,

after mentioning a few times that there are issues with inconsistent vocabulary usage, I finally read all SDRF tables and compared all labels/column names for consistency (using my R package wrProteo). This revealed that 48 out of 189 SDRF annotations currently have some inconsistencies, in some cases affecting more than 10 columns. Based on the most frequently used terms (lower case), I created 48 separate issues, each specifying precisely which column names should be changed to which controlled vocabulary terms.

There may be even more issues (like MS2 vs MS-MS); here I focused on the obvious upper/lower-case issues. I was surprised by the high number of inconsistent entries, so I suggest regularly checking all entries for consistent format.

Best greetings, Wolfgang Raffelsberger
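The comparison described above was done with the R package wrProteo. The following is only a minimal Python sketch of the same idea, not that package's implementation. It assumes the column headers of each SDRF file have already been read into a dict; the file names and the function names `find_case_variants` and `files_needing_changes` are hypothetical, and the columns are illustrative.

```python
from collections import Counter, defaultdict


def find_case_variants(headers_by_file):
    """Group column names that differ only by case.

    Returns {lowercased name: Counter of the spellings seen}, restricted to
    names that appear with more than one spelling across all files.
    """
    spellings = defaultdict(Counter)
    for columns in headers_by_file.values():
        for name in columns:
            cleaned = name.strip()
            spellings[cleaned.lower()][cleaned] += 1
    return {key: seen for key, seen in spellings.items() if len(seen) > 1}


def files_needing_changes(headers_by_file):
    """Map each file to the renames that would make its header all lowercase."""
    report = {}
    for path, columns in headers_by_file.items():
        renames = {c: c.lower() for c in columns if c != c.lower()}
        if renames:
            report[path] = renames
    return report


# Hypothetical input: two files whose headers differ only in capitalization.
example = {
    "example_a.sdrf.tsv": ["source name", "characteristics[organism]", "comment[label]"],
    "example_b.sdrf.tsv": ["Source Name", "Characteristics[organism]", "comment[label]"],
}

variants = find_case_variants(example)
renames = files_needing_changes(example)
```

Here `variants` lists the column names that occur with more than one spelling, and `renames` gives, per file, the columns that would have to change to match the recommended lowercase form.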

StSchulze commented 2 years ago

Hi Wolfgang,

Based on the description of the file format (https://github.com/bigbio/proteomics-metadata-standard/blob/master/sdrf-proteomics/README.adoc#81-sdrf-proteomics-format-rules), SDRF files are case insensitive:

Case sensitivity: By specification the SDRF is case insensitive, but we RECOMMEND using lowercase characters throughout all the text (Column names and values).

I agree that it would be nice to have consistency between the files (I guess that's also why lowercase is recommended), but since this is not enforced in the file format, I don't think that it needs to be checked or modified.
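Since the specification treats SDRF as case insensitive, a tool reading these files would be expected to match column names without regard to case. A minimal sketch of such a lookup, assuming the header is available as a list of strings (the function name `find_column` and the example header are hypothetical):

```python
def find_column(columns, wanted):
    """Return the index of `wanted` in `columns`, ignoring case and outer whitespace.

    Returns None when no column matches.
    """
    target = wanted.strip().casefold()
    for index, name in enumerate(columns):
        if name.strip().casefold() == target:
            return index
    return None


# Hypothetical header with mixed capitalization.
header = ["Source Name", "Characteristics[Organism]", "comment[label]"]

organism_index = find_column(header, "characteristics[organism]")
```

With a reader like this, mixed-case headers are still parsed correctly, which is why the capitalization differences are a consistency concern rather than a format violation.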

ypriverol commented 2 years ago

Hi @wraff, I have been extremely busy with PRIDE-related topics. Thanks a lot for your comments; I will review these inconsistencies in all SDRFs.

Most of these inconsistencies, as @StSchulze commented, come from RECOMMENDED behaviors of the format rather than from actual rules. Some of them also come from the evolution of the examples. But, as you said, they must be reviewed more often.

I will create a PR updating and correcting some of these issues and I will add you to the loop for review.