HUPO-PSI / mzML

Repository for mzML and the corresponding examples

`xs:ID` type of the `id` attribute of `RunType` makes many mzML files invalid when the file or sample name starts with a digit. #8

Closed hechth closed 1 year ago

hechth commented 1 year ago

We're currently implementing a validator tool in Galaxy that takes an mzML file and validates it against the XSD schema using pyxml or xmllinter. We found that the <run> element has an id attribute of type xs:ID, which means its value can't start with a digit. However, ProteoWizard seems to fill this attribute with the sample name, which often does start with a digit (for example the position, the order in the study, or a timestamp). As a result, many mzML files whose names start with a digit are technically invalid.
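To illustrate the constraint: xs:ID derives from xs:NCName, so a value has to begin with a letter or underscore and must not contain a colon. Below is a minimal Python sketch of that rule. It assumes ASCII-only names (the full NCName production also admits many non-ASCII letters), and the helper name `looks_like_xs_id` is hypothetical.

```python
import re

# ASCII-only approximation of the NCName production that xs:ID derives from.
# Assumption: real NCNames also allow many non-ASCII letters, which this rejects.
_NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def looks_like_xs_id(value: str) -> bool:
    """Return True if value matches the ASCII subset of NCName."""
    return _NCNAME_ASCII.fullmatch(value) is not None


if __name__ == "__main__":
    # Made-up run ids, only for illustration.
    for run_id in ["sample_01", "01_sample", "20210315_blank"]:
        print(run_id, looks_like_xs_id(run_id))
```

Ids that start with a digit, such as a leading timestamp or plate position, fail this check even though they are perfectly reasonable sample names.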

I personally don't see a reason why the id of the RunType can't be an xs:string. Was there a specific reason for the decision?

I'd therefore like to propose changing the id attribute of the RunType from xs:ID to xs:string.

If you agree with the change, I can open a PR with the requested changes to the XSD file. Is there anything else that has to be adapted to make this change?

maximskorik commented 1 year ago

Also, the main purpose of the xs:ID type is to guarantee that an identifier is unique within the document. However, since maxOccurs is not specified on the declaration of the run element, there can be only one run element per mzML file, which makes the unique identifier unnecessary.

Declaration of the run element in mzML1.1.0.xsd (where maxOccurs would be specified):

<xs:sequence>
  ...
  <xs:element name="run" type="dx:RunType" />
</xs:sequence>

From https://www.w3.org/TR/xmlschema-0/: "The default value for both the minOccurs and the maxOccurs attributes is 1."
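The defaulting rule can be checked mechanically. Here is a rough sketch that reads the occurrence bounds of named element declarations from an XSD, falling back to 1 when an attribute is absent. The schema fragment and the function name `occurrence_bounds` are made up for illustration and are not copied from mzML1.1.0.xsd.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Illustrative fragment only; NOT taken from mzML1.1.0.xsd.
FRAGMENT = f"""
<xs:schema xmlns:xs="{XS}">
  <xs:complexType name="ExampleType">
    <xs:sequence>
      <xs:element name="run" type="xs:string"/>
      <xs:element name="note" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
"""


def occurrence_bounds(xsd_text: str) -> dict:
    """Map each named xs:element to (minOccurs, maxOccurs), defaulting both to "1"."""
    root = ET.fromstring(xsd_text)
    bounds = {}
    for elem in root.iter(f"{{{XS}}}element"):
        name = elem.get("name")
        if name is None:
            continue  # skip declarations that use ref= instead of name=
        bounds[name] = (elem.get("minOccurs", "1"), elem.get("maxOccurs", "1"))
    return bounds


if __name__ == "__main__":
    print(occurrence_bounds(FRAGMENT))
```

An element declared without either attribute, like `run` in the fragment, therefore has to occur exactly once.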

Another reason why xs:ID might be needed is to have a target for IDREF or IDREFS attributes. However, I couldn't find any element that targets the ID of the run.

Thus, it appears that xs:ID may be safely replaced with xs:string. Note that switching the type to xs:string would require adding a pattern restriction (a regexp) to ensure the id doesn't contain any whitespace.
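One way to picture such a restriction is a pattern that accepts only non-empty values without whitespace. The sketch below is an assumption about what the check could look like, not what was actually adopted in the schema; the helper name `is_whitespace_free_id` is hypothetical. The character class mirrors the XSD regex notion of whitespace (space, tab, newline, carriage return) rather than Python's broader `\s`.

```python
import re

# XSD regular expressions treat only space, tab, newline and carriage return
# as whitespace, so the class is spelled out instead of using Python's \S.
_NO_WHITESPACE = re.compile(r"[^ \t\n\r]+")


def is_whitespace_free_id(value: str) -> bool:
    """Return True if value is non-empty and contains none of the four XSD whitespace characters."""
    return _NO_WHITESPACE.fullmatch(value) is not None


if __name__ == "__main__":
    # Made-up ids, only for illustration.
    for run_id in ["01_sample", "my sample", "run\t1"]:
        print(repr(run_id), is_whitespace_free_id(run_id))
```

Under this rule an id with a leading digit is accepted, while an id containing a space or tab is still rejected.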

edeutsch commented 1 year ago

Closing since addressed in PR #9