Proposed changes for mzml 1.1.1

hechth commented 1 year ago

Closes #8

Changed the type of the id attribute to be a string allowing everything but whitespace characters and replacing everything that is a whitespace with a single space (which is invalid).

edeutsch commented 1 year ago

Since it is challenging to see the differences because these are new files, for the record, the changes are: 1) Bump in version numbers 2) Change from Windows line-endings to Unix line-endings. Will this cause a problem for anyone??? 3) Change:

  <xs:complexType name="RunType">
        <xs:attribute name="id" type="xs:ID" use="required">
          <xs:annotation>
            <xs:documentation>A unique identifier for this run.</xs:documentation>

to:

        <xs:attribute name="id" use="required" type="xs:string">
          <xs:annotation>
            <xs:documentation>A unique string identifier for this run.</xs:documentation>

The primary reason for this is that many converters simply use the filename as the id, but xs:ID limits the permitted characters and as a result many mzML files in the wild are invalid because the encoded run id uses disallowed characters.

Since there seems little reason to insist that run id conforms to xs:ID, we will just change it to xs:string.

Does this seem sensible to @chambm and others?

mobiusklein commented 1 year ago

An XML parser should not (edited) be affected by changing the line ending.

All seems well.

On Fri, Mar 10, 2023 at 1:34 PM Eric Deutsch @.***> wrote:

@.**** approved this pull request.

Seems good to me. Are we all fine with the change in line endings from Windows to Unix?

— Reply to this email directly, view it on GitHub https://github.com/HUPO-PSI/mzML/pull/9#pullrequestreview-1335410575, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAK4E6KOAT2D7EBWRKNQMRLW3NX2HANCNFSM6AAAAAAVWNOEQM . You are receiving this because your review was requested.Message ID: @.***>

hechth commented 1 year ago

@edeutsch @mobiusklein Thanks for having another look at it!

chambm commented 1 year ago

You're talking about the line endings in the schema itself changing? I can't imagine that causing a problem for any validating parser, but note that pwiz doesn't use one (it doesn't validate against the schema). Pwiz actually does encode the XML ID and IDREFs properly (escaping the starting character if it's not alphabetical or '_') but I can certainly believe a lot of other file writers don't.

Note that an empty string is not a valid ID, but is a valid xs:string. It probably needs a constraint added to disallow that.

edeutsch commented 1 year ago

I may have been wrong about the line endings. This may have been an artifact of how I was trying to diff the files.

But the main issue is the run id change from xs:ID to xs:string. Any objections?

Your point about the empty string is useful. Do we want to take step to disallow that?

It is also useful to learn the msconvert does follow the xs:ID convention. What do you do when the filename begins with a number?

chambm commented 1 year ago

It is also useful to learn the msconvert does follow the xs:ID convention. What do you do when the filename begins with a number?

We replace it with underscore _x0000 and the zeros get replaced by the hexadecimal value of the original character (it would probably mess up on non-ASCII input though). Our xml reader reverses that encoding to get the original value back. Note that the non-first characters can only be A-Za-z0-9, underscore, and hyphen. So we have to do a lot of encoding in the middle of the ID too.

mobiusklein commented 1 year ago

There's no reason to allow it to be empty, as that'd violate the other semantics of xsd:ID. I've added the necessary XSD directive.

chambm commented 1 year ago

One nice thing about this change is it'll be backward compatible. Any xs:ID is also a valid xs:string. So even if I don't change pwiz it'll be creating valid 1.1.1 files. Although it won't necessarily fix existing writers that are writing 1.1.0 files without proper xs:ID semantics (I mean you have to get each one of those writers to start writing 1.1.1 as the version instead).

edeutsch commented 1 year ago

So we now have this thanks to Joshua:

        <xs:attribute name="id" use="required">
          <xs:annotation>
            <xs:documentation>A unique string identifier for this run.</xs:documentation>
          </xs:annotation>
          <xsd:simpleType>
              <xsd:restriction base="xsd:string">
                  <xsd:minLength value="1"/>
              </xsd:restriction>
          </xsd:simpleType>
        </xs:attribute>

with the minLength 1 added.

Is everyone okay with this? Approvals? comments?

hechth commented 1 year ago

thanks for all to look into this! @chambm if that is the case the error must lie somewhere else - probably in the ThermoRawFileConverter tool.

HUPO-PSI / mzML

Proposed changes for mzml 1.1.1 #9