RECETOX / galaxytools

Set of Galaxy tool wrappers developed at RECETOX
MIT License

add mzml-validation tool #319

Closed maximskorik closed 1 year ago

maximskorik commented 1 year ago

Description

This PR adds a Galaxy tool to validate mzML files against HUPO XML Schema Definition (XSD) versions 1.1.1 and 1.1.0 (fetched from https://www.psidev.info/mzML).

The tool:

maximskorik commented 1 year ago

Is the openms tool maybe able to do it? openms_xmlvalidator ?

@bgruening, that tool seems to be what we need, and I somehow overlooked it when looking for a solution among existing Galaxy tools. However, I can't make it work, neither with mzML files and schemas nor with some simple xml-xsd pairs (e.g., this one: https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms764613(v=vs.85)).

As a general question, would you be interested in upgrading this tool to be a validator for any XML schema? You could ship a few default schemas but also make the schema an input, turning this into a very generic tool that can be used by many communities. Maybe even contribute it to IUC.

Sure, I wouldn't mind making the tool more generic if it can be useful for a greater community. @hechth, what do you think?
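
For illustration only, the idea of shipping a few default schemas while also accepting a schema as input could be sketched roughly as below. This is not code from the PR: the function name `resolve_schema`, the `BUNDLED_SCHEMAS` table, and the file paths in it are hypothetical placeholders, and the actual validation step is deliberately left out.

```python
from pathlib import Path
from typing import Optional

# Hypothetical bundled defaults; the paths are placeholders, not the PR's layout.
BUNDLED_SCHEMAS = {
    "1.1.0": Path("schemas/mzML1.1.0.xsd"),
    "1.1.1": Path("schemas/mzML1.1.1.xsd"),
}


def resolve_schema(user_schema: Optional[str], bundled_version: Optional[str]) -> Path:
    """Pick the XSD to validate against.

    Exactly one of the two arguments must be given: either a path to a
    user-supplied schema, or the version key of a bundled default.
    """
    if user_schema and bundled_version:
        raise ValueError("give either a custom schema or a bundled version, not both")
    if user_schema:
        return Path(user_schema)
    if bundled_version in BUNDLED_SCHEMAS:
        return BUNDLED_SCHEMAS[bundled_version]
    raise ValueError(f"unknown bundled schema version: {bundled_version!r}")
```

In a Galaxy wrapper, the two arguments would presumably map to a conditional input (bundled schema vs. schema from the history), but that mapping is an assumption and not something stated in this thread.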

bgruening commented 1 year ago

@bgruening, that tool seems to be what we need, and I somehow overlooked it when looking for a solution among existing Galaxy tools. However, I can't make it work, neither with mzML files and schemas nor with some simple xml-xsd pairs (e.g., this one: https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms764613(v=vs.85)).

Do you have an error report? Have you tried this on EU? We have contacts with the devs if this is relevant.

maximskorik commented 1 year ago

Do you have an error report? Have you tried this on EU? We have contacts with the devs if this is relevant.

Yes, on EU. I managed to make it work partially; there were problems with my test sample. The tool works with generic xml-xsd pairs and with mzML-xsd v1.1.0 pairs. It still fails to validate mzML v1.1.1 files. I suspect this is because the v1.1.1 schema contains a reference to v1.1.0, which is absent at runtime. I don't know why that's not an issue with our validator.

Here's the history with failed v1.1.1 validations: https://usegalaxy.eu/u/ac0ea6f59b164798b8fba7d76d2a6fad/h/mzml-validation-1
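
One way to check the suspicion above (a schema referencing another schema file that is not available at runtime) is to list the `xs:include`, `xs:import`, and `xs:redefine` references in the XSD and see which relative ones are missing next to it. The following is a minimal sketch using only the Python standard library; the helper names `schema_references` and `missing_references` are made up for this example, and the sketch only inspects top-level references without validating anything.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace of XML Schema elements such as xs:include and xs:import.
XS = "{http://www.w3.org/2001/XMLSchema}"


def schema_references(xsd_path):
    """Return the schemaLocation values of top-level include/import/redefine elements."""
    root = ET.parse(str(xsd_path)).getroot()
    refs = []
    for tag in ("include", "import", "redefine"):
        for element in root.findall(XS + tag):
            location = element.get("schemaLocation")
            if location:
                refs.append(location)
    return refs


def missing_references(xsd_path):
    """Return relative references that do not exist next to the schema file.

    Absolute URLs (anything containing '://') are skipped, since whether they
    resolve depends on network access in the job environment.
    """
    base = Path(xsd_path).parent
    return [
        location
        for location in schema_references(xsd_path)
        if "://" not in location and not (base / location).exists()
    ]
```

If `missing_references` returns a non-empty list for the v1.1.1 schema in the job's working directory, that would support the explanation; an empty list would point elsewhere.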

hechth commented 1 year ago

@maximskorik sure, I think making a general purpose xml-xsd validator tool would be nice.

hechth commented 1 year ago

The openms_fileinfo tool also seems to struggle with new orbi mzml files: https://umsa.cerit-sc.cz/u/hechth/h/20230119-openms-fileinfo-test

hechth commented 1 year ago

@bernt-matthias are the openms galaxy tools somehow auto generated or manually curated?

hechth commented 1 year ago

@bgruening and @maximskorik I think this can be merged and maybe we can make a general purpose xml validator in the next iteration?

I also think that this tool is somewhat complementary to the openms_fileinfo tool @bernt-matthias and @sneumann

bgruening commented 1 year ago

The tool on its own is great, just added two more comments. An extension to be more general would be great, maybe create an issue for that?

bernt-matthias commented 1 year ago

@bernt-matthias are the openms galaxy tools somehow auto generated or manually curated?

Yes, they are auto-generated.

The conversion of the tools happens here: https://github.com/galaxyproteomics/tools-galaxyp/blob/423304b26e63d23cd8e5fb4c2fb729c5beea1254/tools/openms/generate.sh#L62. It is based on the CTD files that are written by the OpenMS tools (https://github.com/galaxyproteomics/tools-galaxyp/blob/423304b26e63d23cd8e5fb4c2fb729c5beea1254/tools/openms/test-data.sh#L143).

hechth commented 1 year ago

The tool on its own is great, just added two more comments. An extension to be more general would be great, maybe create an issue for that?

I guess we can create an issue on the IUC GitHub repo and start contributing such general-purpose tools there?

bgruening commented 1 year ago

The tool on its own is great, just added two more comments. An extension to be more general would be great, maybe create an issue for that?

I guess we can create an issue on the IUC GitHub repo and start contributing such general-purpose tools there?

Yes :)

hechth commented 1 year ago

@bernt-matthias Yeah, that makes sense. There are plenty of them, and I assume it would be quite hard to update all of them manually.

bernt-matthias commented 1 year ago

There is also https://github.com/galaxyproteomics/tools-galaxyp/blob/423304b26e63d23cd8e5fb4c2fb729c5beea1254/tools/openms/SemanticValidator.xml

The failing OpenMS XMLValidator might be caused by a tool bug. On the command line, the schema files are given the .bioml extension, because the automatic mapping between OpenMS and Galaxy datatypes (i.e., extensions) fails here (https://github.com/galaxyproteomics/tools-galaxyp/blob/423304b26e63d23cd8e5fb4c2fb729c5beea1254/tools/openms/XMLValidator.xml#L21).

I could try to fix this, maybe in https://github.com/galaxyproteomics/tools-galaxyp/pull/697. Is this desired?

Maybe someone could test the tool on the command line first. If you have some file pairs, I could do it as well.

Manually curated tests could be added here: https://github.com/galaxyproteomics/tools-galaxyp/blob/423304b26e63d23cd8e5fb4c2fb729c5beea1254/tools/openms/aux/macros_test.xml#L568. Ideally, though, we would add them upstream, since the Galaxy tests are also autogenerated from the OpenMS test command lines in this file: https://github.com/OpenMS/OpenMS/blob/develop/src/tests/topp/CMakeLists.txt

Note that the tool claims to check against the latest schema of the corresponding type.

bernt-matthias commented 1 year ago

XMLValidator should/could be the general-purpose tool?