PRIDE-Archive / xi-mzidentml-converter

Apache License 2.0

MzIdentML Validation Feature #78

Open sureshhewabi opened 1 week ago

sureshhewabi commented 1 week ago

This is a simple start; let's use this issue to discuss validation.

colin-combe commented 1 week ago

one issue here is that to fully validate the mzIdentML file you also need the peaklists, e.g. https://github.com/PRIDE-Archive/xi-mzidentml-converter/issues/81

colin-combe commented 1 week ago

what are people's views on how to deal with that? Two alternatives would be:

colin-combe commented 1 week ago

what if we concentrate on (a) 'Validate files of a given folder(Input will be file path)', and this folder must also contain the peaklist files?

This is easiest to do because it's most like how the converter already works.

Also, if it just stops after the first error, then that's easier.

Thoughts on this?
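
For illustration, the folder-based, stop-at-first-error approach could look roughly like the sketch below. It is not the converter's code: `validate_folder` and `ValidationError` are hypothetical names, it assumes the mzIdentML files use a `.mzid` extension, and it only checks that each `SpectraData` location resolves to a file in the same folder (no schema or content checks).

```python
import xml.etree.ElementTree as ET
from pathlib import Path


class ValidationError(Exception):
    """Raised on the first problem found; nothing after it is checked."""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix so any mzIdentML schema version matches."""
    return tag.rsplit("}", 1)[-1]


def validate_folder(folder: str) -> int:
    """Check that every peaklist referenced by a *.mzid file is in `folder`.

    Returns the number of mzIdentML files checked (hypothetical helper).
    """
    folder_path = Path(folder)
    mzid_files = sorted(folder_path.glob("*.mzid"))
    if not mzid_files:
        raise ValidationError(f"no .mzid files found in {folder_path}")
    for mzid in mzid_files:
        try:
            root = ET.parse(mzid).getroot()
        except ET.ParseError as err:
            raise ValidationError(f"{mzid.name}: not well-formed XML ({err})") from err
        for elem in root.iter():
            if local_name(elem.tag) != "SpectraData":
                continue
            location = elem.get("location", "")
            # Keep only the file name: the location is often a path or file URI
            # from the machine that ran the search.
            peaklist = location.replace("\\", "/").rsplit("/", 1)[-1]
            if not peaklist or not (folder_path / peaklist).is_file():
                raise ValidationError(
                    f"{mzid.name}: peaklist '{peaklist}' not found in {folder_path}"
                )
    return len(mzid_files)
```

Raising an exception on the first failure keeps the control flow trivial, which is the "that's easier" part; collecting all errors would need a report structure instead.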

colin-combe commented 5 days ago

@sureshhewabi - https://github.com/PRIDE-Archive/xi-mzidentml-converter/pull/82 - you can take a look at what I've done there

that PR gives a command line validation option.

So, as a first attempt, I think it covers 1. (a), (c), (d), (e) to a very limited extent, and (f) above. We could live without 1. (b) in the short term. 1. (g), as I read it, isn't really validation but summary stats; these could be got by querying the SQLite DB.

For 2. above, info is printed to standard output; I think it currently includes the logging info we usually see from the converter.

It's not extensively tested. It passes the file Diogo provided. It fails the schema-invalid Kojak file.
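
As a rough idea of how the summary stats mentioned for 1. (g) might be pulled from the SQLite DB, the sketch below just counts rows per table. It deliberately assumes nothing about the converter's schema: `table_row_counts` is a hypothetical helper that reads the table names from SQLite's own `sqlite_master` catalogue.

```python
import sqlite3
from typing import Dict


def table_row_counts(db_path: str) -> Dict[str, int]:
    """Return {table name: row count} for every user table in the database."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        counts = {}
        for name in names:
            # Table names cannot be bound as parameters, so quote them instead.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        return counts
    finally:
        conn.close()
```

More specific numbers (matches per spectrum, modified peptides and so on) would need queries written against the real table layout.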

sureshhewabi commented 5 days ago

> what if we concentrate on (a) 'Validate files of a given folder(Input will be file path)', and this folder must also contain the peaklist files?
>
> This is easiest to do because it's most like how the converter already works.
>
> Also, if it just stops after the first error, then that's easier.
>
> Thoughts on this?

Yes I agree with that

colin-combe commented 5 days ago

> I agree with that

good, that's the way it works in that PR

colin-combe commented 3 days ago

It's not currently rejecting files that don't have the sequences in Seq elements (that additional requirement of ours), which means they break later. (Also of no use to PDB-IHM without sequences?) I'll need to change it so it rejects these.
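
A minimal sketch of that extra check, assuming (as in the mzIdentML schema) that the sequence is carried in a `Seq` child of each `DBSequence`. The function name `db_sequences_missing_seq` is made up for illustration and this is not the code in the PR.

```python
import xml.etree.ElementTree as ET
from typing import List


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix so any mzIdentML schema version matches."""
    return tag.rsplit("}", 1)[-1]


def db_sequences_missing_seq(mzid_xml: str) -> List[str]:
    """Return the ids of DBSequence elements with no non-empty Seq child."""
    root = ET.fromstring(mzid_xml)
    missing = []
    for elem in root.iter():
        if local_name(elem.tag) != "DBSequence":
            continue
        # Take the text of the first Seq child, or "" if there is none.
        seq_text = next(
            (child.text or "" for child in elem if local_name(child.tag) == "Seq"),
            "",
        )
        if not seq_text.strip():
            missing.append(elem.get("id", "<no id>"))
    return missing
```

A validator could reject the file as soon as this list is non-empty, so the failure is reported up front rather than as a later breakage.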

colin-combe commented 3 days ago

Also, I think I've found another requirement specific to our system - that all Modifications have masses given.

colin-combe commented 3 days ago

> Also, I think I've found another requirement specific to our system - that all Modifications have masses given.

hmm, I think we shouldn't add that as a requirement; rather, the spectrum viewer is broken in some cases at the moment. (There are other ways the modification masses could be recovered, like the UNIMOD accessions, I think.)
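
To get a feel for how many files this affects, something like the sketch below could list `Modification` elements that carry no mass delta and show whether a UNIMOD accession is available as a fallback. It assumes the mass sits in the `monoisotopicMassDelta` or `avgMassDelta` attribute and the accession in a `cvParam` child with `cvRef="UNIMOD"`; `modifications_without_mass` is a hypothetical helper, and actually looking up the mass for an accession is left out.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix so any mzIdentML schema version matches."""
    return tag.rsplit("}", 1)[-1]


def modifications_without_mass(mzid_xml: str) -> List[Tuple[str, Optional[str]]]:
    """Return (peptide id, UNIMOD accession or None) per Modification lacking a mass."""
    root = ET.fromstring(mzid_xml)
    found = []
    for peptide in root.iter():
        if local_name(peptide.tag) != "Peptide":
            continue
        for mod in peptide:
            if local_name(mod.tag) != "Modification":
                continue
            if mod.get("monoisotopicMassDelta") or mod.get("avgMassDelta"):
                continue  # a mass is given, nothing to recover
            # First UNIMOD cvParam accession, if the modification has one.
            unimod = next(
                (
                    cv.get("accession")
                    for cv in mod
                    if local_name(cv.tag) == "cvParam" and cv.get("cvRef") == "UNIMOD"
                ),
                None,
            )
            found.append((peptide.get("id", "<no id>"), unimod))
    return found
```

Entries with an accession are the recoverable ones; entries with `None` are the cases where a missing mass would be a real problem.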