PRIDE-Archive / xi-mzidentml-converter

Apache License 2.0

Datasets from SIM-XL, Mascot and Proteome Discoverer in PRIDE #63

Open ypriverol opened 3 months ago

ypriverol commented 3 months ago

We have to find out the list of datasets with the following conditions:

Please let's keep the list in this issue up to date.

sureshhewabi commented 3 months ago

Datasets in PRIDE that have the word "crosslink" or "cross-link" in the TITLE and that contain mzIdentML files:

Ordered by priority:

Need to check version 1.2, the corresponding peak list and producer

colin-combe commented 3 months ago

as noted in meeting, they might not be complete submissions (what does the strikeout represent above? PXD018935 / PXD012759)

colin-combe commented 3 months ago

"crosslink" or "cross-link" word in TITLE

wasn't there a "crosslink" tag people were referring to? (I don't know but people spoke of this)

ypriverol commented 3 months ago

We will continue with different combinations. We will let you know when errors start to happen.

colin-combe commented 3 months ago

OK, great, thanks!

ypriverol commented 3 months ago

@sureshhewabi reported the following error in this one:

PXD014359 - Error parsing C_Lee_141014_CRM_dialysis_NCE20_2.mzid
MzIdParseException ('XMLSyntaxError', ("Start tag expected, '<' not found, line 1, column 1",))
2024-04-04 09:35:10 - main - ERROR - ('XMLSyntaxError', ("Start tag expected, '<' not found, line 1, column 1",))

colin-combe commented 3 months ago

I guess the error message is correct and the file is not valid XML.
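
One quick way to test that guess is to look at the first few bytes of the file, since the parser complains at line 1, column 1. The following is a minimal sketch, not code from the converter: `sniff_leading_bytes` is a hypothetical helper, and the signatures it checks (compression magic numbers, byte-order marks) are an assumption about plausible culprits, not a diagnosis of these particular files.

```python
import codecs


def sniff_leading_bytes(path, n=64):
    """Describe the first bytes of a file that an XML parser rejected.

    Hypothetical helper for illustration; not part of xi-mzidentml-converter.
    """
    with open(path, "rb") as fh:
        head = fh.read(n)
    if not head:
        return "empty file"
    if head.startswith(b"\x1f\x8b"):          # gzip magic number
        return "gzip-compressed"
    if head.startswith(b"PK\x03\x04"):        # zip local file header
        return "zip archive"
    if head.startswith(codecs.BOM_UTF8):
        return "UTF-8 with BOM"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF-16 with BOM"
    if head.lstrip().startswith(b"<"):
        return "looks like XML"
    # Anything else: show the raw bytes so a human can judge.
    return "unknown leading bytes: %r" % head[:16]
```

Calling `sniff_leading_bytes("C_Lee_141014_CRM_dialysis_NCE20_2.mzid")` on a local copy would show whether the file starts with `<` at all, or whether it is something else (for example a compressed file) under an `.mzid` name.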

ypriverol commented 3 months ago

Another similar error, for OpenMS:

Error parsing XLpeplib_Beveridge_QEx-HFX_DSS_R1.mzid
MzIdParseException ('XMLSyntaxError', ("Start tag expected, '<' not found, line 1, column 1",))
2024-04-04 14:28:13 - parser.process_dataset - ERROR - ('XMLSyntaxError', ("Start tag expected, '<' not found, line 1, column 1",))

sureshhewabi commented 3 months ago

The XLpeplib_Beveridge_QEx-HFX_DSS_R1.mzid file from PXD021417 seems to be schema-valid:

xmllint --noout --schema mzIdentML1.2.0.xsd XLpeplib_Beveridge_QEx-HFX_DSS_R1.mzid > XLpeplib_Beveridge_QEx-HFX_DSS_R1_output_file1.txt 2>&1

So there must be an issue with the parser. @colin-combe any idea?

colin-combe commented 3 months ago

Yes, it could be an issue with the parser. Or perhaps something to do with character encoding. I looked into it and was confused.

@sureshhewabi - I'm not sure what your xmllint command does?

sureshhewabi commented 3 months ago

It is a command that checks whether the file is valid against the schema definition file (XSD).

colin-combe commented 3 months ago

one problem is the empty location attribute for spectra data:

XLpeplib_Beveridge_QEx-HFX_DSS_R3.mzid, line 527672: <SpectraData location="" id="SDAT_1534307058980521776">

It is a required attribute, but an empty string is enough to make the file schema-valid (http://www.datypic.com/sc/xsd/t-xsd_anyURI.html).

But this isn't the only problem, there's something else that's still mysterious...
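
Because schema validation does not catch the empty `location`, it has to be checked separately. A minimal sketch of such a check using only the standard library is below; `empty_spectra_locations` is a hypothetical name, and the converter itself may do this differently.

```python
import io
import xml.etree.ElementTree as ET


def empty_spectra_locations(source):
    """Return the ids of SpectraData elements whose location is blank.

    `source` is a file name or a binary file object. Hypothetical helper,
    not part of xi-mzidentml-converter.
    """
    bad_ids = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Tags arrive as '{namespace}SpectraData'; compare the local name only.
        if elem.tag.rsplit("}", 1)[-1] == "SpectraData":
            if not elem.get("location", "").strip():
                bad_ids.append(elem.get("id"))
        elem.clear()  # free finished elements; .mzid files can be very large
    return bad_ids


# Small in-memory example; the namespace here is a placeholder.
sample = (
    b'<MzIdentML xmlns="urn:example"><DataCollection><Inputs>'
    b'<SpectraData location="" id="SDAT_1"/>'
    b'<SpectraData location="run1.mgf" id="SDAT_2"/>'
    b"</Inputs></DataCollection></MzIdentML>"
)
print(empty_spectra_locations(io.BytesIO(sample)))
```

Streaming with `iterparse` avoids loading the whole document, which matters when the element of interest sits several hundred thousand lines into the file.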

sureshhewabi commented 3 months ago

This means we cannot use this dataset anyway, doesn't it? We cannot find the peak list file.

colin-combe commented 3 months ago

We could manually fix the location. But also, yes, there is a problem with the parser. It is requiring some elements that are optional. (It breaks if they're not there.) I'll provide an update (will make a PR when it's fixed). I would stop testing datasets until this is fixed.

colin-combe commented 3 months ago

I think this fixes a problem: https://github.com/PRIDE-Archive/xi-mzidentml-converter/pull/64. The dataset with the empty location still won't work, but maybe some of the other ones throwing errors like that will.

sorry about that

sureshhewabi commented 3 months ago

PXD021417 Dataset Issues:

sureshhewabi commented 3 months ago

PXD026603 Dataset Issues:

parser.process_dataset - INFO - parsing AnalysisProtocolCollection- start
Error parsing GPR158-RGS7-Gb5_CONSENSUS.mzid
KeyError 'ModificationParams'
parser.process_dataset - ERROR - 'ModificationParams'

colin-combe commented 3 months ago

thanks, will check it

colin-combe commented 3 months ago

Similar to before: the parser was treating things that are optional as if they were required. Fixed by https://github.com/PRIDE-Archive/xi-mzidentml-converter/pull/65

re PXD026603 - the peaklists are missing?
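
The optional-versus-required problem behind the `KeyError 'ModificationParams'` can be illustrated with a small sketch. This is not the code from the pull request: `modification_params` is a hypothetical function, and the dict layout standing in for a parsed protocol element is an assumption made for illustration.

```python
def modification_params(protocol):
    """Return the search modifications of a protocol dict, or [] if absent.

    `protocol` stands in for a parsed SpectrumIdentificationProtocol; the
    key names mirror mzIdentML element names, but the dict shape is assumed.
    """
    # protocol["ModificationParams"] raises KeyError when the optional
    # element is missing; .get() lets the caller carry on with no modifications.
    params = protocol.get("ModificationParams") or {}
    return params.get("SearchModification", [])


# A protocol without ModificationParams no longer breaks the caller.
print(modification_params({}))
```

The same pattern applies to any element the schema marks as optional: look it up with a default and handle the empty case, rather than indexing directly.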

sureshhewabi commented 3 months ago

Yes, the peak file is missing too:

<SpectraData location="C:\Users\griffinlab.PG18844\Dropbox (Scripps Research)\Griffin Lab\fusion lumos\TSS\XLMS\20210129 GPR158 Complex\GPR158_Complex_XLMS\GPR158-RGS7-Gb5_CONSENSUS.mzML" name="MzML spectra file" id="ID_MZML_FILE_with_spectra">

but GPR158-RGS7-Gb5_CONSENSUS.mzML is not available.
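
A check like this can be automated by reducing each `location` to its file name and comparing it with the files actually present in the submission. The sketch below assumes the list of dataset file names has already been obtained somehow; `missing_peaklists` is a hypothetical helper, and case-insensitive matching is a design choice here, not something the thread prescribes.

```python
from pathlib import PureWindowsPath


def missing_peaklists(locations, available_files):
    """Return referenced peak-list file names that are not in the dataset.

    Hypothetical helper for illustration; not part of xi-mzidentml-converter.
    """
    available = {name.lower() for name in available_files}
    missing = []
    for loc in locations:
        # PureWindowsPath splits on both '\' and '/', so local Windows paths
        # and POSIX-style paths reduce to the same bare file name.
        name = PureWindowsPath(loc).name
        if name.lower() not in available:
            missing.append(name)
    return missing


# Shortened stand-in for a location recorded on the submitter's machine.
loc = r"C:\Users\lab\XLMS\GPR158-RGS7-Gb5_CONSENSUS.mzML"
print(missing_peaklists([loc], ["GPR158-RGS7-Gb5_CONSENSUS.mzid"]))
```

Only the file name is compared because the directory part of `location` usually describes the submitter's machine, not the layout of the deposited dataset.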