Open ypriverol opened 3 months ago
Datasets in PRIDE with "crosslink" or "cross-link" in the TITLE which contain mzIdentML files:
Ordered by priority:
Need to check version 1.2, the corresponding peak list and producer
as noted in meeting, they might not be complete submissions (what does the strikeout represent above? PXD018935 / PXD012759)
"crosslink" or "cross-link" word in TITLE
wasn't there a "crosslink" tag people were referring to? (I don't know but people spoke of this)
We will continue with different combinations. We will let you know when errors start to happen.
OK, great, thanks!
@sureshhewabi reported the following error in this one:
PXD014359 - Error parsing C_Lee_141014_CRM_dialysis_NCE20_2.mzid MzIdParseException ('XMLSyntaxError', ("Start tag expected, '<' not found, line 1, column 1",)) 2024-04-04 09:35:10 - main - ERROR - ('XMLSyntaxError', ("Start tag expected, '<' not found, line 1, column 1",))
I guess the error message is correct and the file is not valid XML
Another similar error for OpenMS
Error parsing XLpeplib_Beveridge_QEx-HFX_DSS_R1.mzid MzIdParseException ('XMLSyntaxError', ("Start tag expected, '<' not found, line 1, column 1",)) 2024-04-04 14:28:13 - parser.process_dataset - ERROR - ('XMLSyntaxError', ("Start tag expected, '<' not found, line 1, column 1",))
Schema seems valid in XLpeplib_Beveridge_QEx-HFX_DSS_R1.mzid file from PXD021417:
xmllint --noout --schema mzIdentML1.2.0.xsd XLpeplib_Beveridge_QEx-HFX_DSS_R1.mzid > XLpeplib_Beveridge_QEx-HFX_DSS_R1_output_file1.txt 2>&1
There should be an issue with the parser. @colin-combe any idea?
yes, could be an issue with the parser. Or perhaps something to do with character encoding. I looked into it and was confused.
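One way to narrow down a "Start tag expected, '<' not found, line 1, column 1" error is to look at the first few bytes of the file before it reaches the parser. The sketch below is only a heuristic, and I don't know which case (if any) applies to the files above. The function name `sniff_leading_bytes` and the category labels are made up for illustration and are not part of the converter.

```python
def sniff_leading_bytes(data: bytes) -> str:
    """Classify the first bytes of a supposed XML file (heuristic only)."""
    if data.startswith(b"\x1f\x8b"):
        return "gzip"          # gzip magic number: file is still compressed
    if data.startswith(b"PK\x03\x04"):
        return "zip"           # zip archive local file header
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-bom"     # UTF-8 byte order mark before the XML
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16-bom"    # UTF-16 byte order mark (either endianness)
    if data.lstrip().startswith(b"<"):
        return "xml-like"      # starts with a tag or XML declaration
    return "unknown"
```

Usage would be something like `sniff_leading_bytes(open(path, "rb").read(16))` on the downloaded .mzid file. Anything other than "xml-like" suggests the bytes handed to the parser are not plain XML text.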
@sureshhewabi - I'm not sure what your xmllint command does?
It is a command to check the file's validity against the schema definition file (xsd)
one problem is the empty location attribute for spectra data:
XLpeplib_Beveridge_QEx-HFX_DSS_R3.mzid, line 527672:
<SpectraData location="" id="SDAT_1534307058980521776">
It is a required attribute, but empty string is enough to make the file schema valid (http://www.datypic.com/sc/xsd/t-xsd_anyURI.html).
But this isn't the only problem, there's something else that's still mysterious...
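Since an empty `location` passes schema validation, it has to be checked separately. A minimal sketch of such a check, using only the standard library and streaming the file so a large .mzid does not need to fit in memory. The function name `empty_spectra_locations` is hypothetical and this is not how the converter does it.

```python
import io
import xml.etree.ElementTree as ET


def empty_spectra_locations(source):
    """Return ids of SpectraData elements whose location is empty or absent."""
    ids = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Tags look like "{namespace}SpectraData"; compare the local name only
        if elem.tag.rsplit("}", 1)[-1] == "SpectraData":
            if not elem.get("location", "").strip():
                ids.append(elem.get("id"))
        elem.clear()  # free memory for elements we are done with
    return ids


# Small in-memory demo; a file path can be passed instead of BytesIO
sample = io.BytesIO(
    b'<Inputs xmlns="urn:example">'
    b'<SpectraData location="" id="SDAT_1"/>'
    b'<SpectraData location="run1.mzML" id="SDAT_2"/>'
    b"</Inputs>"
)
print(empty_spectra_locations(sample))
```

Running this over each .mzid in a dataset before parsing would flag files like the one above up front.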
This means we cannot use this dataset anyway, doesn't it? Because we cannot find the peak list file
we could manually fix the location. But also, yes, there is a problem with the parser. It is requiring some elements that are optional. (Breaks if they're not there.) I'll provide an update (will make PR when it's fixed). I would stop testing datasets until this is fixed.
I think this fixes a problem - https://github.com/PRIDE-Archive/xi-mzidentml-converter/pull/64 - the dataset with the empty location still won't work, but maybe some of the other ones throwing errors like that will.
sorry about that
PXD021417 Dataset Issues:
<Seq> is missing
PXD026603 Dataset Issues:
parser.process_dataset - INFO - parsing AnalysisProtocolCollection- start Error parsing GPR158-RGS7-Gb5_CONSENSUS.mzid KeyError 'ModificationParams' parser.process_dataset - ERROR - 'ModificationParams'
thanks, will check it
similar to before - the parser was treating things that are optional as if they were required. Fixed by https://github.com/PRIDE-Archive/xi-mzidentml-converter/pull/65
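To illustrate the general pattern behind a `KeyError 'ModificationParams'`: indexing a parsed structure with `[...]` fails when an optional element is absent, while `.get()` with a default does not. This sketch assumes the parsed protocol is available as a plain dict with `ModificationParams` and `SearchModification` keys. That layout and the function name `modification_params` are assumptions for illustration; the actual code in the PR may look different.

```python
def modification_params(protocol: dict) -> list:
    """Return the search modifications, or an empty list if the optional element is absent."""
    # protocol["ModificationParams"] would raise KeyError when the element is missing
    params = protocol.get("ModificationParams") or {}
    return params.get("SearchModification", [])
```

With this, a protocol without `ModificationParams` simply yields no modifications instead of aborting the whole dataset.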
re PXD026603 - the peaklists are missing?
Yes, peakfile is missing too:
<SpectraData location="C:\Users\griffinlab.PG18844\Dropbox (Scripps Research)\Griffin Lab\fusion lumos\TSS\XLMS\20210129 GPR158 Complex\GPR158_Complex_XLMS\GPR158-RGS7-Gb5_CONSENSUS.mzML" name="MzML spectra file" id="ID_MZML_FILE_with_spectra">
but GPR158-RGS7-Gb5_CONSENSUS.mzML is not available
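The `location` is often a path on the submitter's machine, so only the file name can be matched against the files in the dataset. A rough sketch of that check follows. The helper names `peaklist_name` and `missing_peaklists` are hypothetical, and comparing names case-insensitively is my assumption, not something the converter is known to do.

```python
from pathlib import PureWindowsPath


def peaklist_name(location: str) -> str:
    """Extract the file name from a SpectraData location."""
    # PureWindowsPath accepts both "\\" and "/" as separators,
    # so it handles Windows and POSIX style locations alike
    return PureWindowsPath(location).name


def missing_peaklists(locations, dataset_files):
    """Return the locations whose file name is not among the dataset's files."""
    available = {name.lower() for name in dataset_files}  # assumption: ignore case
    return [loc for loc in locations if peaklist_name(loc).lower() not in available]
```

An empty `location` yields an empty file name and is therefore reported as missing too.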
We have to find out the list of datasets with the following conditions:
Please let's update the list in this issue.