PRIDE-Archive / xi-mzidentml-converter

Apache License 2.0

Kojak Dataset Validation #81

Open sureshhewabi opened 2 months ago

sureshhewabi commented 2 months ago

Validating Crosslinking data:

sureshhewabi commented 2 months ago

lxml.etree.XMLSyntaxError: Namespace prefix xsi for schemaLocation on MzIdentML is not defined, line 2, column 194
2024-09-19 11:25:40 - __main__ - ERROR - Namespace prefix xsi for schemaLocation on MzIdentML is not defined, line 2, column 194 (interact-1_2.ipro.mzid, line 2)


I will fix this by adding the missing namespace declaration xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" to the MzIdentML root element.
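A minimal sketch of the problem and the fix, using Python's standard-library parser instead of lxml, so the error text differs from the one above. The root element here is a cut-down, hypothetical stand-in for the real file header, and `add_xsi_declaration` is an illustrative helper, not part of the converter.

```python
import xml.etree.ElementTree as ET

XSI_DECL = 'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'

# Cut-down stand-in for the file's root element (attribute values are illustrative).
BROKEN = (
    '<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.2" '
    'xsi:schemaLocation="http://psidev.info/psi/pi/mzIdentML/1.2 mzIdentML1.2.0.xsd" '
    'version="1.2.0"/>'
)


def add_xsi_declaration(xml_text: str) -> str:
    """Declare the xsi prefix on the MzIdentML root element if it is missing."""
    if "xmlns:xsi=" in xml_text:
        return xml_text
    return xml_text.replace("<MzIdentML ", f"<MzIdentML {XSI_DECL} ", 1)


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as namespace-aware XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


FIXED = add_xsi_declaration(BROKEN)
print(is_well_formed(BROKEN), is_well_formed(FIXED))
```

The string replacement is deliberately naive: it only touches the first `<MzIdentML ` occurrence and leaves the rest of the file untouched.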

sureshhewabi commented 2 months ago

Traceback (most recent call last):
  File "xi-mzidentml-converter/parser/process_dataset.py", line 241, in convert_dir
    id_parser.parse()
  File "xi-mzidentml-converter/parser/MzIdParser.py", line 94, in parse
    self.main_loop()
  File "xi-mzidentml-converter/parser/MzIdParser.py", line 665, in main_loop
    spectrum = peak_list_reader[sid_result["spectrumID"]]
  File "xi-mzidentml-converter/parser/peaklistReader/PeakListWrapper.py", line 71, in __getitem__
    return self.reader[spec_id]
  File "xi-mzidentml-converter/parser/peaklistReader/PeakListWrapper.py", line 274, in __getitem__
    raise SpectrumIdFormatError(
parser.peaklistReader.PeakListWrapper.SpectrumIdFormatError: MS:1000774 not supported for mzML!

@ypriverol

colin-combe commented 2 months ago

Can you share the mzIdentML file? I think MS:1001530 is the only supported SpectrumIDFormat for mzML, but it could be open to interpretation; p.8 of the 1.2.0 schema is the relevant part. Looking at the mzIdentML file and seeing what the values of the spectrum IDs are would help.

C

colin-combe commented 2 months ago

yeah, the spectrumID attributes for SpectrumIdentificationResult elements have values like "211026EWas03_F2.34324.34324.4", so that doesn't meet the requirements for MS:1000774, which needs to be of format "index=xsd:nonNegativeInteger" (p.8 of 1.2.0 schema).

Changing the SpectrumIDFormats to MS:1001530 may make it work.
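As a quick illustration of the mismatch, here is a sketch that checks a spectrumID against the "index=xsd:nonNegativeInteger" pattern quoted above for MS:1000774. The function name is made up for this example and is not part of the converter.

```python
import re

# MS:1000774 requires IDs of the form "index=<nonNegativeInteger>".
INDEX_ID = re.compile(r"index=[0-9]+")


def matches_ms1000774(spectrum_id: str) -> bool:
    """Return True if the whole ID has the form 'index=N'."""
    return INDEX_ID.fullmatch(spectrum_id) is not None


print(matches_ms1000774("211026EWas03_F2.34324.34324.4"))  # False
print(matches_ms1000774("index=34324"))                    # True
```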

There may still be an open question about whether spectra in mzML files can be referenced using just the index. My feeling is that we've been into this before and perhaps they can't. There might be some debate about this.

colin-combe commented 2 months ago

we could try just switching the SpectrumIDFormats in lines 507902 to 507932 of the mzIdentML file (i.e. swapping them to MS:1001530)

colin-combe commented 2 months ago

Re. p.8 of the 1.2.0 schema, I think "NativeID" refers to IDs in proprietary file formats, e.g. things from Thermo/Waters/Bruker, so that's in part why MS:1000774 isn't applicable to mzML.

I think if we go into the pyteomics library we might find it doesn't allow mzML spectra to be retrieved by index only, it's something we can look into (and potentially ask pyteomics devs about) if it becomes an issue.

sureshhewabi commented 2 months ago

Thanks @colin-combe for your comments. I changed the SpectrumIDFormats to MS:1001530 as follows:

  <Inputs>
   <SearchDatabase id="sdb_0" location="/proteomics/dshteynb/data/ABRF/StudyPackage/ABRF_iPRG_XL_2023_DECOY.fasta" name="ABRF_iPRG_XL_2023_DECOY.fasta">
    <FileFormat>
     <cvParam accession="MS:1001348" cvRef="PSI-MS" name="FASTA format"/>
    </FileFormat>
    <DatabaseName>
     <userParam name="ABRF_iPRG_XL_2023_DECOY.fasta"/>
    </DatabaseName>
   </SearchDatabase>
   <SpectraData id="sd_0" location="/proteomics/dshteynb/data/ABRF/StudyPackage/Study_Data_Phase_1/211026EWas01_E1.mzML" name="211026EWas01_E1">
    <FileFormat>
     <cvParam accession="MS:1000584" cvRef="PSI-MS" name="mzML format"/>
    </FileFormat>
    <SpectrumIDFormat>
     <cvParam accession="MS:1001530" cvRef="PSI-MS" name="mzML unique identifier"/>
    </SpectrumIDFormat>
   </SpectraData>
   <SpectraData id="sd_1" location="/proteomics/dshteynb/data/ABRF/StudyPackage/Study_Data_Phase_1/211026EWas02_F1.mzML" name="211026EWas02_F1">
    <FileFormat>
     <cvParam accession="MS:1000584" cvRef="PSI-MS" name="mzML format"/>
    </FileFormat>
    <SpectrumIDFormat>
     <cvParam accession="MS:1001530" cvRef="PSI-MS" name="mzML unique identifier"/>
    </SpectrumIDFormat>
   </SpectraData>
   <SpectraData id="sd_2" location="/proteomics/dshteynb/data/ABRF/StudyPackage/Study_Data_Phase_1/211026EWas03_F2.mzML" name="211026EWas03_F2">
    <FileFormat>
     <cvParam accession="MS:1000584" cvRef="PSI-MS" name="mzML format"/>
    </FileFormat>
    <SpectrumIDFormat>
     <cvParam accession="MS:1001530" cvRef="PSI-MS" name="mzML unique identifier"/>
    </SpectrumIDFormat>
   </SpectraData>
  </Inputs>
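For checking an edit like this without scrolling through the file, a small sketch that lists the SpectrumIDFormat accession declared for each SpectraData element. It matches on local tag names so it works with or without the mzIdentML default namespace; the helper names and the shortened sample are illustrative only. For a file of this size, a streaming parser would likely be a better choice than loading the whole document.

```python
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def spectrum_id_formats(mzid_xml: str) -> dict:
    """Map each SpectraData id to the accession of its SpectrumIDFormat cvParam."""
    formats = {}
    for elem in ET.fromstring(mzid_xml).iter():
        if local(elem.tag) != "SpectraData":
            continue
        for child in elem.iter():
            if local(child.tag) == "SpectrumIDFormat":
                cv = next(iter(child), None)  # first child is the cvParam
                formats[elem.get("id")] = cv.get("accession") if cv is not None else None
    return formats


# Shortened sample based on the snippet above (locations trimmed for brevity).
SAMPLE = """<Inputs>
 <SpectraData id="sd_0" location="211026EWas01_E1.mzML">
  <SpectrumIDFormat>
   <cvParam accession="MS:1001530" cvRef="PSI-MS" name="mzML unique identifier"/>
  </SpectrumIDFormat>
 </SpectraData>
</Inputs>"""

print(spectrum_id_formats(SAMPLE))  # {'sd_0': 'MS:1001530'}
```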

Still getting errors:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "xi-mzidentml-converter/parser/process_dataset.py", line 241, in convert_dir
    id_parser.parse()
  File "xi-mzidentml-converter/parser/MzIdParser.py", line 94, in parse
    self.main_loop()
  File "xi-mzidentml-converter/parser/MzIdParser.py", line 665, in main_loop
    spectrum = peak_list_reader[sid_result["spectrumID"]]
  File "xi-mzidentml-converter/parser/peaklistReader/PeakListWrapper.py", line 71, in __getitem__
    return self.reader[spec_id]
  File "xi-mzidentml-converter/parser/peaklistReader/PeakListWrapper.py", line 251, in __getitem__
    spec = self._reader.get_by_id(spec_id)
  File ".local/share/virtualenvs/xi-mzidentml-converter-MB5dCI2Q/lib/python3.10/site-packages/pyteomics/auxiliary/file_helpers.py", line 84, in wrapped
    return func(self, *args, **kwargs)
  File ".local/share/virtualenvs/xi-mzidentml-converter-MB5dCI2Q/lib/python3.10/site-packages/pyteomics/xml.py", line 1152, in get_by_id
    elem = self._find_by_id_reset(elem_id, id_key=id_key)
  File ".local/share/virtualenvs/xi-mzidentml-converter-MB5dCI2Q/lib/python3.10/site-packages/pyteomics/auxiliary/file_helpers.py", line 84, in wrapped
    return func(self, *args, **kwargs)
  File ".local/share/virtualenvs/xi-mzidentml-converter-MB5dCI2Q/lib/python3.10/site-packages/pyteomics/xml.py", line 1119, in _find_by_id_reset
    return self._find_by_id_no_reset(elem_id, id_key=id_key)
  File ".local/share/virtualenvs/xi-mzidentml-converter-MB5dCI2Q/lib/python3.10/site-packages/pyteomics/xml.py", line 660, in _find_by_id_no_reset
    raise KeyError(elem_id)
KeyError: '211026EWas01_E1.00061.00061.2'
2024-09-19 13:25:49 - __main__ - ERROR - '211026EWas01_E1.00061.00061.2'

colin-combe commented 2 months ago

I think the converter is correct in saying the file referred to has no spectrum with ID "211026EWas01_E1.00061.00061.2".

(The file referred to is 211026EWas01_E1.mzML. To find that, search for "211026EWas01_E1.00061.00061.2" in the mzid, note the id of the associated spectra data, and look that id up in the Inputs element; the beginning of the ID they used already gives a strong clue that it will be that file.)
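The lookup described above can be sketched as follows, assuming the standard mzIdentML layout in which each SpectrumIdentificationResult carries a spectrumID and a spectraData_ref pointing at a SpectraData id. The function name and the tiny sample document (including its `sir_1` and `sil_0` ids) are hypothetical.

```python
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def files_for_spectrum_id(mzid_xml: str, spectrum_id: str) -> list:
    """Return the SpectraData locations referenced by results with this spectrumID."""
    locations = {}  # SpectraData id -> location
    refs = set()    # spectraData_ref values of matching results
    for elem in ET.fromstring(mzid_xml).iter():
        name = local(elem.tag)
        if name == "SpectraData":
            locations[elem.get("id")] = elem.get("location", "")
        elif name == "SpectrumIdentificationResult" and elem.get("spectrumID") == spectrum_id:
            refs.add(elem.get("spectraData_ref"))
    return sorted(locations[ref] for ref in refs if ref in locations)


# Hypothetical, heavily reduced document for illustration.
SAMPLE = """<MzIdentML>
 <Inputs>
  <SpectraData id="sd_0" location="211026EWas01_E1.mzML"/>
  <SpectraData id="sd_1" location="211026EWas02_F1.mzML"/>
 </Inputs>
 <SpectrumIdentificationList id="sil_0">
  <SpectrumIdentificationResult id="sir_1" spectraData_ref="sd_0"
      spectrumID="211026EWas01_E1.00061.00061.2"/>
 </SpectrumIdentificationList>
</MzIdentML>"""

print(files_for_spectrum_id(SAMPLE, "211026EWas01_E1.00061.00061.2"))
```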

sureshhewabi commented 2 months ago

@ypriverol Could you please report this issue to the Kojak dataset provider? thanks!

colin-combe commented 2 months ago

> @ypriverol Could you please report this issue to the Kojak dataset provider? thanks!

You can refer them to this GH issue; any further discussion needed can then take place here.

ypriverol commented 2 months ago

I will try.

mhoopmann commented 2 months ago

My apologies, our software here automatically interprets TPP/mzML spectrum nomenclature and ProteoWizard/mzML spectrum nomenclature (e.g., 211026EWas01_E1.00061.00061.2 == controllerType=0 controllerNumber=1 scan=61). I seem to have taken that for granted and I will fix the mzID files to use ProteoWizard/mzML spectrum nomenclature throughout.
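For reference, the equivalence in the example above can be sketched as a small conversion. This assumes the TPP-style ID is `basename.startScan.endScan.charge` with identical start and end scans, and that the target is the `controllerType=0 controllerNumber=1 scan=N` form shown in the example; other nativeID forms would need different handling. The function name is made up for illustration.

```python
def tpp_to_pwiz_native_id(tpp_id: str) -> str:
    """Convert 'basename.startScan.endScan.charge' to 'controllerType=0 controllerNumber=1 scan=N'."""
    parts = tpp_id.rsplit(".", 3)
    if len(parts) != 4:
        raise ValueError(f"not a TPP-style spectrum ID: {tpp_id!r}")
    _base, start, end, charge = parts
    for field in (start, end, charge):
        if not (field.isascii() and field.isdigit()):
            raise ValueError(f"non-numeric field in spectrum ID: {tpp_id!r}")
    if int(start) != int(end):
        raise ValueError(f"scan range spans more than one scan: {tpp_id!r}")
    # int() drops the zero padding used in the TPP form (00061 -> 61).
    return f"controllerType=0 controllerNumber=1 scan={int(start)}"


print(tpp_to_pwiz_native_id("211026EWas01_E1.00061.00061.2"))
# controllerType=0 controllerNumber=1 scan=61
```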

mhoopmann commented 2 months ago

New mzID files have been uploaded.

sureshhewabi commented 2 months ago

@colin-combe I tested the newly uploaded mzID file (which I copied for you to the same FTP location). It gives long error messages like this, which are not helpful for debugging.

[parameters: {'id_m0': 'sii_34501_1', 'upload_id_m0': 9, 'spectrum_id_m0': 'controllerType=0 controllerNumber=1 scan=50434', 'spectra_data_id_m0': 0, 'multiple_spectra_identification_id_m0': None, 'multiple_spectra_identification_pc_m0': None, 'pep1_id_m0': 2982, 'pep2_id_m0': None, 'charge_state_m0': 2, 'pass_threshold_m0': True, 'rank_m0': 1, 'scores_m0': '{}', 'exp_mz_m0': 622.773865, 'calc_mz_m0': None, 'sip_id_m0': 0, 'id_m1': 'sii_34502_1', 'upload_id_m1': 9, 'spectrum_id_m1': 'controllerType=0 controllerNumber=1 scan=50436', 'spectra_data_id_m1': 0, 'multiple_spectra_identification_id_m1': None, 'multiple_spectra_identification_pc_m1': None, 'pep1_id_m1': 9502, 'pep2_id_m1': None, 'charge_state_m1': 2, 'pass_threshold_m1': True, 'rank_m1': 1, 'scores_m1': '{}', 'exp_mz_m1': 750.88916, 'calc_mz_m1': None, 'sip_id_m1': 0, 'id_m2': 'sii_34503_1', 'upload_id_m2': 9, 'spectrum_id_m2': 'controllerType=0 controllerNumber=1 scan=50438', 'spectra_data_id_m2': 0, 'multiple_spectra_identification_id_m2': None, 'multiple_spectra_identification_pc_m2': None, 'pep1_id_m2': 4088, 'pep2_id_m2': None, 'charge_state_m2': 2, 'pass_threshold_m2': True, 'rank_m2': 1, 'scores_m2': '{}', 'exp_mz_m2': 614.776733, 'calc_mz_m2': None, 'sip_id_m2': 0, 'id_m3': 'sii_34504_1', 'upload_id_m3': 9, 'spectrum_id_m3': 'controllerType=0 controllerNumber=1 scan=50440', 'spectra_data_id_m3': 0, 'multiple_spectra_identification_id_m3': None ... 697400 parameters truncated ... 
'rank_m46496': 1, 'scores_m46496': '{}', 'exp_mz_m46496': 572.768249, 'calc_mz_m46496': None, 'sip_id_m46496': 2, 'id_m46497': 'sii_9083_1', 'upload_id_m46497': 9, 'spectrum_id_m46497': 'controllerType=0 controllerNumber=1 scan=13393', 'spectra_data_id_m46497': 2, 'multiple_spectra_identification_id_m46497': None, 'multiple_spectra_identification_pc_m46497': None, 'pep1_id_m46497': 4394, 'pep2_id_m46497': None, 'charge_state_m46497': 2, 'pass_threshold_m46497': True, 'rank_m46497': 1, 'scores_m46497': '{}', 'exp_mz_m46497': 613.256044, 'calc_mz_m46497': None, 'sip_id_m46497': 2, 'id_m46498': 'sii_9084_1', 'upload_id_m46498': 9, 'spectrum_id_m46498': 'controllerType=0 controllerNumber=1 scan=13394', 'spectra_data_id_m46498': 2, 'multiple_spectra_identification_id_m46498': None, 'multiple_spectra_identification_pc_m46498': None, 'pep1_id_m46498': 5082, 'pep2_id_m46498': None, 'charge_state_m46498': 2, 'pass_threshold_m46498': True, 'rank_m46498': 1, 'scores_m46498': '{}', 'exp_mz_m46498': 711.303225, 'calc_mz_m46498': None, 'sip_id_m46498': 2, 'id_m46499': 'sii_9085_1', 'upload_id_m46499': 9, 'spectrum_id_m46499': 'controllerType=0 controllerNumber=1 scan=13395', 'spectra_data_id_m46499': 2, 'multiple_spectra_identification_id_m46499': None, 'multiple_spectra_identification_pc_m46499': None, 'pep1_id_m46499': 6672, 'pep2_id_m46499': None, 'charge_state_m46499': 2, 'pass_threshold_m46499': True, 'rank_m46499': 1, 'scores_m46499': '{}', 'exp_mz_m46499': 548.814697, 'calc_mz_m46499': None, 'sip_id_m46499': 2}]
(Background on this error at: https://sqlalche.me/e/20/gkpj)

colin-combe commented 2 months ago

Yes, that's obviously not helpful for debugging, and this is the sort of thing that needs to be improved as we move towards a more usable mzIdentML validator.

Though that wasn't the full output from it, was it? (maybe it was)

Anyway, I'll look into it and get back to you. Whatever the error is, I'll try to update the code to make the output more meaningful in the case of that error.

sureshhewabi commented 2 months ago

The same sort of output keeps getting repeated, and it will just fill up the buffer with these kinds of JSON objects.

Thanks!

colin-combe commented 2 months ago

could also be a bug in the converter and nothing to do with validation

colin-combe commented 2 months ago

I think the file is invalid at the schema level due to duplicate ids. The same ids for SpectrumIdentificationResults and SpectrumIdentificationItems recur in the different SpectrumIdentificationLists.

The scope within which these ids are meant to be unique is perhaps open to interpretation from the text in the specification document. But I've attached the start of the output from xmllint --noout --schema mzIdentML1.2.0.xsd interact-1_2-fixed.mzid

outputfile.txt

(It's also why the converter's output was meaningless; it kinda assumes the input is schema valid.)
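A lighter-weight check than full schema validation, as a sketch: count how often each id occurs on SpectrumIdentificationResult and SpectrumIdentificationItem elements across the whole document. Treating uniqueness as document-wide per element name is an assumption about the intended scope, and the sample ids are invented.

```python
from collections import Counter
import xml.etree.ElementTree as ET

CHECKED = {"SpectrumIdentificationResult", "SpectrumIdentificationItem"}


def duplicate_ids(mzid_xml: str) -> dict:
    """Return {(element name, id): count} for ids that occur more than once."""
    counts = Counter()
    for elem in ET.fromstring(mzid_xml).iter():
        name = elem.tag.rsplit("}", 1)[-1]  # drop any '{namespace}' prefix
        if name in CHECKED:
            counts[(name, elem.get("id"))] += 1
    return {key: n for key, n in counts.items() if n > 1}


# Invented sample: the same ids recur in two SpectrumIdentificationLists.
SAMPLE = """<AnalysisData>
 <SpectrumIdentificationList id="sil_0">
  <SpectrumIdentificationResult id="sir_1">
   <SpectrumIdentificationItem id="sii_1_1"/>
  </SpectrumIdentificationResult>
 </SpectrumIdentificationList>
 <SpectrumIdentificationList id="sil_1">
  <SpectrumIdentificationResult id="sir_1">
   <SpectrumIdentificationItem id="sii_1_1"/>
  </SpectrumIdentificationResult>
 </SpectrumIdentificationList>
</AnalysisData>"""

for (name, elem_id), n in sorted(duplicate_ids(SAMPLE).items()):
    print(f"{name} id={elem_id!r} occurs {n} times")
```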

mhoopmann commented 2 months ago

Gotcha, sorry about that. Makes sense that each SpectrumIdentificationResult.id and SpectrumIdentificationItem.id should have unique values external to their SpectrumIdentificationLists, especially if this all goes into an SQL database that requires those tables to have a unique key based on id alone.

new mzID files have been uploaded (*fixedB.mzid)

colin-combe commented 2 months ago

> especially if this all goes into an SQL database that requires those tables to have a unique key based on id alone.

Yes, the sqlalchemy errors Suresh was seeing were caused by duplicate primary keys.
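To illustrate how a duplicate primary key could be reported more compactly than the parameter dump above, here is a sketch using Python's built-in sqlite3 as a stand-in for the real database. The converter uses SQLAlchemy; the table name, columns, and `insert_rows` helper below are all hypothetical.

```python
import sqlite3
from collections import Counter


def insert_rows(conn, rows):
    """Bulk-insert (id, spectrum_id) rows; return a short message on a key clash, else None."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.executemany(
                "INSERT INTO spectrum_identification (id, spectrum_id) VALUES (?, ?)",
                rows,
            )
        return None
    except sqlite3.IntegrityError as exc:
        counts = Counter(row[0] for row in rows)
        dupes = sorted(key for key, n in counts.items() if n > 1)
        # Report the database error plus a handful of offending ids, not every parameter.
        return f"{exc}; duplicate ids in batch (first 10): {dupes[:10]}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spectrum_identification (id TEXT PRIMARY KEY, spectrum_id TEXT)")

rows = [
    ("sii_1_1", "controllerType=0 controllerNumber=1 scan=50434"),
    ("sii_1_1", "controllerType=0 controllerNumber=1 scan=13393"),  # same id again
]
message = insert_rows(conn, rows)
print(message)
```

If the clash is with rows already in the table rather than within the batch, the duplicate list will be empty and only the database's own error text is reported.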

sureshhewabi commented 2 months ago

Good news! Dataset is parsed successfully. Thank you to everyone for working to make this happen.