UWPR / Comet

A tandem mass spectrometry (MS/MS) sequence database search tool.
https://uwpr.github.io/Comet/
Apache License 2.0

Invalid mzid-files #38

Closed: di-hardt closed this issue 1 year ago

di-hardt commented 1 year ago

Hey, I ran into some issues when using the mzid files generated by Comet (2023.01 rev. 2 (7c9150d)). I therefore tried to validate them with the PSI validator and got a lot of "cvc-identity-constraint.4.1: Duplicate unique value" errors. I attached a shortened version of the output.

To reproduce this:

  1. Download the RAW-file 1.1_Std_in-sol.raw from https://www.ebi.ac.uk/pride/archive/projects/PXD037650
  2. Convert it using ThermoRawFileParser (I used the Docker container as follows: docker run -i -t -v $(pwd):/data_input quay.io/biocontainers/thermorawfileparser:1.4.2--ha8f3691_0 ThermoRawFileParser.sh -d /data_input -f 2 -a)
  3. Download Homo sapiens entries from UniProt (Swiss-Prot only) as FASTA: curl -o human_swiss_prot.fasta https://rest.uniprot.org/uniprotkb/stream\?format\=fasta\&query\=%28Human%29%20AND%20%28reviewed%3Atrue%29
  4. Use Comet with the attached parameter file: ./comet.linux.exe -PPXD037650.comet.params.txt -Dhuman_swiss_prot.fasta 1.1_Std_in-sol.mzML
  5. Download the mzIdentML-Validator and use it to validate the resulting mzIdentML. I started it with java -Xms10240m -jar mzIdentMLValidator-1.4.35-SNAPSHOT.jar 2> stderr.log > stdout.log to capture the output and gave it 10 GB of memory. Validation took a while.
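
For a quicker first look than the full validator run, a rough check like the one below might help. It is only a sketch under assumptions: it treats any "id" attribute value that appears more than once on the same element type as suspicious. The thread does not say which values were actually duplicated, and the schema's uniqueness constraints may cover other attributes or scopes, so this does not replace the validator. The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def find_duplicate_ids(source):
    """Return {(element name, id value): count} for ids seen more than once.

    `source` is a file path or a binary file object. The file is streamed
    with iterparse so that large mzid files are not built up as one tree
    of fully populated elements.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        id_value = elem.get("id")
        if id_value is not None:
            counts[(local_name(elem.tag), id_value)] += 1
        # Drop the element's children, text and attributes once it is counted.
        elem.clear()
    return {key: n for key, n in counts.items() if n > 1}


def report_duplicates(source, limit=20):
    """Print up to `limit` duplicated ids, most frequent first."""
    duplicates = find_duplicate_ids(source)
    ranked = sorted(duplicates.items(), key=lambda item: -item[1])
    for (name, id_value), n in ranked[:limit]:
        print(f"{name} id={id_value!r} occurs {n} times")
    return len(duplicates)
```

Calling report_duplicates("1.1_Std_in-sol.mzid") would list the most frequent offenders, assuming that is the name of the file Comet wrote next to the mzML input.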

Best, Dirk

Attachments

jke000 commented 1 year ago

Dirk,

Thanks for reporting the issue; I definitely appreciate your detailed post! I understand where the errors originate and have addressed the issue with commit ea31c8b. Running the mzIdentML Validator on an mzid file created by a search after this fix shows that the errors are gone. The fix will show up in the next release (which is not imminent).
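
One way to compare validator runs before and after such a fix is to tally the schema error codes in the captured logs. The snippet below is a minimal sketch, assuming each message in the log carries a code of the form "cvc-..." as in the error quoted in the report. The regular expression and function names are illustrative and not part of Comet or the validator.

```python
import re
from collections import Counter

# Matches codes such as "cvc-identity-constraint.4.1" (assumed message shape).
CVC_CODE = re.compile(r"cvc-[A-Za-z0-9-]+(?:\.[A-Za-z0-9]+)*")


def tally_cvc_codes(lines):
    """Count every cvc-* error code found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        counts.update(CVC_CODE.findall(line))
    return counts


def tally_file(path):
    """Tally cvc-* codes in a log file, tolerating odd bytes in the log."""
    with open(path, encoding="utf-8", errors="replace") as log:
        return tally_cvc_codes(log)
```

Running tally_file on the stderr.log and stdout.log files from the reproduction steps (which of the two holds the messages is not stated in the thread) should give a non-zero count for cvc-identity-constraint.4.1 before the fix and no entry for it afterwards.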

di-hardt commented 1 year ago

Thank you very much for the quick response!