java.lang.NumberFormatException: For input string: "1-" on some datasets

mstambou commented 3 years ago

Hey guys,

I use MSGF+ quite often. I have mostly been using the 2014 release for my analyses, but I ran into a problem with a dataset that breaks MSGF+ the moment it starts searching spectra against the database. The raw files were obtained from the following study:

https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-019-0631-8

I downloaded the raw files from their FTP site found at:

ftp://massive.ucsd.edu/MSV000082823/raw/raw/

There are a total of 572 RAW files in this study; however, when I try to search them using MSGF+, I'm only able to run MSGF+ end to end without breaking on fewer than 10 of these samples.

An example of a raw file that breaks MSGF+ is 20121223_ICD_individual_33_TP_A177_Elite_30k_run2_01.raw, and an example of a raw file on which MSGF+ runs end to end is 20120302_HMP_ICD_58_TP_A150_Elite_30K_Run1_08.raw

I used MSConvert to convert the RAW files to mgf format using the default parameters (which is subset). I also tried searching these files with the latest release of MSGF+ (August 2020) and I run into the same problem. This is the error I get when I run it:

20121223_ICD_individual_33_TP_A177_Elite_30k_run2_01
MS-GF+ Release (v2020.07.02) (5 August 2020)
Java 1.8.0_275 (Red Hat, Inc.)
Linux (amd64, version 3.10.0-1160.6.1.el7.x86_64)
Loading database files...
Warning: Sequence database contains 8046 counts of letter 'X', which does not correspond to an amino acid.
Counting number of distinct peptides in ribP_elonF_all_cdHit_100_proteinSeqs.revCat.csarr using ribP_elonF_all_cdHit_100_proteinSeqs.revCat.cnlcp
Counting distinct peptides: 13.67% complete.
Counting distinct peptides: 27.67% complete.
Counting distinct peptides: 41.23% complete.
Counting distinct peptides: 55.22% complete.
Counting distinct peptides: 69.11% complete.
Counting distinct peptides: 83.21% complete.
Counting distinct peptides: 97.20% complete.
Loading database finished (elapsed time: 15.63 sec)
Reading spectra...
java.lang.NumberFormatException: For input string: "1-"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at java.lang.Integer.valueOf(Integer.java:766)
    at edu.ucsd.msjava.parser.MgfSpectrumParser.readSpectrum(MgfSpectrumParser.java:119)
    at edu.ucsd.msjava.msutil.SpectraIterator.next(SpectraIterator.java:54)
    at edu.ucsd.msjava.msutil.SpectraIterator.next(SpectraIterator.java:12)
    at edu.ucsd.msjava.msutil.SpecKey.getSpecKeyList(SpecKey.java:117)
    at edu.ucsd.msjava.msutil.SpecKey.getSpecKeyList(SpecKey.java:74)
    at edu.ucsd.msjava.ui.MSGFPlus.runMSGFPlus(MSGFPlus.java:266)
    at edu.ucsd.msjava.ui.MSGFPlus.runMSGFPlus(MSGFPlus.java:113)
    at edu.ucsd.msjava.ui.MSGFPlus.main(MSGFPlus.java:61)
[Error] Invalid value for parameter -i: ../../HAPiID_6k_out_MSGF_latest/HAPiID_results/20121223_ICD_individual_33_TP_A177_Elite_30k_run2_01_ribP_elonF/20121223_ICD_individual_33_TP_A177_Elite_30k_run2_01.mzid
        (file does not exist)

The authors of the original work used the tool MyriMatch to make identifications. This is an old tool and I'm not sure if it's still supported; their website is also broken and I could not download the tool. I also use MSGF+ most of the time and I want to be consistent. Your help with this matter would be appreciated. Thanks.

alchemistmatt commented 3 years ago

Why don't you convert these to mzML? I suspect the analysis would work fine. Nevertheless, I can look into the issue when using an .mgf file.

mstambou commented 3 years ago

Hi @alchemistmatt, thanks for your reply. Sure, I don't mind converting to mzML instead of mgf. What tool should I use to convert? I have not used the Linux version of msconvert yet; a few years ago it was only available on Windows, so I have been converting there all this time and then migrating the mgf files to the server. Can you point me to a reliable Linux tool that converts RAW to mzML, and the parameters that I should use? I read somewhere that MSGF+ recommends peak picking; is this true? If you don't mind, could you share where I can download a RAW to mgf/mzML converter, along with an example command line showing the parameters? Thanks a lot for your help.

FarmGeek4Life commented 3 years ago

ProteoWizard now has a Docker image that facilitates running msconvert on Linux. That is still probably the best tool for this purpose.

See https://hub.docker.com/r/chambm/pwiz-skyline-i-agree-to-the-vendor-licenses

alchemistmatt commented 3 years ago

The de facto tool for creating .mzML files is msconvert, which is part of ProteoWizard. I use this command:

msconvert.exe --32 --mzML --zlib --filter "peakPicking true 1-" *.raw

The recommended way to run msconvert on Linux is to download and use their Docker image. I briefly looked for a tutorial on the details of spinning up and using their Docker image, but I didn't find one.

mstambou commented 3 years ago

Docker!!! Unfortunately I cannot use Docker, since I work on my lab's servers and do not have root privileges. Is there a workaround for using Docker without sudo access?

FarmGeek4Life commented 3 years ago

Well, you can't install Docker without root access, but since they recently released Docker with rootless mode (on December 8th, 2020), you may be able to convince those managing the lab servers to install Docker v20.10 or newer, and only permit rootless mode use. https://docs.docker.com/engine/security/rootless/

alchemistmatt commented 3 years ago

Actually, https://docs.docker.com/engine/security/rootless/#prerequisites states: "Rootless mode does not require root privileges even during the installation of the Docker daemon, as long as the prerequisites are met."

Still, having a sysadmin install Docker is probably better.

mstambou commented 3 years ago

Sounds good, and thanks for the advice. I will reach out to the group and ask if they will do me a favor. For the time being, I don't mind converting all the files on my Windows server and then migrating them to the other one. I have just converted the raw file that I mentioned in my initial post into mzML using the subset option and ran MSGF+; it is still running and has passed the stage where it kept breaking on the mgf version of the file. I will keep this running and hopefully it will run through to the end.

Meanwhile, is it better to use the peak picking option for mzML with MSGF+ rather than subset, or does this depend on the instrumentation and the mass spectrometers used to generate the raw data? And if mzML with peak picking usually works better with MSGF+, should I just leave it at the default parameters? In the Windows version of msconvert there is a choice of MS levels for peak picking; I'm attaching a screenshot of the GUI here:

https://drive.google.com/file/d/1cetnPQ5yfGVNIkct1mX_OzGxmz2MqFoq/view?usp=sharing

alchemistmatt commented 3 years ago

MS-GF+ cannot process profile-mode MS/MS spectra; they must be centroided. That's why I use this, which centroids both MS1 and MS2 spectra:

--filter "peakPicking true 1-"

You could use the following, which only centroids MS2 spectra, but that gives a larger .mzML file when the MS1 spectra are profile-mode, since they are left uncentroided:

--filter "peakPicking true 2-"

mstambou commented 3 years ago

@alchemistmatt thanks for the answer.

So what is the equivalent of --filter "peakPicking true 1-" in the msconvert GUI, in the picture I sent in my previous comment? Should I change the value of the MS levels option? Thanks.

mstambou commented 3 years ago

I'm sorry, I just realized I needed to give you permission to view the photo at that link. Can you check again now? I just granted access. Thanks.

alchemistmatt commented 3 years ago

I tracked down the source of the bug: negative charge states in the .mgf file, for example:

TITLE=20121223_ICD_individual_33_TP_A177_Elite_30k_run2_01.1998.1998.1
RTINSECONDS=612.098
PEPMASS=463.864379882813 4476.598999023438
CHARGE=1-
315.9298096 7.2203502655
317.9240112 3.2349731922
318.9039307 16.1067180634
320.8908081 12.9888219833
END IONS

MS-GF+ does not support negative precursor ions, which these .raw files have (they also have lots of positive mode scans). I am updating MS-GF+ to ignore negative charge states for .mgf files (and warn the user of their presence).
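To check how many spectra in an .mgf file are affected before searching, something like the following could be used. It is only a minimal sketch and not the code MS-GF+ uses: the function names `parse_charge` and `negative_charge_titles` are made up for illustration, and it assumes each ion block lists TITLE before CHARGE and gives a single charge value with an optional trailing sign, as in the example above.

```python
import re
from typing import Iterable, List

# MGF CHARGE values carry the sign as a suffix, e.g. "2+" or "1-".
# Assumption: one charge per line; anything else is rejected.
CHARGE_RE = re.compile(r"^CHARGE=(\d+)([+-]?)$")


def parse_charge(line: str) -> int:
    """Return the signed charge from a CHARGE= line, e.g. 'CHARGE=1-' -> -1."""
    match = CHARGE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"Unrecognized CHARGE line: {line!r}")
    magnitude = int(match.group(1))
    return -magnitude if match.group(2) == "-" else magnitude


def negative_charge_titles(lines: Iterable[str]) -> List[str]:
    """Collect the TITLE of every ion block whose CHARGE is negative."""
    titles = []
    title = ""
    for raw in lines:
        line = raw.strip()
        if line.startswith("TITLE="):
            title = line[len("TITLE="):]
        elif line.startswith("CHARGE="):
            if parse_charge(line) < 0:
                titles.append(title)
        elif line == "END IONS":
            title = ""  # reset so a block without TITLE is not mislabeled
    return titles
```

Passing an open file handle works, since the function only iterates over lines: `negative_charge_titles(open("example.mgf"))`, where `example.mgf` is a placeholder name.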

I also need to update MS-GF+ to look for negative mode scans in .mzML files and warn about them. For example, the .mzML file for 20121223_ICD_individual_33_TP_A177_Elite_30k_run2_01.raw has

          <cvParam cvRef="MS" accession="MS:1000129" name="negative scan" value=""/>

and

                <selectedIon>
                  <cvParam cvRef="MS" accession="MS:1000744" name="selected ion m/z" value="461.867492675781" unitCvRef="MS" unitAccession="MS:1000040" unitName="m/z"/>
                  <cvParam cvRef="MS" accession="MS:1000041" name="charge state" value="1"/>
                  <cvParam cvRef="MS" accession="MS:1000042" name="peak intensity" value="4787.19856262207" unitCvRef="MS" unitAccession="MS:1000131" unitName="number of detector counts"/>
                </selectedIon>
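Until such a warning exists, negative mode scans in an .mzML file can be counted by looking for the MS:1000129 accession shown above. The following is a rough sketch using only the Python standard library; `count_negative_spectra` is a hypothetical helper, and it assumes the "negative scan" cvParam is written inside each spectrum element (as in the snippet above) rather than through a referenceable parameter group.

```python
import xml.etree.ElementTree as ET
from typing import Tuple

NEGATIVE_SCAN = "MS:1000129"  # accession for "negative scan", as shown above


def local_name(tag: str) -> str:
    """Strip the XML namespace, e.g. '{ns}spectrum' -> 'spectrum'."""
    return tag.rsplit("}", 1)[-1]


def count_negative_spectra(source) -> Tuple[int, int]:
    """Return (negative_spectra, total_spectra) for an mzML path or file object."""
    negative = 0
    total = 0
    # iterparse streams the file, so large .mzML files are not loaded at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "spectrum":
            continue
        total += 1
        if any(
            local_name(child.tag) == "cvParam"
            and child.get("accession") == NEGATIVE_SCAN
            for child in elem.iter()
        ):
            negative += 1
        elem.clear()  # release the spectrum's peak data once it is counted
    return negative, total
```

A file where the first number is a large share of the second has many negative mode scans, none of which MS-GF+ can identify.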

alchemistmatt commented 3 years ago

The screen shot of the GUI has what I suggest: vendor-based peak picking for all spectra, which is what this means:

--filter "peakPicking true 1-"

The "1-" means MS1 and higher spectra, which is "all spectra".
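To make the range notation concrete, here is a small illustration of how "1-" and "2-" select MS levels. It is not msconvert's own parser: the functions are hypothetical, and the closed forms "N-M" and "N" are included only as an assumption, since this thread only uses the open-ended form.

```python
from typing import Optional, Tuple


def parse_level_range(spec: str) -> Tuple[int, Optional[int]]:
    """Parse 'N-', 'N-M', or 'N' into (low, high); high=None means no upper bound."""
    low_text, dash, high_text = spec.strip().partition("-")
    low = int(low_text)
    if not dash:
        return low, low  # a bare number selects exactly that level
    return low, (int(high_text) if high_text else None)


def level_selected(spec: str, ms_level: int) -> bool:
    """Return True if a spectrum with this MS level falls inside the range."""
    low, high = parse_level_range(spec)
    return ms_level >= low and (high is None or ms_level <= high)
```

With this reading, `level_selected("1-", 1)` and `level_selected("1-", 2)` are both true, while `level_selected("2-", 1)` is false, which matches "2-" leaving MS1 spectra uncentroided.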

alchemistmatt commented 3 years ago

Note: you need to click "Add" to add that peak picking "filter" to the list of active filters. It will appear after the titleMaker filter; that's OK.

mstambou commented 3 years ago

Oh wow, thank you for pointing that out. I did not even realize that I wasn't adding the filter.

Concerning your previous comment about the negative charges: should I wait for the updated MSGF+, or will the mzML format with peak picking work without breaking MSGF+? It is still running on the one sample (in mzML) that kept breaking in the mgf version.

alchemistmatt commented 3 years ago

My code changes for .mzML files will not change the analysis; they will just show some warnings. Thus, the version you're using now is fine. Both it and the new version will treat negative mode spectra as positive mode and proceed to attempt to find a match, using the reported charge state as a positive charge. That attempt will fail, since MS-GF+ doesn't support negative mode spectra, but it won't lead to an error; just low scoring results.

alchemistmatt commented 3 years ago

The fact that the current .mzML file works is not surprising: these .raw files stored the MS2 spectra as centroid-mode data. Thus, peak picking isn't needed. Still, I like to use it since it won't hurt, and it will lead to smaller files if the MS1 spectra are profile (in this case, these .raw files also have centroid mode MS1 spectra).

mstambou commented 3 years ago

I see. I will definitely use peak picking from now on; as you mentioned, it will not hurt. I also don't really care whether the input is mgf or mzML, since MSGF+ will output mzid anyway, and after that I can convert it to a table format with your scripts and carry on with my analysis. Thank you very much for pointing all of this out to me today. I appreciate your time!!

mstambou commented 3 years ago

Hi @alchemistmatt, for the samples that I previously ran through MSGF+ after converting them to mgf format using the "subset" criteria in the msconvert tool: do you think the peptide identification rate will be affected by the fact that I used "subset" rather than "peak picking", and do you recommend that I re-run those samples as well? Or, for the ones where MSGF+ ran successfully without breaking, should it not matter?

alchemistmatt commented 3 years ago

The Subset option allows you to define a narrower range of scans to include in the .mzML file that MSConvert creates. Did you define a scan range or filter by charge state? If not, then you didn't really use a subset; you used all of the scans, and there is no need to re-run the previous analyses.

Examine the .mzid files (or the .tsv files created using the Mzid-To-Tsv-Converter) to confirm that you have results from a wide range of scan numbers. If you do, the existing analyses are likely fine.
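The scan-range check on a .tsv file can be sketched in a few lines. This is only an illustration: `scan_coverage` is a hypothetical helper, and the column name `ScanNum` is an assumption about the header written by the Mzid-To-Tsv-Converter, so adjust it if your file uses a different name.

```python
import csv
from typing import Iterable, Tuple


def scan_coverage(
    tsv_lines: Iterable[str], scan_column: str = "ScanNum"
) -> Tuple[int, int, int]:
    """Return (min_scan, max_scan, distinct_scans) from tab-separated results.

    Assumption: the first line is a header containing `scan_column`.
    """
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    scans = {int(row[scan_column]) for row in reader}
    if not scans:
        raise ValueError("No result rows found")
    return min(scans), max(scans), len(scans)
```

It accepts an open file handle, for example `scan_coverage(open("results.tsv", newline=""))` with `results.tsv` as a placeholder. A minimum and maximum that span most of the run suggest that all scans were searched.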

alchemistmatt commented 3 years ago

The bug parsing "CHARGE=1-" in .mgf files has been fixed via commits c8128b9839521d72dfc0fce119d11110f48cb834 and ead8c0a130d85725f23cf552781992b039894180

If you want to continue using .mgf files, download the latest release from https://github.com/MSGFPlus/msgfplus/releases

mstambou commented 3 years ago

Yeah, that makes sense. Thanks a lot, and thanks for adding that feature for the .mgf cases.

mstambou commented 3 years ago

When running MSGF+ over different datasets concurrently (i.e. in parallel) with the same reference database, would it bother or break MSGF+ if the jobs share the same database files, i.e. the same .revCat.canno, .revCat.cnlcp, .revCat.csarr, .revCat.cseq and .revCat.fasta?

Or is it better to make a copy of these files for each dataset? I'm talking about dozens to 100 independent jobs referencing these files at the same time.

FarmGeek4Life commented 3 years ago

You can run MS-GF+ in parallel using the same reference database, as long as the database mapping files you mentioned are created beforehand (either by running MS-GF+ on one of the datasets, or by using the suffix array build tool BuildSA).
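Before launching many jobs at once, a quick check that the index files already exist can prevent several jobs from trying to build them at the same time. The sketch below is an illustration only: `missing_index_files` is a hypothetical helper, the suffix list is the one given in the question above, and it assumes the index files sit next to the FASTA file and share its base name.

```python
from pathlib import Path
from typing import List

# Index files named in the question above.
INDEX_SUFFIXES = (
    ".revCat.canno",
    ".revCat.cnlcp",
    ".revCat.csarr",
    ".revCat.cseq",
    ".revCat.fasta",
)


def missing_index_files(fasta_path: str) -> List[str]:
    """Return the expected index files that do not exist yet.

    Assumption: for 'db.fasta' the index files are 'db.revCat.canno', etc.,
    in the same directory.
    """
    base = Path(fasta_path).with_suffix("")  # drop the trailing '.fasta'
    expected = [Path(str(base) + suffix) for suffix in INDEX_SUFFIXES]
    return [str(path) for path in expected if not path.is_file()]
```

If the returned list is non-empty, run MS-GF+ once on a single dataset (or run BuildSA) and start the parallel jobs only after that finishes.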

MSGFPlus / msgfplus, issue #115