OpenMS / OpenMS

The codebase of the OpenMS project
https://www.openms.de
Other
478 stars 317 forks source link

beta testing MSGFPlusAdapter #1251

Closed lars20070 closed 9 years ago

lars20070 commented 9 years ago

A simple fixed Carbamidomethyl (C) mod leads to a crash.

16:52:58 NOTICE: MSGFPlusAdapter of node # 13 started. Processing ... MS-GF+ Beta (v9979) (3/26/2014) D:\data\Lars\temp\2015-02-10_165258_ukmz1005immz_25676_1\msgfplus_mods.txt: AminoAcidSet: Illegal Location at line 5: 57.021464, C, fix, any, Carbamidomethyl Fatal error: Running MS-GF+ returned an error code. Does the MS-GF+ executable (.jar file) exist? MSGFPlusAdapter took 0.33 s (wall), 0.00 s (CPU), 0.00 s (system), 0.00 s (user).

Stortebecker commented 9 years ago

+1. I would very much like MSGF+ running correctly.

hendrikweisser commented 9 years ago

@Stortebecker: Could you replicate the issue?

Because it works for me if I modify the OpenMS test accordingly.

@lars20070: Could you try it with a newer MS-GF+ version (I use v10089)?

Stortebecker commented 9 years ago

Yes, I can reproduce the error. I use version 9979.

lars20070 commented 9 years ago

@hendrikweisser I see v10089 [1] in the change log. We installed what appears to be the latest version [2] but on the command line it states to be version 9979 from 3/26/2014.

[1] https://bix-lab.ucsd.edu/pages/viewpage.action?pageId=13533480 [2] http://proteomics.ucsd.edu/Software/MSGFPlus/MSGFPlus.zip

hendrikweisser commented 9 years ago

Try the MS-GF+ download here: http://omics.pnl.gov/software/ms-gf

lars20070 commented 9 years ago

After updating OpenMS to the latest nightly build, and MSGF+ to v10089 (thanks @hendrikweisser ), no crashes anymore.

@Stortebecker Please continue testing the MSGFPlusAdapter and compare to the MSGF+ results from your batch script. If the results agree, please feel free to close this issue.

[1] http://omics.pnl.gov/software/ms-gf

hendrikweisser commented 9 years ago

@timosachsenberg: Why does this issue have a "UTILS" label? MSGFPlusAdapter is a TOPP tool.

timosachsenberg commented 9 years ago

It should be a util I guess. At least its pretty new...

lars20070 commented 9 years ago

@timosachsenberg We are currently beta testing the adapter on a range of different datasets. If the results agree with the ones from our batch scripts, I see no reason to release the adapter as Util.

Stortebecker commented 9 years ago

Ok, I can confirm that the issue with Carbamidomethyl is gone with v10089. However, there is still a problem with .mzid output. Exactly the same pipeline works with normal (.idXML) output, but when I add a .mzid output node, it crashes:

09:43:33 NOTICE: MSGFPlusAdapter of node #4 started. Processing ...

MS-GF+ Beta (v10089) (7/16/2014) Usage: java -Xmx3500M -jar MSGFPlus.jar -s SpectrumFile (_.mzML, .mzXML, .mgf, .ms2, .pkl or dta.txt) -d DatabaseFile (.fasta or _.fa) [-o OutputFile (.mzid)](Default: [SpectrumFileName].mzid) [-t PrecursorMassTolerance](e.g. 2.5Da, 20ppm or 0.5Da,2.5Da, Default: 20ppm) Use comma to set asymmetric values. E.g. "-t 0.5Da,2.5Da" will set 0.5Da to the minus (expMass<theoMass) and 2.5Da to plus (expMass>theoMass) [-ti IsotopeErrorRange](Range of allowed isotope peak errors, Default:0,1) Takes into account of the error introduced by chooosing a non-monoisotopic peak for fragmentation. The combination of -t and -ti determins the precursor mass tolerance. E.g. "-t 20ppm -ti -1,2" tests abs(exp-calc-n_1.00335Da)<20ppm for n=-1, 0, 1, 2. [-thread NumThreads](Number of concurrent threads to be executed, Default: Number of available cores) [-tda 0/1](0: don't search decoy database %28Default%29, 1: search decoy database) [-m FragmentMethodID](0: As written in the spectrum or CID if no info %28Default%29, 1: CID, 2: ETD, 3: HCD) [-inst MS2DetectorID](0: Low-res LCQ/LTQ %28Default%29, 1: Orbitrap/FTICR, 2: TOF, 3: Q-Exactive) [-e EnzymeID](0: unspecific cleavage, 1: Trypsin %28Default%29, 2: Chymotrypsin, 3: Lys-C, 4: Lys-N, 5: glutamyl endopeptidase, 6: Arg-C, 7: Asp-N, 8: alphaLP, 9: no cleavage) [-protocol ProtocolID](0: Automatic %28Default%29, 1: Phosphorylation, 2: iTRAQ, 3: iTRAQPhospho, 4: TMT, 5: Standard) [-ntt 0/1/2](Number of Tolerable Termini, Default: 2) E.g. For trypsin, 0: non-tryptic, 1: semi-tryptic, 2: fully-tryptic peptides only. [-mod ModificationFileName](Modification file, Default: standard amino acids with fixed C+57) [-minLength MinPepLength](Minimum peptide length to consider, Default: 6) [-maxLength MaxPepLength](Maximum peptide length to consider, Default: 40) [-minCharge MinCharge](Minimum precursor charge to consider if charges are not specified in the spectrum file, Default: 2) [-maxCharge MaxCharge](Maximum precursor charge to consider if charges are not specified in the spectrum file, Default: 3) [-n NumMatchesPerSpec](Number of matches per spectrum to be reported, Default: 1) [-addFeatures 0/1](0: output basic scores only %28Default%29, 1: output additional features) Example (high-precision): java -Xmx3500M -jar MSGFPlus.jar -s test.mzXML -d IPI_human_3.79.fasta -t 20ppm -ti -1,2 -ntt 2 -tda 1 -o testMSGFPlus.mzid Example (low-precision): java -Xmx3500M -jar MSGFPlus.jar -s test.mzXML -d IPI_human_3.79.fasta -t 0.5Da,2.5Da -ntt 2 -tda 1 -o testMSGFPlus.mzid [Error] Invalid value for parameter -o: C:\Users\fh851\AppData\Local\Temp\10\2015-02-12_092652_ukmz1005immz_14576_1\TOPPAS_tmp\msgf_test\004_MSGFPlusAdapter\mzid_out\uniprot-human-Nov26-2013-reference-reviewed-canonical-without-isoforms_plus-shuffled.fasta_tmp1 (extension does not match)

MzIDToTsv v9108 (7/16/2014) Usage: java -Xmx3500M -cp MSGFPlus.jar edu.ucsd.msjava.ui.MzIDToTsv -i MzIDPath (MS-GF+ output file (.mzid) or directory containing mzid files) [-o TSVFile](TSV output file %28.tsv%29 %28Default: MzIDFileName.tsv%29) [-showQValue 0/1](0: do not show Q-values, 1: show Q-values %28Default%29) [-showDecoy 0/1](0: do not show decoy PSMs %28Default%29, 1: show decoy PSMs) [-showFormula 0/1](0: do not show molecular formula %28Default%29, 1: show molecular formula of peptides) [-unroll 0/1](0: merge shared peptides %28Default%29, 1: unroll shared peptides) [Error] Invalid value for parameter -i: C:\Users\fh851\AppData\Local\Temp\10\2015-02-12_092652_ukmz1005immz_14576_1\TOPPAS_tmp\msgf_test\004_MSGFPlusAdapter\mzid_out\uniprot-human-Nov26-2013-reference-reviewed-canonical-without-isoforms_plus-shuffled.fasta_tmp1 (file does not exist) ......\OpenMS\src\openms\source\FORMAT\HANDLERS\XMLHandler.cpp(103): While loading 'D:\data\Florian\test\msgf_test\MB_UB19_P011012_150113.mzML': unable to open primary document entity 'D:\data\Florian\test\msgf_test\MB_UB19_P011012_150113.mzML' Error: Unable to read file (- due to that error of type Parse Error in: ......\OpenMS\src\openms\source\FORMAT\HANDLERS\XMLHandler.cpp@104-)

09:43:35 ERROR: MSGFPlusAdapter failed!

hendrikweisser commented 9 years ago

@timosachsenberg:

It should be a util I guess. At least its pretty new...

No. Doesn't matter that it's new. It's a search engine adapter, and as such it's supposed to be used in pipelines with other TOPP tools, and it supports the interface requirements for that (being derived from TOPPBase). This makes it a TOPP tool.

hendrikweisser commented 9 years ago

@Stortebecker: The initial error (and probably the cause of the later ones) seems to be due to file naming in TOPPAS. The filename that is passed to MSGFPlusAdapter's "mzid_out" parameter by TOPPAS has the extension "fasta_tmp1" (derived from FASTA input file), which the converter that comes with MS-GF+ then doesn't accept as valid (it expects the extension "mzid").

Can you test whether it works on the command line, if you pass a filename ending in ".mzid" as the "mzid_out" parameter? (This is covered by a test case, so I'm pretty sure it should work.)

cbielow commented 9 years ago

we could tag it as experimental depending on how confident we are

Am 2/12/2015 um 10:42 AM schrieb hendrikweisser:

@timosachsenberg:

It should be a util I guess. At least its pretty new...

No. Doesn't matter that it's new. It's a search engine adapter, and as such it's supposed to be used in pipelines with other TOPP tools, and it supports the interface requirements for that (being derived from TOPPBase). This makes it a TOPP tool.


Reply to this email directly or view it on GitHub: https://github.com/OpenMS/OpenMS/issues/1251#issuecomment-74042905

hendrikweisser commented 9 years ago

@lars20070: Can you please rename this issue to something more general, that includes the second error reported by @Stortebecker?

timosachsenberg commented 9 years ago

Hi @hendrikweisser utils have been for quite some time been the staging area for topp tools. That's the reason why it should be a util and not a tool. It has nothing to do with its functionality or interface.

hendrikweisser commented 9 years ago

@timosachsenberg:

utils have been for quite some time been the staging area for topp tools.

Where do you get this from? That's not the explanation that I was given early on in the project (see my comment above). In addition:

So I don't agree with the idea of using UTILS as a staging ground for new TOPP tools. The distinction should be based on what the purpose of a tool is. New tools that aren't well tested can be marked as "experimental" (as Chris said), which has been common practice.

hendrikweisser commented 9 years ago

I think I can solve the second issue (TOPPAS file naming) by creating a temporary mzid file first even if "mzid_out" is set (which already happens when it isn't), feeding that to the converter, then copying it to the "mzid_out" location if applicable.

timosachsenberg commented 9 years ago

That's how Oliver defined it.

Stortebecker commented 9 years ago

@hendrikweisser Sorry, no more time today for commandline testing. Did you already solve the problem? Otherwise, I can do the testing tomorrow.

lars20070 commented 9 years ago

Works fine with two outputs -out and -mzid_out on the command line.

lars20070 commented 9 years ago

Another issue, why are .canno, .cnlap, .csarr and .cseq written right next to the *.fasta? Can that not happen in the temp folder?

hendrikweisser commented 9 years ago

@lars20070:

Works fine with two outputs -out and -mzid_out on the command line.

Thanks for testing. What about only "mzid_out"?

Another issue, why are .canno, .cnlap, .csarr and .cseq written right next to the *.fasta? Can that not happen in the temp folder?

That's done by MS-GF+ itself, so it's not under our control (I don't think there's a parameter for it). We could make a copy of the FASTA file and use that to have everything in a temporary directory, but: These files are a result of the database indexing, which has to happen once per database, but can take a significant amount of time. So it makes sense to keep the files around for the next search against that database.

timosachsenberg commented 9 years ago

I'd say we keep it as a tool as hiding it doesn help users. Just to keep in mind that util should be used as staging area for further tools.

lars20070 commented 9 years ago

Also fine with single -mzid_out output.

lars20070 commented 9 years ago

Interesting. So a faster search the second time around. Can you mention that in the docu?

And that the adapter only works with >v10089. Perhaps a link to v10089?

cbielow commented 9 years ago

One thing to keep in mind with respect to the index files which are written by MSGF+:

If you run two instances using the same FASTA file and index files do not exist yet, MSGF+ will choke because two processes are trying to write to identical index files. This should be mentioned in the doc's (at least),

or better: Within the adapter before running MSGF+: check the presence of all index files. If not present, create a lock file, call MSGF+ as you would normally and delete lock file after successfull MSGF+ execution. If on the other hand, you find a lock file: wait until it is released (do not return with exit code immediately, since TOPPAS would terminate the other instance, and the lock file would never be deleted).

Alternatively, we could create an extra Tool (MSGFDBIndexer) which you would need to run before calling MSGFAdapter. It does the same as described above. The MSGFAdapter itself would only check for files and never try to run MSGF without them.

Next solution: MSGFAdapter expects index files to be present and quits otherwise, giving you a commandline like: std::cout << "Execute: java -Xmx3500M -cp " << str_MSGFjar_with_full_path << " edu.ucsd.msjava.msdbsearch.BuildSA -d " << fasta_from_Param_with_full_path << " -tda " << TDA_from_Param << "\n"; Since this is not dependent on the working directory, the user can simply open a terminal and copy&paste the code.

Or we just document it (and expect headaches for users)...

Any better ideas?

cheers Chris

Am 2/12/2015 um 11:38 AM schrieb hendrikweisser:

@lars20070:

Works fine with two outputs -out and -mzid_out on the command line.

Thanks for testing. What about only "mzid_out"?

Another issue, why are .canno, .cnlap, .csarr and .cseq written right next to the *.fasta? Can that not happen in the temp folder?

That's done by MS-GF+ itself, so it's not under our control (I don't think there's a parameter for it). We could make a copy of the FASTA file and use that to have everything in a temporary directory, but: These files are a result of the database indexing, which has to happen once per database, but can take a significant amount of time. So it makes sense to keep the files around for the next search against that database.

--- Reply to this email directly or view it on GitHub: https://github.com/OpenMS/OpenMS/issues/1251#issuecomment-74050199

hendrikweisser commented 9 years ago

@cbielow: Good point, I also ran into this problem before. I'm extending the documentation with Lars' suggestions and will also add this.

Someone else feel free to implement the solution with the lock file.

lars20070 commented 9 years ago

I like solution one, @cbielow. But let us put this on the post-2.0-release to-do list and just document the issue for now.

hendrikweisser commented 9 years ago

I've opened a pull request: #1256

Stortebecker commented 9 years ago

Hej, I just compared .mzid outputs from MSGFPlus adapter versus MSGFPlus, both run from commandline. The output is exactly the same! Nice work! However, I could not test it out of TOPPAS, because the .mzid output still does not work in my version. I used a standard search with Carbamidomethyl at C and fixed Dimethylation at N' and K.

hendrikweisser commented 9 years ago

@Stortebecker:

However, I could not test it out of TOPPAS, because the .mzid output still does not work in my version.

What is "your version"? Is it already OpenMS after the fix, together with MS-GF+ v10089?

lars20070 commented 9 years ago

No. @hendrikweisser I simply did not update our server installation to the latest nightly build yet. Should be fine.

lars20070 commented 9 years ago

@cbielow What you described here #1251 (comment) just happened to us during testing. I filed a new issue #1359.