Integration with Sage - Githubissues

lazear commented 1 year ago

Hi,

Is there any interest in integrating additional search engines? Sage is an open source (MIT licensed) search engine that utilizes the ion-indexing approach to run fast narrow & open searches (>5x faster than MSFragger). I would be happy to assist with getting it integrated.

hbarsnes commented 1 year ago

Hi Michael,

Yes! In fact, we have already talked about the option of adding Sage!

I will try to get the time to dig a bit deeper next week and get back to you when I have gotten an overview of how easy it would be to include Sage in SearchGUI (and PeptideShaker). But as far as I remember there did not seem to be any real issues.

Best regards, Harald

hbarsnes commented 1 year ago

Hi again,

I had a closer look at the possibility of adding Sage to SearchGUI and as far as I can see there should not be any major showstoppers. I did however come across a couple of things that ought to be addressed before we can continue.

The first thing I noticed when trying to run Sage was that it seemed to lack any indication of progress? My search failed quite quickly (more on that below), but for longer searches it would be very nice to have some sort of indication of progress? Nothing fancy, just some text showing what the tool is currently doing and ideally a percent indicator. (It could of course be that such information is already displayed after the step where my test came to an end, but would still be great to have some indication as soon as the tool starts as well?)

When going through the json parameters I could not find any way of specifying the enzyme? Seems like trypsin is assumed? Would it be possible to add support for additional enzymes?

Similarly, it would be great to be able to define terminal modifications at specific amino acids? Maybe this is already possible? I did not find any examples in the json comments though?

For the decoy_prefix option, would it be possible to also support suffix annotation? As our headers generally look like this: target: >sp|Q66K14|TBC9B_HUMAN TBC1 domain family member 9B OS=Homo sapiens OX=9606 GN=TBC1D9B PE=1 SV=3 decoy: >sp|Q66K14_REVERSED|TBC9B_HUMAN TBC1 domain family member 9B OS=Homo sapiens OX=9606 GN=TBC1D9B PE=1 SV=3_REVERSED

You write that "MS2 search results will be stored as a Percolator-compatible (.sage.pin) file". Could you share an example of a results file from Sage such that we can verify whether it contains the information we need to process it in PeptideShaker?

Finally, below is the error I got when trying to run the following command line: sage -f uniprot-human-reviewed-trypsin-june-2021_concatenated_target_decoy.fasta -o output config-simple.json qExactive01819.mzML

[2022-10-26T19:20:43Z INFO  sage] generated 248225993 fragments in 44516ms
thread 'thread 'thread '<unnamed><unnamed><unnamed>' panicked at '' panicked at '' panicked at 'missing precursor information for MS2 scan, please check input files!missing precursor information for MS2 scan, please check input files!missing precursor information for MS2 scan, please check input files!', ', ', src\spectrum.rssrc\spectrum.rssrc\spectrum.rs:::220220220::1414:

I used a standard FASTA file from UniProt and the mzML file is based on a conversion from our standard example raw file. You can find all of the input files here: https://www.dropbox.com/s/9qg5dhsxdf0pqrz/Sage%20test%20files.zip?dl=0

Hopefully all of these points should not be too hard to fix/implement?

Best regards, Harald

lazear commented 1 year ago

Hi Harald,

Thanks for the comments.

A progress indicator could be added... I had one initially, but using a progress bar actually imposes significant overhead on search times. setting the SAGE_LOG=trace environment variable will increase the verbosity of outputs
Trypsin is currently assumed... Rust's regex library does not support PCRE (ie lookaround, backtracking, since these can lead to denial-of-service). I need to figure out a way to specify additional enzymes, likely hard coding them.
N-terminal modifications are currently supported, but not for specific amino acids. I think this should be straightforward
I need to update the decoy_prefix option. Right now, I am using this option to ignore decoy entries from the fasta file. Internal decoy generation is performed by reversing the internal amino acids for each peptide. This enables the picked-peptide approach described in https://www.biorxiv.org/content/10.1101/2022.05.11.491571v2 I will eventually re-enable user-provided decoy's, but this will require disabling Sage's built-in peptide FDR (I perform PSM rescoring using a linear discriminant machine learning model)
I have provided a snippet from an output file: https://gist.github.com/lazear/b52a53c720b425022554bbc490f8c1ee - columns can be added to/changed as needed.
Thanks for providing the files, I will check them out. I have been using both MSConvert & ThermoRawFileParser without issue so far.

I am definitely interested in getting integrated with PeptideShaker - more so than SearchGUI. Seems like less blocking issues too, while I figure out how/if to tackle some of the points raised above.

Best, Mike

hbarsnes commented 1 year ago

Hi Mike,

A progress indicator could be added... I had one initially, but using a progress bar actually imposes significant overhead on search times. setting the SAGE_LOG=trace environment variable will increase the verbosity of outputs

If a progress indicator takes up too much resources, maybe just adding short messages indicating which step that is currently being executed?

I need to update the decoy_prefix option. Right now, I am using this option to ignore decoy entries from the fasta file. Internal decoy generation is performed by reversing the internal amino acids for each peptide.

This is not really a showstopper from our end as we have to recalculate the FDR in PeptideShaker anyway after merging the results from all of the individual search engines. So the most important for us here is actually being able to tell Sage to treat all of the provided sequences in the same way, i.e. disregarding the decoy tags. But I guess that is what happens if simply leaving out the decoy_prefix option?

I have provided a snippet from an output file: https://gist.github.com/lazear/b52a53c720b425022554bbc490f8c1ee - columns can be added to/changed as needed.

Thanks! Seems to have most of what we need. The only thing I could spot missing after a quick check is a way to link back to the original spectrum. For this we need the name of the spectrum file and ideally the spectrum title or index. Perhaps the scan number can be used for the latter, but I would need to test.

We also prefer having more than one hit per spectrum. I see that you do not recommend this in the comments for your json example file, but if we do increase the report_psms option to a number higher than one, how would the ranks be shown in the output file?

Finally, it would be great to have a way to indicate that this file came from Sage and ideally from which version as well?

Best regards, Harald

lazear commented 1 year ago

Sounds good - just to clarify, these are the things that need to be resolved for PeptideShaker integration:

Add in option to override Sage's decoy generation/turn off PSM-rescoring
Add spectrum file name & index or scan number (I think scan number should work. I would prefer to avoid having to keep track of a spectrum title string for perf/memory usage reasons)
Add a column indicating rank of PSM for that scan
Indication of Sage version - Do you suggest just an additional column, or maybe have it encoded in the file name? I am also open to producing other output file formats

Anything else I missed?

hbarsnes commented 1 year ago

Yes, I think that should cover it. Not sure about the scan number. I guess we'll find out when trying to implement it. But maybe you can also keep the index from the mzML headers (instead of the titles/ids)?

To be honest I'm not really sure what the scan number represents, as it does not seem to be a separate tag and rather part of the spectrum id, i.e. title, tag?

Example: <spectrum id="controllerType=0 controllerNumber=1 scan=9155" index="9154" defaultArrayLength="85">

Regarding how to indicate the Sage version I'm not sure what the best option would be. Is there a way of adding comments at the start of a pin file, i.e. before the actual data starts? Or would this result in the pin file not being readable by other tools? Adding a new column may be the easiest option as it can simply be ignored by other tools. In any case, it is more important to be able to tell that the pin file came from Sage than the exact version number. At the moment there is nothing in the current file that can be used for this I think? Would it make sense to add Sage to any of the current column headers? If this information cannot easily be added, we should still be able to parse the file, we just would not be able to tell which algorithm that was used to generate it and will simply have to call it "generic pin file format" or something like that in PeptideShaker.

lazear commented 1 year ago

I pulled out the scan number because it was required for percolator/mokapot input ("ScanNr"), and is also reported by Proteome Discoverer, MSFragger, Comet, etc - this enables easy comparison of results from multiple search engines (using Pandas, etc). I do extract it from the spectrum id field - it seems like generally spectrum index is the 0-index based scan #, but I'm sure this can't be relied upon 😄

I can change the discriminant_score column to something like sage_discriminant_score

hbarsnes commented 1 year ago

Here are the definitions from the mzML specification document:

id - xs:string (pattern: \S+=\S+( \S+=\S+)*) - The native identifier for a spectrum. For unmerged native spectra or spectra from older open file formats, the format of the identifier is defined in the PSI-MS CV and referred to in the mzML header. External documents may use this identifier together with the mzML filename or accession to reference a particular spectrum.

index - xs:nonNegativeInteger - The zero-based, consecutive index of the spectrum in the SpectrumList.

To me it seems like the index ought to be included in the pin file as it is the only guaranteed way of referring back to a given spectrum in the mzML file. Actually, it would also be great to have the name of the mzML file in a separate column so that we do not have to guess this from the name of the pin file?

I can change the discriminant_score column to something like sage_discriminant_score

That would work. :)

lazear commented 1 year ago

That is a fair point - clearly, though, scan numbers can be used to uniquely refer to spectra, given that they are the preferred indexType in the USI format (https://raw.githubusercontent.com/HUPO-PSI/usi/master/Specification/USI_SpecDoc_1.0.draft10_2021-05-20.pdf). They do also note this complication with some instruments (appears to be just AB SCIEX?) not reporting scan numbers.

Currently, Sage will default to using the spectrum index + 1 as the internal "scan number" if the scan field isn't available. This doesn't seem like correct behavior, but acts as a work around if someone uses a SCIEX instrument, I suppose (which I don't think has happened yet...).

I will look into the performance cost of keeping track of the entire spectrum ID string, if it's acceptable, then I will implement that!

lazear commented 1 year ago

[X] Use spectrum title/ID instead of extracing scan number
[X] Rename discriminant_score to sage_discriminant_score
[X] Add a column indicating the rank of a PSM for a given spectrum

Here are a few lines from an example file - let me know if this works! https://gist.github.com/lazear/822a25d612960efa772db402364ffdd4

lazear commented 1 year ago

Also, regarding the above error:

Finally, below is the error I got when trying to run the following command line: sage -f uniprot-human-reviewed-trypsin-june-2021_concatenated_target_decoy.fasta -o output config-simple.json qExactive01819.mzML [2022-10-26T19:20:43Z INFO sage] generated 248225993 fragments in 44516ms thread 'thread 'thread '' panicked at '' panicked at '' panicked at 'missing precursor information for MS2 scan, please check input files!missing precursor information for MS2 scan, please check input files!missing precursor information for MS2 scan, please check input files!', ', ', src\spectrum.rssrc\spectrum.rssrc\spectrum.rs:::220220220::1414: I used a standard FASTA file from UniProt and the mzML file is based on a conversion from our standard example raw file. You can find all of the input files here: https://www.dropbox.com/s/9qg5dhsxdf0pqrz/Sage%20test%20files.zip?dl=0

This is occurring because there is no <cvParam cvRef="MS" accession="MS:1000041" name="charge state" value="3"/> term under the <selectedIon> parent term... Not sure how best to handle precursors without charge state information, I suppose search charge states 2-5 and merge the PSMs lists.

hbarsnes commented 1 year ago

Here are a few lines from an example file - let me know if this works!

Thanks! I will look at it as soon as possible. I'll keep you posted.

Not sure how best to handle precursors without charge state information, I suppose search charge states 2-5 and merge the PSMs lists.

Yes, I think that makes sense and is in line with what other search engines seem to do.

hbarsnes commented 1 year ago

Here are a few lines from an example file - let me know if this works!

Yes, it does! I've now implemented a parser for Percolator pin files in PeptideShaker and it is able to load your short example file!

I've deployed a beta version that you can test here: https://genesis.ugent.be/maven2/eu/isas/peptideshaker/PeptideShaker/2.2.18-beta/PeptideShaker-2.2.18-beta.zip

I have only tested it on the small example file yet though, so there are probably still undetected issues. Note also that the pin file parser currently only supports your specific pin file format and requires that the following column headers are used: peptide, charge, posterior_error, rank, scannr and filename (in any order). (A column named sage_discriminant_score is also needed to detect that the file was generated by Sage, but this is not mandatory.)

Please try the beta version on a complete Sage results file and let me know how it goes?

lazear commented 1 year ago

I got it to load! Couple issues/comment:

I don't see any peptides with C[+57.0214] modifications - setting it as a variable mod in PeptideShaker seems to fix this though
~~Decoys aren't loading (I think we knew this would happen though, since Sage is currently digesting/reversing in memory).~~
Sage can also ID peptides with a -1 isotope error, this doesn't seem supported
Is it possible to display a discriminant_score column or otherwise? Each engine reports some score, but I don't see it (I might just be missing it though)

N-term mods

Looks like there is a failure with N-terminally modified peptides (out of bounds) here: https://github.com/compomics/compomics-utilities/blob/0a75023c9ba7b4bb6676460b70005ec0ccee88a8/src/main/java/com/compomics/util/experiment/io/identification/idfilereaders/PercolatorInputfileReader.java#L259

I'm using ProForma suggested notation for modifications now, so will have something like:

[+229.1629]-VC[+57.0215]PAPC[+57.0215]EGAC[+57.0215]TLGIIEDPVGIK[+229.1629]


java.lang.StringIndexOutOfBoundsException: Index -1 out of bounds for length 35
    at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:55)
    at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:52)
    at java.base/jdk.internal.util.Preconditions$4.apply(Preconditions.java:213)
    at java.base/jdk.internal.util.Preconditions$4.apply(Preconditions.java:210)
    at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:98)
    at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:106)
    at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:302)
    at java.base/java.lang.String.checkIndex(String.java:4557)
    at java.base/java.lang.StringLatin1.charAt(StringLatin1.java:46)
    at java.base/java.lang.String.charAt(String.java:1515)
    at com.compomics.util.experiment.io.identification.idfilereaders.PercolatorInputfileReader.getAllSpectrumMatches(PercolatorInputfileReader.java:259)
    at eu.isas.peptideshaker.fileimport.FileImporter.importPsms(FileImporter.java:466)
    at eu.isas.peptideshaker.fileimport.FileImporter.importFiles(FileImporter.java:277)
    at eu.isas.peptideshaker.PeptideShaker.importFiles(PeptideShaker.java:219)
    at eu.isas.peptideshaker.gui.NewDialog$20.run(NewDialog.java:736)
    at java.base/java.lang.Thread.run(Thread.java:833)`
``

lazear commented 1 year ago

I've added support for PeptideShaker/SearchGUI style database, and an option to disable internal decoy generation so we are good on that front - looks absolutely great, numbers are matching what I expect, etc!

One additional bug/quirk: I noticed that when reporting multiple PSMs for the same spectrum (e.g. report_psms: 2), PeptideShaker reports about half as many passing peptides, proteins, and PSMs. (20k PSMs -> 9k PSMs) compared to reporting a single PSM per spectra. I assumed the rank column is 1 = best, 2 = second best, etc?

hbarsnes commented 1 year ago

I don't see any peptides with C[+57.0214] modifications - setting it as a variable mod in PeptideShaker seems to fix this though

By default fixed modification are not shown. You can change this in the View > Fixed Modifications menu option.

Sage can also ID peptides with a -1 isotope error, this doesn't seem supported

You would have to set this in the search settings before loading the data. You'll find it in the same dialog has where you set the PTMs. See the Isotopes option in the lower right-hand corner. The default is 0-1.

Is it possible to display a discriminant_score column or otherwise? Each engine reports some score, but I don't see it (I might just be missing it though)

The only place that you will see the search engine scores will be in the Spectrum IDs tab and only if also checking the View > Scores menu option. However, at the moment I'm using the posterior_error value and not the sage_discriminant_score. Do you want me to change this?

Looks like there is a failure with N-terminally modified peptides

Indeed, as I did not have any such PTMs in the short example file. Should not be too hard to add the required support though. I will look into it a deploy a new version.

One additional bug/quirk: I noticed that when reporting multiple PSMs for the same spectrum (e.g. report_psms: 2), PeptideShaker reports about half as many passing peptides, proteins, and PSMs. (20k PSMs -> 9k PSMs) compared to reporting a single PSM per spectra. I assumed the rank column is 1 = best, 2 = second best, etc?

I will need to look deeper into this later. But yes, rank 1 is best, rank 2 is second best etc. Note that we only retain the best one though (except for in the Spectrum IDs tab). The additional hits are mainly there for better comparison between search engines, or so that we can overrule the search engine if we don't agree with the best hit or or if more than one hit have the same score.

hbarsnes commented 1 year ago

I've just redeployed the beta version which should now also support terminal PTMs: https://genesis.ugent.be/maven2/eu/isas/peptideshaker/PeptideShaker/2.2.18-beta/PeptideShaker-2.2.18-beta.zip

lazear commented 1 year ago

For cysteine mod, the peptides aren't making it through the import filter:

Tue Nov 01 07:28:03 PDT 2022 Importing PSMs from search.pin Tue Nov 01 07:28:05 PDT 2022 1118 identified spectra (2.7%) did not present a valid peptide. Tue Nov 01 07:28:05 PDT 2022 9916 of the best scoring peptides were excluded by the import filters: Tue Nov 01 07:28:05 PDT 2022 - 11.3% peptide mapping to both target and decoy. Tue Nov 01 07:28:05 PDT 2022 - 88.7% unrecognized modifications.

I suppose PS is looking for peptides of the form "PEPCA" not "PEPC[+57.0214]A" with fixed modifications?

posterior_error

No, this should be better/more stable across runs, thanks!

I'll test out with N-term mods in a bit.

hbarsnes commented 1 year ago

I suppose PS is looking for peptides of the form "PEPCA" not "PEPC[+57.0214]A" with fixed modifications?

We're just parsing all of the modifications provided in the peptide sequences. These should later be mapped to the correct modifications, i.e. comparing to the modifications set in the search settings. But perhaps something is going wrong there. Can you share the pin file so that I can test?

No, this should be better/more stable across runs, thanks!

So stick with posterior_error?

hbarsnes commented 1 year ago

We're just parsing all of the modifications provided in the peptide sequences.

Seems like we currently treat all modifications in pin files as variable. Hence, I need to detect the fixed ones and treat them differently. I'll see what I can do and deploy a new version.

lazear commented 1 year ago

Here is a link to the PeptideShaker project zip file - can use this mzml file for additional testing too: https://www.dropbox.com/s/kp057sc4n9t92mj/b.psdb?dl=0

And yes, I should've been more clear - stick with posterior_error!

hbarsnes commented 1 year ago

Here is a link to the PeptideShaker project zip file

Seems to only contain the psdb file? Better if you could share the pin file and the search settings used (I already have the mzml file)? That way I can try to load the data myself and hopefully fix the fixed modification issue.

hbarsnes commented 1 year ago

I've just redeployed the beta version which should now also support fixed PTMs: https://genesis.ugent.be/maven2/eu/isas/peptideshaker/PeptideShaker/2.2.18-beta/PeptideShaker-2.2.18-beta.zip

lazear commented 1 year ago

Here is a link to the PeptideShaker project zip file

Seems to only contain the psdb file? Better if you could share the pin file and the search settings used (I already have the mzml file)? That way I can try to load the data myself and hopefully fix the fixed modification issue.

Whoops... copied the wrong file. Here is a link for the zipped results files/parameters: https://www.dropbox.com/s/9lyf2qoobwhe7t3/Archive.zip?dl=0

hbarsnes commented 1 year ago

Whoops... copied the wrong file. Here is a link for the zipped results files/parameters: https://www.dropbox.com/s/9lyf2qoobwhe7t3/Archive.zip?dl=0

Thanks! As far as I can tell it all seems to be working fine now? At least I'm able to load this particular file without any issues. :)

lazear commented 1 year ago

It seems to be working for me as well! 😄 I've been able to load TMT-modified, as well as unmodified datasets, and also load in Sage results with other engine results produced by SearchGUI.

Thanks for getting this done so rapidly!

I wonder if it would be better for me to produce something like a ".sage.tsv" file instead of a generic percolator PIN file, given that most search engines won't be writing a "posterior_error"/PEP column until after percolator has been run (I don't think I'm strictly adhering to the PIN format either... it's kind of morphed over time). Not a big issue either way. I think the ability to load in these simply formatted files will be beneficial for enabling comparisons with other engines as well, like Proteome Discoverer.

One more thing: I noticed that the Score column always displays "100" regardless of the underlying raw score, perhaps this is a consequence of the posterior_error value always being negative?

hbarsnes commented 1 year ago

It seems to be working for me as well! 😄 I've been able to load TMT-modified, as well as unmodified datasets, and also load in Sage results with other engine results produced by SearchGUI.

Great! Then all that remains is to also add Sage to SearchGUI? :)

I wonder if it would be better for me to produce something like a ".sage.tsv" file instead of a generic percolator PIN file

Probably, as our current parser is rather Sage-specific anyway and not really a generic pin file parser.

One more thing: I noticed that the Score column always displays "100" regardless of the underlying raw score, perhaps this is a consequence of the posterior_error value always being negative?

Not sure, could be. If you look at the validation plots in the Validation tab there seems to be a very clear separation between the good scoring target hits and the bad scoring decoy hits. At least at the protein and peptide level. The PSM level is however quite strange with only hits scoring 100 for both target and decoy. Which is very different from the PeptideShaker example dataset. Not really sure why? However, the Score/Confidence column is often, but not always, displaying 100 for the PeptideShaker example dataset as well. I think I will have to forward this question to my co-developer Marc Vaudel as this is more his field of expertise. @mvaudel Any comments?

lazear commented 1 year ago

Great! Then all that remains is to also add Sage to SearchGUI? :)

I think so - I will get started on the remaining issues. I think additional enzyme support is probably the biggest blocker?

Probably, as our current parser is rather Sage-specific anyway and not really a generic pin file parser.

I will do this then. I noticed this line of code: https://github.com/compomics/compomics-utilities/blob/d5e88a13d298d9b159b37a93ba1e04124693c730/src/main/java/com/compomics/util/experiment/io/identification/idfilereaders/PercolatorInputfileReader.java#L280

This is, I believe, incorrect behavior. There should be a check to see if charAt(i-2) is a dash (indicating a C-terminal mod... not supported yet). For instance, the peptide [+304.2071]-PEPTIDMK[+304.2071] would parse as M having the +304.2071 mod (assuming that was a variable mod). This should be distinct from [+304.2071]-PEPTIDMK[+304.2071]-[+304.2071] (C-terminal modification)

This hasn't bitten me yet, and I don't have any test files with such cases yet, but I imagine it will happen at some point.

Not sure, could be. If you look at the validation plots in the Validation tab there seems to be a very clear separation between the good scoring target hits and the bad scoring decoy hits. At least at the protein and peptide level. The PSM level is however quite strange with only hits scoring 100 for both target and decoy.

Yeah, that's exactly what I noticed as well

mvaudel commented 1 year ago

Wow this is all very impressive, both in quality and speed.

@mvaudel Any comments?

If my memory is correct (this is 10+ years old code) the score column on the GUI requires the score to distribute like a PEP and displays 100*(1-PEP), i.e. a good score should be close to zero and then the column displays 100, and a bad score should be close to 1 and then the column displays 0.

hbarsnes commented 1 year ago

I think additional enzyme support is probably the biggest blocker?

Yes, I think so. Would also be great to get the precursors without charge state information issue addressed? (As this would allow us to also search our (old) standard example dataset.)

I'm now fully booked with other things until the end of this week. I'll move on to the SearchGUI support over the weekend.

There should be a check to see if charAt(i-2) is a dash (indicating a C-terminal mod... not supported yet).

Yes, you are correct. This was indeed a bug. I made a slightly different fix that seems to work. Redeployed at the same link: https://genesis.ugent.be/maven2/eu/isas/peptideshaker/PeptideShaker/2.2.18-beta/PeptideShaker-2.2.18-beta.zip.

lazear commented 1 year ago

Great - no rush! I will let you know when initial support for enzymes & charge-less precursors is done, will probably be next week for me as well.

Re: Scoring, Sage is reporting the posterior_error (PEP) already log10-transformed, so should be an easy transform to get it back to what's expect.

lazear commented 1 year ago

Hi @hbarsnes, I'm ready for round 2 when you are! Enzyme support & charge-less searching have been merged and released.

hbarsnes commented 1 year ago

Hi @hbarsnes, I'm ready for round 2 when you are! Enzyme support & charge-less searching have been merged and released.

Great! Thanks for the update! I'm still stuck with other duties, but I have managed to make some progress on adding Sage to SearchGUI. Hopefully I'll get the time to complete the work later this week. I will keep you posted.

hbarsnes commented 1 year ago

I finally got some time to work on adding Sage to SearchGUI and already have the command line set up and running from inside SearchGUI and the results opening in PeptideShaker. Still need to add the custom parameters and set up the enzyme and modification mapping, but no major problems uncovered so far.

I did however come across a couple of minor issues that you can have a look at while I complete the integration.

It would be great if the output file could have the name of the spectrum file instead of the generic name results.sage.tsv? In other words, replacing "results" with the name of the mzml file (without the mzml extension).
It seems like only enzymes cleaving after a given set of amino acids is currently supported? However, there are also enzymes cleaving before a given set of amino acids, e.g., Asp-N or Lys-N. Would it be possible to support such enzymes as well? And how about the special cases of no cleavage (for with a peptide database as input) or unspecific cleavage (i.e. cleave everywhere)?
Printing the json parameters to the command line might be useful as a quality control when run from the command line but look more out of place when run from inside SearchGUI (as we simply display whatever the search engine sends to the standard output). Would it be possible to either not display this information or add an option to turn it off?
I found the example "^": 304.207 to add modifications for the n-term, but what about the c-term?
The enzyme annotation in your example json file in your usage section fails with the following error: "Error: trailing comma at line 10 column 3." After lots of testing I figured out that the comma after "restrict": "P", ought to be removed and instead added at the end of the next line, i.e. },. :)

lazear commented 1 year ago

I'm using the generic name because Sage will combine results from all of the files run in the same search (PSM-rescoring is performed globally). The spectrum file names are located in a column in the tsv
Should be possible to support N-term cleavages. I will look into no cleavage/K-mer cleavages as well
Can do
Will also add this to feature list
Those pesky trailing commas... thank you!

hbarsnes commented 1 year ago

I'm using the generic name because Sage will combine results from all of the files run in the same search (PSM-rescoring is performed globally). The spectrum file names are located in a column in the tsv

Ah, ok, that makes sense. Not sure how to make this fit with the other search engines though, as they all provide one output file per spectrum file (or at least that is how we use them). Basically, we run one command line per search engine per spectrum file (as the results are later merged in PeptideShaker anyway). But if we do that for Sage, I guess we would end up overwriting the result files and only be left with the results from the final spectrum file (given that the output folder would be the same). I will see what I can do, but l will first focus on getting it all to work for a single spectrum file.

hbarsnes commented 1 year ago

A first beta version of SearchGUI with Sage support is now available here: https://genesis.ugent.be/maven2/eu/isas/searchgui/SearchGUI/4.2.0-beta/SearchGUI-4.2.0-beta-windows.zip and https://genesis.ugent.be/maven2/eu/isas/searchgui/SearchGUI/4.2.0-beta/SearchGUI-4.2.0-beta-mac_and_linux.tar.gz

To open the data in PeptideShaker you will also need to update your PeptideShaker beta version: https://genesis.ugent.be/maven2/eu/isas/peptideshaker/PeptideShaker/2.2.18-beta/PeptideShaker-2.2.18-beta.zip

You will also have to make a new search parameters file in SearchGUI (will look into auto updating old parameter files later).

Note that I've not yet had the time to do much real testing, but thought I'd share these anyway so that at least you can check them out and perhaps discover a bug or two. However, from the tests I've done, it all seems to be working fine. But there is most likely still lots of finetuning required.

Note also that I have not yet found a way to easily include multiple linux and osx builds of Sage. Partly as I was not sure which versions to include? Do we really need all of the four linux builds in SearchGUI?

lazear commented 1 year ago

Woo! I will give it a try - also working on some of the additional features as well.

I think for Linux, just "sage-v0.7.1-x86_64-unknown-linux-gnu.tar.gz" - I will probably remove the pre-compiled musl builds since I don't think anyone is using them? aarch64 builds are mostly for when I am running Sage on cheaper AWS gravitron (ARM) instances - I doubt many people are running aarch64 linux with a GUI.

For Mac, I would perhaps include both versions? aarch64 is required for M1 support, x86_64 for anyone not on M1 hardware (I think? I only have an M1).

hbarsnes commented 1 year ago

I think for Linux, just "sage-v0.7.1-x86_64-unknown-linux-gnu.tar.gz"

I think that was the one I included, but in any case, I'll make sure that it is the one included in the next beta release.

For Mac, I would perhaps include both versions?

Yes, I think that would be the best option. Just need to find an easy way to figure out which one to run on a given system. Again, I'm not sure which version I included in the current beta release, but I guess you will find that out quickly enough when you try to test it.

BTW, if you want to quickly replace the Sage version included in SearchGUI you can simply replace the given sage executable found in the subfolders under SearchGUI-4.2.0-beta\resources\Sage.

Oh, and if you are not too familiar with SearchGUI, if you want to change the Sage-specific parameters, you will find these by clicking the cogwheel to the right of where you turn on/off each individual search engine. (All of the options will also later be added here: https://github.com/compomics/compomics-utilities/wiki/IdentificationParametersCLI.)

lazear commented 1 year ago

Couple quick notes on parameters:

It doesn't seem like the "decoytag" option is being set - I have it set to "_REVERSED" in the SearchGUI options, but the output log still shows "rev\" (which is the default setting).
"Parallel Search" option only has an effect when multiple mzML files are searched - no need to expose this to users if SearchGUI is going to search files one-at-a-time

hbarsnes commented 1 year ago

It doesn't seem like the "decoy_tag" option is being set - I have it set to "REVERSED" in the SearchGUI options, but the output log still shows "rev" (which is the default setting).

This seems to be an error in Sage and not SearchGUI? As I've just double checked that the json file created by SearchGUI does include the decoy tag set in the SearchGUI options. Example: sage.txt. (Renamed to .txt so that it could be uploaded to the issue tracker.)

I've also tested with generate_decoys set to both true and false, but this does not seem to make any difference either.

"Parallel Search" option only has an effect when multiple mzML files are searched - no need to expose this to users if SearchGUI is going to search files one-at-a-time

Ok, good to know. I think we'll just keep it anyway as it's already implemented. And then we'll have it available if we change SearchGUI so that it searches multiple files at the same time at a later stage.

lazear commented 1 year ago

It doesn't seem like the "decoy_tag" option is being set - I have it set to "REVERSED" in the SearchGUI options, but the output log still shows "rev" (which is the default setting).

This seems to be an error in Sage and not SearchGUI? As I've just double checked that the json file created by SearchGUI does include the decoy tag set in the SearchGUI options. Example: sage.txt. (Renamed to .txt so that it could be uploaded to the issue tracker.)

I've also tested with generate_decoys set to both true and false, but this does not seem to make any difference either.

The file you sent has "decoy_prefix" set, not "decoy_tag"! I changed the option name a while back to make it more clear that it's not limited to prefix.

I just tried running SearchGUI on OSX (was on Windows yesterday), and I can no longer open the search settings (cog) for Sage, but I can for other tools, which causes NPE

java.lang.NullPointerException: Cannot invoke "com.compomics.util.parameters.identification.tool_specific.SageParameters.getPerformLfq()" because "this.sageParameters" is null

hbarsnes commented 1 year ago

The file you sent has "decoy_prefix" set, not "decoy_tag"! I changed the option name a while back to make it more clear that it's not limited to prefix.

Aha! Nice catch! Fixed and new beta versions deployed: https://genesis.ugent.be/maven2/eu/isas/searchgui/SearchGUI/4.2.0-beta/SearchGUI-4.2.0-beta-windows.zip and https://genesis.ugent.be/maven2/eu/isas/searchgui/SearchGUI/4.2.0-beta/SearchGUI-4.2.0-beta-mac_and_linux.tar.gz

I just tried running SearchGUI on OSX (was on Windows yesterday), and I can no longer open the search settings (cog) for Sage

Does this happen when using previously created settings or also new ones created in the latest beta release? As for the moment you cannot use the old settings as I did not get the time to make this backwards compatible yet.

hbarsnes commented 1 year ago

@lazear I noticed that you had released a new version of Sage with support for more modification types. I therefore updated our code to take advantage of this and released new beta versions of SearchGUI as well. Same links as before.

lazear commented 1 year ago

Awesome, thanks. I'm busy with some other stuff this week, but I will get around to additional enzymes soon!

lazear commented 1 year ago

@hbarsnes do you have any favorite test data sets including enzymatic cleavage with N-specific enzymes or any non-specific digestion? I think I'm ready for the next phase of testing!

hbarsnes commented 1 year ago

do you have any favorite test data sets including enzymatic cleavage with N-specific enzymes or any non-specific digestion?

No, I'm afraid not. We generally just test with tryptic digestion ourselves. You can probably find some in PRIDE though?

I think I'm ready for the next phase of testing!

Great! Looking forward to testing the new Sage versions!

mknierman commented 1 year ago

Just ran across sage today. Wow it is fast. I thought it would be nice to have this in SearchGUI and I ran across this thread! I downloaded the beta and am trying it out. What can I do to help?

lazear commented 1 year ago

@mknierman thanks! I always welcome any contributions/bug reports/etc. Right now, I just need to get some free time to do some more testing on the next round of updates to Sage - hopefully in the next couple of weeks.

hbarsnes commented 1 year ago

@mknierman Thanks for the encouragement! Feel free to test the current beta version and let us know if you come across any issues. Then I will let you know when there is a new beta versions to test.

compomics / searchgui

Integration with Sage #334

N-term mods