Proteobench / ProteoBench

ProteoBench is an open and collaborative platform for community-curated benchmarks for proteomics data analysis pipelines. Our goal is to allow a continuous, easy, and controlled comparison of proteomics data analysis workflows.
https://proteobench.readthedocs.io
Apache License 2.0

integration of i2MassChroQ results in DDA-Quantification-ion-level module #229

Closed OlivierLangella closed 2 months ago

OlivierLangella commented 5 months ago

Hi, is it possible to add support for the ion quantification as exported by i2MassChroQ ?

I've run i2MassChroQ on the supplied dataset (6 files, 2 conditions) as follows: first, conversion to mzML using ThermoRawFileParser, and then identification with X!Tandem Alanine 2017.1.4, using these parameters:

| Parameter | Value |
| --- | --- |
| Maximum number of missed cleavages | 1 |
| PSM FDR | 0.009 |
| Endopeptidase | Trypsin/P |
| Fixed modifications | Carbamidomethylation (C) |
| Variable modifications | Oxidation (M), Acetyl (Protein N-term) |
| Precursor mass tolerance | 10 ppm |
| Fragment mass tolerance | 0.02 Da |
| Minimum peptide length | N.A. |

Identified peptides were quantified using MassChroQ 2.4.25. It produces 2 TSV files: one that contains the correspondence between protein ids and peptide ids, and one that contains the peptide, charge, isotope number, and area-under-the-curve quantities in each sample.

Can I send you those files somewhere ?

The difference with what is mentioned as required in the proteobench documentation is that this contains all quantified isotopes.

There is no normalization in this table, no data cleaning, no shared peptide removal...

If it is not possible, I can try to generate the custom format as described in the documentation.

Thanks for your time and help Olivier

mlocardpaulet commented 5 months ago

Thanks a lot! How do you suggest dealing with the ion redundancy due to the presence of all isotopes? Currently, when there are several quantities (or rows in the table) for one precursor ion, we calculate the sum of the signal. Would this work for these data?

OlivierLangella commented 5 months ago

You're welcome ! To deal with isotopes, I take for each peptide the theoretically most abundant one. The computed theoretical ratio is written in the last column, "niratio", of the peptides_q1_All_samples.tsv file. This gives the best results.

I think that using the sum of the signal would lead to less efficiency. In this case, to simplify parsing, it is better to take the monoisotope only : "ninumber" == 0 && "nirank" == 1
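The two isotope selection strategies described above can be sketched as follows. This is only an illustration: the columns "peptide", "ninumber", "nirank" and "niratio" are named in this thread, while "z" (charge) and "area" are hypothetical column names, and the values are made up.

```python
# Column names "peptide", "ninumber", "nirank" and "niratio" come from the thread;
# "z" (charge) and "area" are hypothetical names used only for this sketch.

def keep_monoisotope(rows):
    """Keep rows matching the filter quoted above: ninumber == 0 and nirank == 1."""
    return [r for r in rows if int(r["ninumber"]) == 0 and int(r["nirank"]) == 1]


def keep_most_abundant(rows):
    """Keep, for each (peptide, charge), the isotope with the highest niratio."""
    best = {}
    for r in rows:
        key = (r["peptide"], r["z"])
        best[key] = max(best.get(key, 0.0), float(r["niratio"]))
    return [r for r in rows if float(r["niratio"]) == best[(r["peptide"], r["z"])]]


# Made-up values, for illustration only.
example_rows = [
    {"peptide": "pep1", "z": "2", "ninumber": "0", "nirank": "1", "niratio": "0.55", "area": "100"},
    {"peptide": "pep1", "z": "2", "ninumber": "1", "nirank": "1", "niratio": "0.30", "area": "60"},
    {"peptide": "pep2", "z": "3", "ninumber": "0", "nirank": "1", "niratio": "0.25", "area": "40"},
    {"peptide": "pep2", "z": "3", "ninumber": "1", "nirank": "1", "niratio": "0.35", "area": "70"},
]
```

For short peptides the two strategies usually agree; they differ when a heavier isotope has the highest theoretical ratio, as for "pep2" in the made-up rows.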

Another issue I've seen concerns the PTMs: in my results, there is sequence redundancy if the same peptide was quantified at different MH+. I've seen an issue about it (#144). Did you find a solution ?

Perhaps, I can code something to get a tailored output for proteobench inside i2MassChroQ ? Thanks Olivier

OlivierLangella commented 5 months ago

For the PTM issue : the column "peptide" is a unique identifier for the sequence+PTM mass

mlocardpaulet commented 5 months ago

Right now, for the module DDA quantification - precursor ions, we use one row per precursor ion. And a precursor ion is sequence + localised modification(s) [ideally in ProForma format] + charge. So it is perfect for your output. Regarding the isotope selection, you know best. It just has to be clearly described in the documentation, where we can include a paragraph on MassChroQ outputs :).
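As an illustration of "one row per precursor ion", here is a minimal sketch that sums the signal of rows sharing the same raw file, modified sequence and charge. The field names "raw_file", "proforma", "charge" and "intensity" are hypothetical, and this is not the ProteoBench implementation.

```python
from collections import defaultdict


def sum_per_precursor(rows):
    """Sum intensities of rows that describe the same precursor ion in the same run.

    A precursor ion is identified here by (modified sequence, charge);
    all field names are assumptions made for this sketch.
    """
    totals = defaultdict(float)
    for r in rows:
        key = (r["raw_file"], r["proforma"], int(r["charge"]))
        totals[key] += float(r["intensity"])
    return dict(totals)


# Made-up rows: the first two describe the same ion in the same run.
example_ions = [
    {"raw_file": "run1", "proforma": "PEPTIDEK", "charge": "2", "intensity": "10.0"},
    {"raw_file": "run1", "proforma": "PEPTIDEK", "charge": "2", "intensity": "5.0"},
    {"raw_file": "run1", "proforma": "PEPTIDEK", "charge": "3", "intensity": "2.0"},
]
```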

One other thing that could be complicated: the parameter file. The way I see it, it should be relatively easy to have your data compatible with ProteoBench for local visualization (addition of your point to the plot). But for public submission (if the user wants the point to be visible by all), we will require upload of some parameter files that contain parameters of interest, which would include search parameters (you can see some when you hover over the plot). Here is an example from one of the current public points:

    "software_name":"MaxQuant",
    "software_version":"1.5.3.30",
    "search_engine_version":null,
    "search_engine":"Andromeda",
    "ident_fdr_psm":null,
    "ident_fdr_peptide":0.01,
    "ident_fdr_protein":0.01,
    "enable_match_between_runs":"false",
    "precursor_mass_tolerance":"4.5 ppm",
    "precursor_mass_tolerance_unit":null,
    "fragment_mass_tolerance":"20 ppm",
    "fragment_mass_tolerance_unit":null,
    "enzyme":"Trypsin\/P",
    "allowed_miscleavages":1,
    "min_peptide_length":7,
    "max_peptide_length":null,
    "fixed_mods":"Carbamidomethyl (C)",
    "variable_mods":"Oxidation (M),Acetyl (Protein N-term)",
    "max_mods":5.0,
    "min_precursor_charge":null,
    "max_precursor_charge":7.0

Is there an output from your tool from which we could retrieve this information? It can be several files.

mlocardpaulet commented 5 months ago

OK, one thing I forgot: we need to match the protein identifiers to the peptide ions. It is easy but right now we upload only one file. So maybe we could discuss having a ProteoBench-compatible output from i2MassChroQ. If you don't mind. Let's discuss it in a meeting.

OlivierLangella commented 5 months ago

Right, we'll see it tomorrow. I've added a ProForma column in the results : this was already available in MassChroQ but not used in output files. To generate only one file, I've made a specific output for MSstats. It uses a simple R script that merges tables and automatically selects the most abundant isotope, so this can be used to build a proteobench-compatible output.

Thanks !

mlocardpaulet commented 5 months ago

Hi all, here is a quick recap of the meeting:

  1. In the end, the direct output of MassChroQ may not be the best input for ProteoBench because there is no normalisation or post-processing, so we should consider using the output after post-processing with MSstats or MCQR (both integrated in i2MassChroQ; @OlivierLangella, correct me if I am wrong).
  2. @OlivierLangella will run his tool on ProteoBench data and generate a tab-delimited output that contains one row per quantified ion with:
    • sequence
    • modified sequence (ProForma)
    • raw file (if the raw file name is modified we will include it in the .toml)
    • charge
    • proteins (containing "SPECIES" and separated with ";" if it is a protein group)
  3. @OlivierLangella will also provide a parameter file that contains all the parameters that he thinks would be interesting. @Henry will look into how to parse it for public submission.

I think that I did not forget anything; please let us know if I did. If we have all this, it should not be much work to have it implemented soon :).

One question that we have for @RobbinBouwmeester: the output will be in long format, but I don't know where we should put this information. I think that currently we have long formats as well as wide formats (or am I wrong?), but it is not indicated in the .toml. Shall we add this information somewhere?
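The column list from the recap can be turned into a small check on a tab-delimited export. This is a sketch under assumptions: the header names "sequence", "proforma", "rawfile", "charge" and "proteins" are hypothetical (the recap only lists the kind of information, not the exact headers), and the protein ids in the example are made up.

```python
import csv
import io

# Hypothetical header names for the fields listed in the recap.
REQUIRED_COLUMNS = ("sequence", "proforma", "rawfile", "charge", "proteins")


def parse_export(text):
    """Parse a tab-delimited export, one row per quantified ion."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    rows = []
    for row in reader:
        row["charge"] = int(row["charge"])
        # Protein groups are separated with ";" as described in the recap.
        row["proteins"] = row["proteins"].split(";")
        rows.append(row)
    return rows


example_export = (
    "sequence\tproforma\trawfile\tcharge\tproteins\n"
    "PEPTIDEK\tPEPTIDEK\trun1.mzML\t2\tPROT1_HUMAN;PROT2_HUMAN\n"
)
```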

OlivierLangella commented 5 months ago

Thank you very much @mlocardpaulet ! So I have now a "ProteoBench" export button that produces the data file as mentioned : raw file with mzML extension, sequence, ProForma, charge, proteins (species and ; for groups). Normalisation is made using the MCQR package after the raw quantification made by MassChroQ.

The first "proteobench_export.tsv" is available in https://cloud.cmb.ugent.be/index.php/s/zdGB3zZ7Fwed9gq?path=%2FModule_2_DDA_quantification%2Fsearch_results%2Fi2MassChroQ%2Fresult_proteobench_2pep_fdr09.d

Important parameters are :

| Parameter | Value |
| --- | --- |
| Identification engine | X!Tandem 2017.2.1.4 |
| Protein inference | i2MassChroQ 1.0.6 |
| Post processing | MCQR 0.6.11 |
| Maximum number of missed cleavages | 1 |
| PSM FDR | 0.009 |
| Endopeptidase | Trypsin/P |
| Fixed modifications | Carbamidomethylation (C) |
| Variable modifications | Oxidation (M), Acetyl (Protein N-term) |
| Precursor mass tolerance | 10 ppm |
| Fragment mass tolerance | 0.02 Da |
| Minimum peptide length | N.A. |
| Normalisation | median.RT |

I can produce a file with those parameters, it is also possible to produce a single ODS file combining the parameters and peptide quantifications.

how about the .toml ?

Thank you very much Olivier

mlocardpaulet commented 4 months ago

@OlivierLangella: what field should we use for the quantification? I suspect areanorm?

OlivierLangella commented 4 months ago

Sorry, yes that is "areanorm" Thanks ! Olivier
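With "areanorm" as the quantity column, a long-format table can be pivoted to one entry per ion and one value per raw file. A minimal sketch: only "areanorm" is named in this thread, the other field names ("proforma", "charge", "rawfile") are hypothetical, and the values are made up.

```python
def long_to_wide(rows, value_field="areanorm"):
    """Pivot long-format rows into {(proforma, charge): {rawfile: quantity}}."""
    wide = {}
    for r in rows:
        ion = (r["proforma"], int(r["charge"]))
        wide.setdefault(ion, {})[r["rawfile"]] = float(r[value_field])
    return wide


# Made-up long-format rows: the same ion quantified in two runs.
example_long = [
    {"proforma": "PEPTIDEK", "charge": "2", "rawfile": "A_01.mzML", "areanorm": "1.5"},
    {"proforma": "PEPTIDEK", "charge": "2", "rawfile": "B_01.mzML", "areanorm": "2.5"},
]
```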

mlocardpaulet commented 4 months ago

OK, thanks a lot. I'll have a look at the .toml and the input of the ions table (probably not today). @enryH could you have a look at the parameter parsing? Or give me pointers?

enryH commented 4 months ago

Yes, sure. It is simple to read in a TOML file as a dictionary, so this will be straightforward. I'll make a proposal and then you can comment on my assignments based on the information above ;)

RobbinBouwmeester commented 4 months ago

So the parsing and plotting is done! Now we only need the parser for the search engine parameters (@enryH)

mlocardpaulet commented 4 months ago

Hello @OlivierLangella, we have made your outputs compatible with ProteoBench. We noticed that the modified peptides are duplicated in the column "ProForma" (here is an example: VPDAVGKC[MOD:00397]R;VPDAVGKC[MOD:00397]R;VPDAVGKC[MOD:00397]R). Could you please remove this redundancy? We could do it, but I think that it would be cleaner if you do it on your side. What do you think?
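For reference, removing this kind of redundancy on the parsing side could look like the sketch below. The function name is hypothetical, and it assumes that ";" only ever separates repeated entries in that column.

```python
def dedupe_proforma(value, sep=";"):
    """Remove repeated entries from a separator-joined string, keeping first-seen order."""
    unique = dict.fromkeys(item for item in value.split(sep) if item)
    return sep.join(unique)
```

Applied to the example above, the three copies of VPDAVGKC[MOD:00397]R collapse into a single entry.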

mlocardpaulet commented 4 months ago

Regarding the parameters, @OlivierLangella, I don't see any information regarding match between runs. Maybe you could add some? At least a TRUE value if match between runs is used.

mlocardpaulet commented 4 months ago

Also, @enryH, I did not create any tests for this format. I am happy to do it but wouldn't mind a short tour of what to modify (I have never really been involved in the tests yet).

enryH commented 4 months ago

Is the parameter file already uploaded somewhere?

mlocardpaulet commented 4 months ago

> Is the parameter file already uploaded somewhere?

Actually no. I don't think that we have one. @OlivierLangella could you send us a parameter file?

mlocardpaulet commented 4 months ago

Last point: I have not written the documentation on how to use i2MassChroQ to get compatible export yet, but I would be happy to discuss it with @OlivierLangella when you have some time.

mlocardpaulet commented 4 months ago

I have this issue now that it is on the server, and it seems to be only with i2MassChroQ output. It seems to be connected to the toml file? Maybe @julianu would have an idea of where the problem comes from? It does not crash when I run the main locally. [image] And it does not add the point.

mlocardpaulet commented 4 months ago

@OlivierLangella here is the plot (plotted locally - green point). Numbers are really high, it is filtered at 1% FDR, right? [image]

OlivierLangella commented 4 months ago

Hello Marie, sorry for the late response, I'm a bit busy and I'll be on vacation next week. Yes, this has been filtered using a PSM FDR threshold of 0.8998%, protein FDR of 0.9873%, peptide FDR 1.1963%.

The formula used to compute FDR is #decoy / #target .
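Written out, the formula quoted above is a plain ratio of counts. A minimal sketch with a hypothetical function name; the counts in the usage note are made up and are not the ones from this dataset.

```python
def decoy_target_fdr(n_decoy, n_target):
    """FDR estimate computed as #decoy / #target, as stated in the thread."""
    if n_target <= 0:
        raise ValueError("n_target must be positive")
    return n_decoy / n_target
```

For example, 9 decoy hits among 1000 target hits give an estimate of 0.009, i.e. 0.9%.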

I've fixed the ProForma redundancy problem (uploaded in nextcloud).

mlocardpaulet commented 4 months ago

OK, then we are good to go for local visualization. It is broken in the current version on the server, but it will be fixed in the next and it is corrected in the main. You can already try it locally. For the parameters, @enryH I suspect that you can find the file MassChroQ informations - q1 on the cloud. Maybe we'll need to check if all the information that we want are there.

mlocardpaulet commented 4 months ago

> Hello Marie, sorry for the late response, I'm a bit busy and I'll be on vacation next week. Yes, this has been filtered using a PSM FDR threshold of 0.8998%, protein FDR of 0.9873%, peptide FDR 1.1963%.
>
> The formula used to compute FDR is #decoy / #target .
>
> I've fixed the ProForma redundancy problem (uploaded in nextcloud).

This is totally fine, no hurry. One question regarding FDR: does protein-level validation impact the set of ions that are reported in the ProteoBench output that we have? Are the PSMs from non-validated proteins removed? If it is the case, we may also want to keep track of all the FDRs in your parameter file. The parameter files should contain as many parameters as possible. We can then choose which ones we want to indicate next to the plot in ProteoBench, but for public submission, it would be great to have also a parameter file that contains everything and that we can go back to if needed to understand differences between pipelines.

enryH commented 4 months ago

> OK, then we are good to go for local visualization. It is broken in the current version on the server, but it will be fixed in the next and it is corrected in the main. You can already try it locally. For the parameters, @enryH I suspect that you can find the file MassChroQ informations - q1 on the cloud. Maybe we'll need to check if all the information that we want are there.

Nope. The file contains the following information.

MassChroQ version   2.4.25

Alignment parameters
alignment group id  All_samples
alignment group reference XML id    LFQ_Orbitrap_DDA_Condition_A_Sample_Alpha_02
alignment group reference msrun filename    /gorgone/pappso/moulon/raw/20240131_proteobench/LFQ_Orbitrap_DDA_Condition_A_Sample_Alpha_02.mzML
MS1 smoothing half window   0
MS2 smoothing half window   15
MS2 tendency half window    10

XIC parameters
extraction range lower limit    10 ppm
extraction range upper limit    10 ppm
natural isotope minimum abundance   0.8
matching mode   post_matching

Filters
antiSpike|5
Detection method
detection Zivy
smoothing half edge window  1
maxmin half edge window 3
minmax half edge window 4
detection threshold on maxmin   3000
detection threshold on minmax   5000

And that's different from what @OlivierLangella reported previously here:

> Important parameters are :
>
> | Parameter | Value |
> | --- | --- |
> | Identification engine | X!Tandem 2017.2.1.4 |
> | Protein inference | i2MassChroQ 1.0.6 |
> | Post processing | MCQR 0.6.11 |
> | Maximum number of missed cleavages | 1 |
> | PSM FDR | 0.009 |
> | Endopeptidase | Trypsin/P |
> | Fixed modifications | Carbamidomethylation (C) |
> | Variable modifications | Oxidation (M), Acetyl (Protein N-term) |
> | Precursor mass tolerance | 10 ppm |
> | Fragment mass tolerance | 0.02 Da |
> | Minimum peptide length | N.A. |
> | Normalisation | median.RT |

OlivierLangella commented 4 months ago

Thanks a lot @mlocardpaulet and @enryH !

You're totally right concerning the need to have as much information as possible about how inference was done, thresholds, etc. Indeed, the information file already uploaded in nextcloud contains all the needed information regarding the quantification, but it lacks information about protein inference and validation.

At this time, this information is displayed in the GUI and in the tables exported from i2MassChroQ, but my goal is to produce a single parameter file containing everything needed, exported automatically from the "proteobench button". [image]

> One question regarding FDR: does protein-level validation impact the set of ions that are reported in the ProteoBench output that we have? Are the PSMs from non-validated proteins removed?

Only validated proteins and PSMs are processed for quantification. PSMs from non-validated proteins are removed, unless they are shared by another validated protein AND they are also valid at the PSM level.
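One possible reading of this rule, written as a predicate: a PSM is kept when it is valid at the PSM level and maps to at least one validated protein. This is an interpretation of the sentence above, not code from i2MassChroQ, and the function and argument names are hypothetical.

```python
def keep_psm(psm_is_valid, protein_is_validated):
    """Decide whether a PSM goes to quantification.

    psm_is_valid: True if the PSM passes PSM-level validation.
    protein_is_validated: one boolean per protein the PSM maps to.
    """
    return bool(psm_is_valid) and any(protein_is_validated)
```

Under this reading, a valid PSM shared between a non-validated and a validated protein is kept, while a valid PSM that maps only to non-validated proteins is dropped.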

I've added to the nextcloud directory the full list of tandem presets used for identification and the full exported tables from i2MassChroQ ("proteobench_2pep_fdr09" directory), containing parameters for validation (i2MassChroQ information.tsv).

There is also the XML project file "proteobench_2pep_fdr09.xpip" that you can use to reload all project in i2MassChroQ.

One last thing I forgot to mention, yes MBR is used for quantification.

Thanks for this really useful effort to produce comparable results. I have a lot to do to reach the minimum required; I'll do that as soon as possible.

OlivierLangella commented 4 months ago

Additionally, the documentation of i2MassChroQ is available here

OlivierLangella commented 3 months ago

Good afternoon @enryH @mlocardpaulet , sorry for the long delay ;)

So, I've implemented the "proteobench export" button in i2MassChroQ : it produces an OpenDocument Spreadsheet file containing 2 datasheets. The first datasheet contains all the parameters used for the whole workflow (identification parameters, software names and versions, inference parameters, quantification ...). The second datasheet is the proteobench tabulated format as defined previously.

You will find the first draft on nextcloud under the i2MassChroQ repository, named "i2mproteobench_2pep_fdr01psm_fdr01prot.ods".

Is this format ok for you ? If there are too many parameters, we can select the most relevant ones or rename them if needed.

best wishes Olivier

enryH commented 3 months ago

I can open the ods file using Excel, but it takes several minutes (~4 mins) to read it using odfpy (through the pandas interface) on Windows. Can you read in your file using pandas faster?

https://pypi.org/project/odfpy/

Otherwise a plain small csv file with the parameter sheet as an export would be maybe even simpler?

Best, Henry

OlivierLangella commented 3 months ago

Oh yes, that's a problem. I've seen odfpy and I thought it would be ok on Windows. Bringing together the parameters and the quantification data would have been simpler for the submission of a single file.

But in this case, I can produce a zip file containing both files in tsv sheets ?
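Reading such an archive only needs the standard library. A minimal sketch, assuming the zip contains plain UTF-8 TSV files; the member names "params.tsv" and "ions.tsv" and their contents are hypothetical and only serve to build a demo archive in memory.

```python
import csv
import io
import zipfile


def read_tsvs_from_zip(zip_bytes):
    """Return {member name: list of rows} for every .tsv file in a zip archive."""
    tables = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".tsv"):
                text = archive.read(name).decode("utf-8")
                tables[name] = list(csv.reader(io.StringIO(text), delimiter="\t"))
    return tables


def build_demo_zip():
    """Build a tiny in-memory archive with two made-up TSV members."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("params.tsv", "psm_fdr\t0.01\n")
        archive.writestr("ions.tsv", "proforma\tcharge\nPEPTIDEK\t2\n")
    return buffer.getvalue()
```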

Best, Olivier

enryH commented 3 months ago

Yes, that would be best. Currently users have to submit the files separately to the GUI anyway.

  params = ProteoBenchParameters(
      software_name="i2MassChroQ",
      software_version=params.loc["i2MassChroQ_VERSION"],
      search_engine=params.loc["AnalysisSoftware_name"],
      search_engine_version=params.loc["AnalysisSoftware_version"],
      ident_fdr_psm=params.loc["psm_fdr"],
      ident_fdr_peptide=params.loc["peptide_fdr"],
      ident_fdr_protein=params.loc["protein_fdr"],
      enable_match_between_runs=params.loc["mcq_mbr"],
      precursor_mass_tolerance=_tol_prec,
      fragment_mass_tolerance=_tol_frag,
      enzyme=None,
      allowed_miscleavages=params.loc["refine, maximum missed cleavage sites"],
      min_peptide_length=None,
      max_peptide_length=None,
      fixed_mods=None,
      variable_mods=None,
      max_mods=None,
      min_precursor_charge=None,
      max_precursor_charge=params.loc["spectrum, maximum parent charge"],
  )

Do you record

Minimum precursor charge is always 1?

Best, Henry

OlivierLangella commented 3 months ago

Good question, all the parameters used by X!Tandem are reported, but there is not always a direct mapping to the parameters required by proteobench.

The other parameters you've chosen seem OK to me. Best Olivier

enryH commented 3 months ago

I implemented everything. Only major point is that we do not have the minimum and maximum peptide length. I guess the minimum can be inferred knowing the possible AAs, but for now I would leave it out.

In the example you provided, the maximum peptide length was 38? Or is this a software-specific general statement?

See the current parsed example here

OlivierLangella commented 2 months ago

Thank you very much @enryH ! This is almost done. For the minimum and maximum peptide length, X!Tandem is a bit specific. In the example, I observed a maximum peptide length of 38, but this is not a software-specific general statement. In fact, there is no limit for the peptide modeling engine. Perhaps we could just leave "None" as the value ? That would reflect the fact that there is no constraint.

Another important thing is that X!Tandem can be used with a second-stage database search. This is called "refinement". The second stage only considers proteins for which a peptide has been found within an E-value threshold ("refine, maximum valid expectation value"). So, if it is enabled by setting "refine" to "yes", then the second stage has its own parameters, such as "refine, maximum missed cleavage sites"... I think it is important for reproducibility to know that refinement is used and what the parameters were for the first stage and the second stage. For "allowed_miscleavages", it is not very simple because it is possible to look for missed cleavages at the first stage with "scoring, maximum missed cleavage sites" and refine this with "refine, maximum missed cleavage sites".

How can we translate it in proteobench ? For a simple solution, at the price of losing information, "allowed_miscleavages" can be set to "scoring, maximum missed cleavage sites" by default and, if refinement is set to "yes", then set to "refine, maximum missed cleavage sites" ?
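The proposed rule can be sketched as a small function over a dictionary of X!Tandem parameters. The three parameter keys are the ones quoted above; the function name and the example values are hypothetical, and this is not the code that ended up in ProteoBench.

```python
def pick_allowed_miscleavages(tandem_params):
    """Pick the missed-cleavage setting following the rule proposed in the thread.

    Uses the first-stage value by default and the refinement value
    when "refine" is set to "yes".
    """
    refine_on = str(tandem_params.get("refine", "no")).strip().lower() == "yes"
    if refine_on:
        key = "refine, maximum missed cleavage sites"
    else:
        key = "scoring, maximum missed cleavage sites"
    return int(tandem_params[key])


# Made-up parameter values, for illustration only.
example_tandem_params = {
    "refine": "yes",
    "scoring, maximum missed cleavage sites": "1",
    "refine, maximum missed cleavage sites": "2",
}
```

As noted above, this loses information: when refinement is on, the first-stage value is no longer visible in "allowed_miscleavages".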

Thanks again Olivier

enryH commented 2 months ago

> How can we translate it in proteobench ? For a simple solution, at the price of losing information, "allowed_miscleavages" can be set to "scoring, maximum missed cleavage sites" by default and, if refinement is set to "yes", then set to "refine, maximum missed cleavage sites" ?

That sounds reasonable!

enryH commented 2 months ago

I updated the code to pick up the options for allowed_miscleavages as discussed. Regarding the use of tsv instead of odf files: is there already an option to export the text-based files?

OlivierLangella commented 2 months ago

Thank you very much again ! The zip file generation is on the wire https://forgemia.inra.fr/pappso/i2masschroq/-/commit/e5c74269a1a162060b3d5d1616cbb7cc12356851

But I've several other features to integrate in this release and I need some time to test it. As I'm on vacation, this will be done next week or so. I'll warn you when it's OK. I'll send a zip sample file as soon as possible.

Cheers Olivier

OlivierLangella commented 2 months ago

Hello @enryH ! sorry for the delay, I'm back.

I've uploaded a zip file in the nextcloud of proteobench containing both tsv :

Thanks again Olivier

enryH commented 2 months ago

I updated the parameter parsing: Should mbr now be marked by a T ? Before it was a 1 for True.

https://github.com/Proteobench/ProteoBench/pull/279/commits/8a079cd32094cdc7d034946c2a7b89b5aa924a61

OlivierLangella commented 2 months ago

Thank you very much @enryH ! yes if you don't mind to change the boolean value to T, it would be better.

Cheers Olivier

enryH commented 2 months ago

No it can definitely stay T

OlivierLangella commented 2 months ago

perfect, let's stick to "T" or "F" for booleans
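A parser for this convention could look like the sketch below. "T", "F" and the earlier "1" for True come from the thread; accepting "0" for False is an assumption, and the function name is hypothetical.

```python
def parse_flag(value):
    """Convert a "T"/"F" flag (or the earlier "1" convention) to a bool."""
    text = str(value).strip().upper()
    if text in ("T", "1"):
        return True
    if text in ("F", "0"):  # "0" for False is an assumption, not stated in the thread
        return False
    raise ValueError("unexpected boolean flag: " + repr(value))
```

Raising on anything else makes a change of convention in the export visible instead of silently treating it as False.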