Closed OlivierLangella closed 2 months ago
Thanks a lot! How do you suggest dealing with the ion redundancy due to the presence of all isotopes? Currently, when there are several quantities (or rows in the table) for one precursor ion, we calculate the sum of the signal. Would this work for these data?
You're welcome! To deal with isotopes, I take the theoretically most abundant isotope for each peptide. The computed theoretical ratio is written in the last column, "niratio", of the peptides_q1_All_samples.tsv file. This gives the best results.
I think that using the sum of the signal would be less efficient. In this case, to simplify parsing, it is better to take the monoisotope only: "ninumber" == 0 && "nirank" == 1
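The two selection rules mentioned above can be sketched as follows. This is only an illustration: the column names "peptide", "ninumber", "nirank" and "niratio" come from this thread, while the "z" and "area" columns, the sample values, and the reading that a higher "niratio" means a more abundant isotope are assumptions.

```python
import csv
import io

# Hypothetical excerpt of a MassChroQ peptide table. Only "peptide", "ninumber",
# "nirank" and "niratio" are named in this thread; "z" and "area" are assumed.
SAMPLE_TSV = (
    "peptide\tz\tninumber\tnirank\tniratio\tarea\n"
    "pep1\t2\t0\t1\t0.55\t1200.0\n"
    "pep1\t2\t1\t1\t0.30\t650.0\n"
    "pep2\t3\t0\t1\t0.35\t400.0\n"
    "pep2\t3\t1\t1\t0.38\t430.0\n"
)


def read_rows(text):
    """Parse tab-separated text into a list of dicts (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def monoisotopic_rows(rows):
    """Keep only rows where ninumber == 0 and nirank == 1."""
    return [r for r in rows if r["ninumber"] == "0" and r["nirank"] == "1"]


def most_abundant_rows(rows):
    """Keep, per (peptide, charge), the row with the highest theoretical ratio."""
    best = {}
    for r in rows:
        key = (r["peptide"], r["z"])
        if key not in best or float(r["niratio"]) > float(best[key]["niratio"]):
            best[key] = r
    return list(best.values())


rows = read_rows(SAMPLE_TSV)
mono = monoisotopic_rows(rows)
top = most_abundant_rows(rows)
```

For short peptides the two rules usually pick the same row; they diverge for larger peptides, where the first isotope can be more abundant than the monoisotope (as for the made-up "pep2" here).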
Another issue I've seen concerns PTMs: in my results, there is sequence redundancy if the same peptide was quantified at different MH+. I've seen an issue about it, #144. Did you find a solution?
Perhaps I can code something to get a tailored output for ProteoBench inside i2MassChroQ? Thanks, Olivier
For the PTM issue : the column "peptide" is a unique identifier for the sequence+PTM mass
Right now, for the module DDA quantification - precursor ions, we use one row per precursor ion. And a precursor ion is sequence + localised modification(s) [ideally in ProForma format] + charge
So it is perfect for your output.
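As a rough sketch of the "one row per precursor ion" convention, with the signal summed when several rows map to the same ion: the identifier layout (`sequence|Z=charge`), the tuple layout of the rows, and the raw file name are assumptions made for illustration, not the ProteoBench implementation.

```python
from collections import defaultdict

# Hypothetical rows: (ProForma sequence, charge, raw file, quantity).
rows = [
    ("VPDAVGKC[MOD:00397]R", 2, "sample_A.mzML", 1000.0),
    ("VPDAVGKC[MOD:00397]R", 2, "sample_A.mzML", 250.0),
    ("VPDAVGKC[MOD:00397]R", 3, "sample_A.mzML", 400.0),
]


def precursor_ion(proforma, charge):
    """Build a precursor ion identifier (this layout is an assumption)."""
    return f"{proforma}|Z={charge}"


def sum_per_precursor(rows):
    """Sum quantities of rows sharing the same precursor ion and raw file."""
    totals = defaultdict(float)
    for proforma, charge, raw_file, quantity in rows:
        totals[(precursor_ion(proforma, charge), raw_file)] += quantity
    return dict(totals)


totals = sum_per_precursor(rows)
```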
Regarding the isotope selection, you know best. It just has to be clearly described in the documentation, where we can include a paragraph on MassChroQ outputs :).
One other thing that could be complicated: the parameter file. The way I see it, it should be relatively easy to have your data compatible with ProteoBench for local visualization (addition of your point to the plot). But for public submission (if the user wants the point to be visible by all), we will require upload of some parameter files that contain parameters of interest, which would include search parameters (you can see some when you hover over the plot). Here is an example from one of the current public points:
"software_name":"MaxQuant",
"software_version":"1.5.3.30",
"search_engine_version":null,
"search_engine":"Andromeda",
"ident_fdr_psm":null,
"ident_fdr_peptide":0.01,
"ident_fdr_protein":0.01,
"enable_match_between_runs":"false",
"precursor_mass_tolerance":"4.5 ppm",
"precursor_mass_tolerance_unit":null,
"fragment_mass_tolerance":"20 ppm",
"fragment_mass_tolerance_unit":null,
"enzyme":"Trypsin\/P",
"allowed_miscleavages":1,
"min_peptide_length":7,
"max_peptide_length":null,
"fixed_mods":"Carbamidomethyl (C)",
"variable_mods":"Oxidation (M),Acetyl (Protein N-term)",
"max_mods":5.0,
"min_precursor_charge":null,
"max_precursor_charge":7.0
Is there an output from your tool where we could retrieve this information? It can be several files.
OK, one thing I forgot: we need to match the protein identifiers to the peptide ions. It is easy but right now we upload only one file. So maybe we could discuss having a ProteoBench-compatible output from i2MassChroQ. If you don't mind. Let's discuss it in a meeting.
Right, we'll see it tomorrow. I've added a ProForma column in the results: this was already available in MassChroQ but not used in output files. To generate only one file, I've made a specific output for MS-Stats. It uses a simple R script that merges tables and automatically selects the most abundant isotope, so this can be used to build a ProteoBench-compatible output.
Thanks !
Hi all, here is a quick recap of the meeting:
Thank you very much @mlocardpaulet! I now have a "ProteoBench" export button that produces the data file as mentioned: raw file with mzML extension, sequence, ProForma, charge, proteins (species and ; for groups). Normalisation is performed using the MCQR package after the raw quantification made by MassChroQ.
The first "proteobench_export.tsv" is available in https://cloud.cmb.ugent.be/index.php/s/zdGB3zZ7Fwed9gq?path=%2FModule_2_DDA_quantification%2Fsearch_results%2Fi2MassChroQ%2Fresult_proteobench_2pep_fdr09.d
Important parameters are :
parameter | value |
---|---|
identification engine | X!Tandem 2017.2.1.4 |
protein inference | i2MassChroQ 1.0.6 |
post processing | MCQR 0.6.11 |
Maximum number of missed cleavages | 1 |
PSM FDR | 0.009 |
Endopeptidase | Trypsin/P |
Fixed modifications | Carbamidomethylation (C) |
Variable modifications | Oxidation (M), Acetyl (Protein N-term) |
Precursor mass tolerance | 10 ppm |
Fragment mass tolerance | 0.02 Da |
Minimum peptide length | N.A. |
Normalisation | median.RT |
I can produce a file with those parameters, it is also possible to produce a single ODS file combining the parameters and peptide quantifications.
how about the .toml ?
Thank you very much Olivier
@OlivierLangella: what field should we use for the quantification? I suspect areanorm?
Sorry, yes that is "areanorm" Thanks ! Olivier
OK, thanks a lot. I'll have a look at the .toml and the input of the ions table (probably not today). @enryH could you have a look at the parameter parsing? Or give me pointers?
Yes, sure. It's simply a matter of reading in a toml file as a dictionary, so this will be straightforward. I'll make a proposal and then you can comment on my assignments based on the information above ;)
So the parsing and plotting is done! Now we only need the parser for the search engine parameters (@enryH)
Hello @OlivierLangella,
we have made your outputs compatible with ProteoBench. We noticed that the modified peptides are duplicated in the column "ProForma" (here is an example: VPDAVGKC[MOD:00397]R;VPDAVGKC[MOD:00397]R;VPDAVGKC[MOD:00397]R). Could you please remove this redundancy? We could do it, but I think that it would be cleaner if you do it on your side. What do you think?
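For reference, removing this redundancy on the parsing side could be as small as the sketch below. The function name is hypothetical, and it assumes that the entries of a cell are separated by ";" as in the example above.

```python
def collapse_proforma(cell, sep=";"):
    """Drop repeated entries from a separator-joined cell, keeping first-seen order."""
    seen = []
    for item in cell.split(sep):
        item = item.strip()
        if item and item not in seen:
            seen.append(item)
    return sep.join(seen)


duplicated = "VPDAVGKC[MOD:00397]R;VPDAVGKC[MOD:00397]R;VPDAVGKC[MOD:00397]R"
cleaned = collapse_proforma(duplicated)
```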
Regarding the parameters, @OlivierLangella, I don't see any information regarding match between runs. Maybe you could add some? At least a TRUE value if match between runs is used.
Also, @enryH, I did not create any tests for this format. I am happy to do it but wouldn't mind a short tour of what to modify (I have never really been involved in the tests yet).
Is the parameter file already uploaded somewhere?
Actually no. I don't think that we have one. @OlivierLangella could you send us a parameter file?
Last point: I have not written the documentation on how to use i2MassChroQ to get compatible export yet, but I would be happy to discuss it with @OlivierLangella when you have some time.
I have this issue now that it is on the server, and it seems to be only with i2MassChroQ output. It seems to be connected to the toml file? Maybe @julianu would have an idea of where the problem comes from?
It does not crash when I run the main locally.
And it does not add the point.
@OlivierLangella here is the plot (plotted locally - green point). Numbers are really high, it is filtered at 1% FDR, right?
Hello Marie, sorry for the late response, I'm a bit busy and I'll be on vacation next week. Yes, this has been filtered using a PSM FDR threshold of 0.8998%, protein FDR of 0.9873%, peptide FDR 1.1963%.
The formula used to compute FDR is #decoy / #target .
I've fixed the ProForma redundancy problem (uploaded in nextcloud).
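The FDR formula quoted above (#decoy / #target) can be written out as a small helper. The counts in the example are invented and are not the counts behind the percentages reported in this thread.

```python
def fdr(n_decoy, n_target):
    """False discovery rate estimated as number of decoy hits / number of target hits."""
    if n_target <= 0:
        raise ValueError("n_target must be positive")
    return n_decoy / n_target


# Made-up example: 9 decoy PSMs for 1000 target PSMs gives an FDR of 0.9%.
example_fdr = fdr(9, 1000)
```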
OK, then we are good to go for local visualization. It is broken in the current version on the server, but it will be fixed in the next and it is corrected in the main. You can already try it locally.
For the parameters, @enryH, I suspect that you can find the file "MassChroQ informations - q1" on the cloud. Maybe we'll need to check if all the information that we want is there.
This is totally fine, no hurry. One question regarding FDR: does protein-level validation impact the set of ions that are reported in the ProteoBench output that we have? Are the PSMs from non-validated proteins removed? If it is the case, we may also want to keep track of all the FDRs in your parameter file. The parameter files should contain as many parameters as possible. We can then choose which ones we want to indicate next to the plot in ProteoBench, but for public submission, it would be great to have also a parameter file that contains everything and that we can go back to if needed to understand differences between pipelines.
Nope, the file "MassChroQ informations - q1" on the cloud does not have all the information that we want. It contains the following:
MassChroQ version 2.4.25
Alignment parameters
alignment group id All_samples
alignment group reference XML id LFQ_Orbitrap_DDA_Condition_A_Sample_Alpha_02
alignment group reference msrun filename /gorgone/pappso/moulon/raw/20240131_proteobench/LFQ_Orbitrap_DDA_Condition_A_Sample_Alpha_02.mzML
MS1 smoothing half window 0
MS2 smoothing half window 15
MS2 tendency half window 10
XIC parameters
extraction range lower limit 10 ppm
extraction range upper limit 10 ppm
natural isotope minimum abundance 0.8
matching mode post_matching
Filters
antiSpike|5
Detection method
detection Zivy
smoothing half edge window 1
maxmin half edge window 3
minmax half edge window 4
detection threshold on maxmin 3000
detection threshold on minmax 5000
And that's different from what @OlivierLangella reported previously here:
Important parameters are :
parameter | value |
---|---|
identification engine | X!Tandem 2017.2.1.4 |
protein inference | i2MassChroQ 1.0.6 |
post processing | MCQR 0.6.11 |
Maximum number of missed cleavages | 1 |
PSM FDR | 0.009 |
Endopeptidase | Trypsin/P |
Fixed modifications | Carbamidomethylation (C) |
Variable modifications | Oxidation (M), Acetyl (Protein N-term) |
Precursor mass tolerance | 10 ppm |
Fragment mass tolerance | 0.02 Da |
Minimum peptide length | N.A. |
Normalisation | median.RT |
Thanks a lot @mlocardpaulet and @enryH !
You're totally right concerning the need to have as much information as possible about how inference was done, thresholds, etc. Indeed, the information file already uploaded in nextcloud has all the needed information regarding the quantification, but it lacks information about protein inference and validation.
At this time, that information is displayed in the GUI and in the tables exported from i2MassChroQ, but my goal is to produce a single parameter file containing everything needed, exported automatically from the "proteobench button".
One question regarding FDR: does protein-level validation impact the set of ions that are reported in the ProteoBench output that we have? Are the PSMs from non-validated proteins removed?
Only validated proteins and PSMs are processed for quantification. PSMs from non-validated proteins are removed, unless they are shared with another validated protein AND they are also valid at the PSM level.
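One way to read this rule is: a PSM is kept when it is valid at the PSM level and maps to at least one validated protein. The sketch below encodes that reading only; it is not the i2MassChroQ implementation, and the function and protein names are hypothetical.

```python
def keep_psm(psm_is_valid, protein_ids, validated_proteins):
    """Keep a PSM if it is valid and at least one of its proteins is validated."""
    return psm_is_valid and any(p in validated_proteins for p in protein_ids)


validated = {"PROT_A", "PROT_B"}  # hypothetical validated protein identifiers

# Valid PSM shared between a non-validated and a validated protein: kept.
shared = keep_psm(True, ["PROT_X", "PROT_A"], validated)
# Valid PSM that only maps to a non-validated protein: removed.
orphan = keep_psm(True, ["PROT_X"], validated)
# PSM that is not valid at the PSM level: removed, whatever its proteins.
invalid = keep_psm(False, ["PROT_A"], validated)
```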
I've added to the nextcloud directory the full list of tandem presets used for identification and the full exported tables from i2MassChroQ ("proteobench_2pep_fdr09" directory), containing the parameters for validation (i2MassChroQ information.tsv).
There is also the XML project file "proteobench_2pep_fdr09.xpip" that you can use to reload the whole project in i2MassChroQ.
One last thing I forgot to mention, yes MBR is used for quantification.
Thanks for this really useful effort to produce comparable results. I have a lot to do to reach the minimum required; I'll do that as soon as possible.
Additionally, the documentation of i2MassChroQ is available here
Good afternoon @enryH @mlocardpaulet , sorry for the long delay ;)
So, I've implemented the "proteobench export" button in i2MassChroQ: it produces an Open Document Spreadsheet file containing 2 datasheets. The first datasheet holds all the parameters used for the whole workflow (identification parameters, software names and versions, inference parameters, quantification ...). The second datasheet is the ProteoBench tabulated format as defined previously.
You will find the first draft on nextcloud under the i2MassChroQ repository, named "i2mproteobench_2pep_fdr01psm_fdr01prot.ods".
Is this format OK for you? If there are too many parameters, we can select the most relevant ones or rename them if needed.
best wishes Olivier
I can open the ods file using Excel, but it takes several minutes (~4 min) to read it using odfpy (through the pandas interface) on Windows. Can you read in your file using pandas faster?
https://pypi.org/project/odfpy/
Otherwise a plain small csv file with the parameter sheet as an export would be maybe even simpler?
Best, Henry
Oh yes, that's a problem. I've seen odfpy and I thought it would be ok on Windows. Bringing together the parameters and the quantification data would have been simpler for the submission of a single file.
But in this case, I can produce a zip file containing both files in tsv sheets ?
Best, Olivier
Yes, that would be best. Currently users have to submit the files separately to the GUI anyway.
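Reading such a zip archive needs only the standard library. The sketch below builds a small archive in memory so that it is self-contained; the member name "parameters.tsv" and the table contents are assumptions, not the actual layout of the i2MassChroQ export.

```python
import csv
import io
import zipfile


def read_tsv_members(zip_bytes):
    """Return {member name: list of rows} for every .tsv file in a zip archive."""
    tables = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith(".tsv"):
                text = archive.read(name).decode("utf-8")
                tables[name] = list(csv.reader(io.StringIO(text), delimiter="\t"))
    return tables


# Build a tiny stand-in archive in memory (file names and content are made up).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("parameters.tsv", "psm_fdr\t0.009\nmcq_mbr\tT\n")
    archive.writestr("proteobench_export.tsv", "ProForma\tcharge\nVPDAVGKC[MOD:00397]R\t2\n")

tables = read_tsv_members(buffer.getvalue())
```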
params = ProteoBenchParameters(
    software_name="i2MassChroQ",
    software_version=params.loc["i2MassChroQ_VERSION"],
    search_engine=params.loc["AnalysisSoftware_name"],
    search_engine_version=params.loc["AnalysisSoftware_version"],
    ident_fdr_psm=params.loc["psm_fdr"],
    ident_fdr_peptide=params.loc["peptide_fdr"],
    ident_fdr_protein=params.loc["protein_fdr"],
    enable_match_between_runs=params.loc["mcq_mbr"],
    precursor_mass_tolerance=_tol_prec,
    fragment_mass_tolerance=_tol_frag,
    enzyme=None,
    allowed_miscleavages=params.loc["refine, maximum missed cleavage sites"],
    min_peptide_length=None,
    max_peptide_length=None,
    fixed_mods=None,
    variable_mods=None,
    max_mods=None,
    min_precursor_charge=None,
    max_precursor_charge=params.loc["spectrum, maximum parent charge"],
)
Do you record
Is the minimum precursor charge always 1?
Best, Henry
Good question, all the parameters used by X!Tandem are reported, but there is not always a direct mapping to the parameters required by proteobench.
The other parameters you've chosen seem OK to me. Best, Olivier
I implemented everything. Only major point is that we do not have the minimum and maximum peptide length. I guess the minimum can be inferred knowing the possible AAs, but for now I would leave it out.
Was the maximum peptide length 38 in the example you provided? Or is this a software-specific general statement?
See the current parsed example here
Thank you very much @enryH! This is almost done. For the minimum and maximum peptide length, X!Tandem is a bit specific. In the example, I observed a maximum peptide length of 38, but this is not a software-specific general statement. In fact, the peptide modeling engine has no limit. Perhaps we could just leave "None" as the value? That reflects the fact that there is no constraint.
Another important thing is that X!Tandem can be used with a second-stage database search, called "refinement". The second stage only considers proteins for which a peptide has been found with an E-value threshold ("refine, maximum valid expectation value"). If it is enabled by setting "refine" to "yes", the second stage has its own parameters, such as "refine, maximum missed cleavage sites". I think it is important for reproducibility to know that refinement is used and what the parameters were for the first and second stages. "allowed_miscleavages" is not simple, because missed cleavages can be searched at the first stage with "scoring, maximum missed cleavage sites" and refined with "refine, maximum missed cleavage sites".
How can we translate it in ProteoBench? As a simple solution, at the price of losing information, "allowed_miscleavages" can be set to "scoring, maximum missed cleavage sites" by default, and to "refine, maximum missed cleavage sites" if refinement is set to "yes"?
Thanks again Olivier
"None" reflects no constraint, so I agree! No minimum or maximum peptide length is explicitly set.
How can we translate it in proteobench ? For a simple solution, at the price of losing information, "allowed_miscleavages" can be set to "scoring, maximum missed cleavage sites" by default and if refinement is set to "yes", then set it to "refine, maximum missed cleavage sites" ?
That sounds reasonable!
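The selection rule agreed on here can be sketched as below. The parameter names are the X!Tandem ones quoted in this thread; the function name and the plain-dictionary input are assumptions, and the actual ProteoBench parser may differ.

```python
def pick_allowed_miscleavages(tandem_params):
    """Use the refinement value when "refine" is "yes", else the first-stage value."""
    if str(tandem_params.get("refine", "no")).strip().lower() == "yes":
        return int(tandem_params["refine, maximum missed cleavage sites"])
    return int(tandem_params["scoring, maximum missed cleavage sites"])


# Hypothetical parameter values, as strings like those read from a text file.
with_refinement = {
    "refine": "yes",
    "scoring, maximum missed cleavage sites": "1",
    "refine, maximum missed cleavage sites": "2",
}
without_refinement = {"refine": "no", "scoring, maximum missed cleavage sites": "1"}

refined_value = pick_allowed_miscleavages(with_refinement)
first_stage_value = pick_allowed_miscleavages(without_refinement)
```

As noted above, this loses information: when refinement is on, the first-stage setting is no longer visible in the single ProteoBench field.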
I updated the code to pick up the options for allowed_miscleavages as discussed. Regarding the use of tsv instead of odf files: is there already an option to export the text-based files?
Thank you very much again! The zip file generation is in progress: https://forgemia.inra.fr/pappso/i2masschroq/-/commit/e5c74269a1a162060b3d5d1616cbb7cc12356851
But I've several other features to integrate in this release and I need some time to test it. As I'm on vacation, this will be done next week or so. I'll warn you when it's OK. I'll send a zip sample file as soon as possible.
Cheers Olivier
Hello @enryH ! sorry for the delay, I'm back.
I've uploaded a zip file in the nextcloud of proteobench containing both tsv :
Thanks again Olivier
I updated the parameter parsing. Should mbr now be marked by a T? Before it was a 1 for True.
https://github.com/Proteobench/ProteoBench/pull/279/commits/8a079cd32094cdc7d034946c2a7b89b5aa924a61
Thank you very much @enryH ! yes if you don't mind to change the boolean value to T, it would be better.
Cheers Olivier
No it can definitely stay T
perfect, let's stick to "T" or "F" for booleans
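A tolerant parser for this flag could look like the sketch below. "T" and "F" are the values agreed on here and "1" is the older encoding of True mentioned above; accepting "0", "TRUE" and "FALSE" as well is an assumption made for robustness, and the function name is hypothetical.

```python
def parse_flag(value):
    """Convert a "T"/"F" style flag (or the older "1"/"0") to a Python bool."""
    text = str(value).strip().upper()
    if text in {"T", "TRUE", "1"}:
        return True
    if text in {"F", "FALSE", "0"}:
        return False
    raise ValueError(f"unrecognised boolean flag: {value!r}")


mbr_enabled = parse_flag("T")
```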
Hi, is it possible to add support for the ion quantification as exported by i2MassChroQ ?
I've run i2MassChroQ on the supplied dataset (6 files, 2 conditions) as follows: first, conversion to mzML using ThermoRawFileParser,
and then identification with X!Tandem Alanine 2017.2.1.4.
Identified peptides were quantified using MassChroQ 2.4.25. It produces 2 TSV files: one that contains the correspondence between protein ids and peptide ids, and one containing peptide, charge, isotope number, and area-under-the-curve quantities in each sample.
Can I send you those files somewhere ?
The difference with what is mentioned as required in the proteobench documentation is that this contains all quantified isotopes.
There is no normalization in this table, no data cleaning, no shared peptide removal...
If it is not possible, I can try to generate the custom format as described in the documentation.
Thanks for your time and help Olivier