Proteobench / ProteoBench

ProteoBench is an open and collaborative platform for community-curated benchmarks for proteomics data analysis pipelines. Our goal is to allow a continuous, easy, and controlled comparison of proteomics data analysis workflows. https://proteobench.cubimed.rub.de/
https://proteobench.readthedocs.io
Apache License 2.0

MSFragger output format #182

Closed KlemensFroehlich closed 6 months ago

KlemensFroehlich commented 9 months ago

hi everyone

I just tried to upload an MSFragger results file. I followed the instructions provided in the "How to use" section for the search.

When I upload the example MSFragger output everything seems to work. I get a table, and histograms etc. When I upload my own table, I get the error message:

❌ Proteobench ran into a problem

ImportError: Column Sequence not found in input dataframe. Please check input file and selected software tool.
Traceback:
File "C:\pythonTMP\proteobenchGIT\ProteoBench\webinterface\pages\DDA_Quant.py", line 236, in _run_proteobench
    result_performance, all_datapoints, input_df = Module().benchmarking(
                                                   ^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\froehl0004\AppData\Local\anaconda3\envs\proteoBench\Lib\site-packages\proteobench\modules\dda_quant\module.py", line 298, in benchmarking
    standard_format, replicate_to_raw = ParseInputs().convert_to_standard_format(input_df, parse_settings)
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\froehl0004\AppData\Local\anaconda3\envs\proteoBench\Lib\site-packages\proteobench\modules\dda_quant\parse.py", line 131, in convert_to_standard_format
    raise ImportError(

Which brings me to the question: are the column names hardcoded for the FragPipe output?

My output seems to look different than the example output.

example output columns:

Peptide Sequence    Modified Sequence   Prev AA Next AA Start   End Peptide Length  M/Z Charge  Assigned Modifications  Protein Protein ID  Entry Name  Gene    Protein Description Mapped Genes    Mapped Proteins A_1 Spectral Count  A_2 Spectral Count  A_3 Spectral Count  B_1 Spectral Count  B_2 Spectral Count  B_3 Spectral Count  A_1 Intensity   A_2 Intensity   A_3 Intensity   B_1 Intensity   B_2 Intensity   B_3 Intensity

my output columns:

Peptide Sequence    Modified Sequence   Prev AA Next AA Start   End Peptide Length  M/Z Charge  Compensation Voltage    Assigned Modifications  Protein Protein ID  Entry Name  Gene    Protein Description Mapped Genes    Mapped Proteins A_1 Spectral Count  A_2 Spectral Count  A_3 Spectral Count  B_4 Spectral Count  B_5 Spectral Count  B_6 Spectral Count  A_1 Apex Retention Time A_2 Apex Retention Time A_3 Apex Retention Time B_4 Apex Retention Time B_5 Apex Retention Time B_6 Apex Retention Time A_1 Intensity   A_2 Intensity   A_3 Intensity   B_4 Intensity   B_5 Intensity   B_6 Intensity   A_1 Match Type  A_2 Match Type  A_3 Match Type  B_4 Match Type  B_5 Match Type  B_6 Match Type

This happens in FragPipe 19 and 20 for me... maybe it is a different Philosopher version or a different MS1 quant module version, or I specified something incorrectly during the search.

Please find attached the combined_ion output.

combined_ion_FP19.zip

Best Klemens

RobbinBouwmeester commented 9 months ago

@KlemensFroehlich, for me it works with your file, did you select FragPipe in the dropdown menu where you uploaded the file?

Also, from the error message it seems to me that you selected MaxQuant in this dropdown menu?

wolski commented 9 months ago

@KlemensFroehlich

KeyError: "The following 'value_vars' are not present in the DataFrame: ['B_1 Intensity', 'B_2 Intensity', 'B_3 Intensity']"

We expect to see columns B_1 B_2 and B_3 in the tsv. You have B_4, B_5, B_6.

It probably has something to do with the experimental design setup in FragPipe. Did you name the bio/tech reps 1,2,3,4,5,6? That is, by the way, correct, since this is not a paired analysis. ProteoBench, however, expects 1,2,3,1,2,3.
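
The mismatch described above can be caught before the file is parsed. A minimal sketch, assuming a tab-separated header line and the six intensity column names quoted in the KeyError; the helper `missing_columns` is hypothetical and not part of ProteoBench:

```python
# Hypothetical helper, not part of ProteoBench: report which expected
# columns are absent from a tab-separated header line.
EXPECTED_INTENSITY_COLUMNS = [
    "A_1 Intensity", "A_2 Intensity", "A_3 Intensity",
    "B_1 Intensity", "B_2 Intensity", "B_3 Intensity",
]


def missing_columns(header_line, expected=EXPECTED_INTENSITY_COLUMNS):
    """Return the expected column names that the header does not contain."""
    present = {name.strip() for name in header_line.rstrip("\n").split("\t")}
    return [name for name in expected if name not in present]


# Header fragment with replicates numbered 1-6 across both conditions,
# as in the file reported in this thread.
header = "\t".join([
    "Peptide Sequence", "Charge",
    "A_1 Intensity", "A_2 Intensity", "A_3 Intensity",
    "B_4 Intensity", "B_5 Intensity", "B_6 Intensity",
])
print(missing_columns(header))
```

For this header the helper lists the three `B_1` to `B_3` intensity columns, which are the same names the KeyError complains about.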

wolski commented 9 months ago

See:

https://github.com/Proteobench/ProteoBench/blob/main/proteobench/modules/dda_quant/io_parse_settings/parse_settings_fragpipe.toml

mlocardpaulet commented 9 months ago

Good that you saw this, @KlemensFroehlich. There are two options:

  1. we work out how we generated the data to get the test file and we explain very clearly how to get the "B1, B2, B3" headers,
  2. we change the toml file to fit @KlemensFroehlich's output, and we explain how to make sure that people will always get these headers.

@brvpuyve, any thoughts on this?

KlemensFroehlich commented 9 months ago

@RobbinBouwmeester I definitely selected FragPipe... the error is reproducible on my system.

Ah sorry @wolski, you are completely right. Now it works for me when setting the replicates to 123123. I am wondering how it could work for @RobbinBouwmeester, though.

If we are opening up the discussion on the headers of the columns:

The most natural way for me to process MS1 quant files in FragPipe would be the "By file name" assignment of the Experiment name. That is at least what I always do.

Would it be possible to include multiple options for how columns can be named, or would you prefer to keep it as narrowly defined as possible?

Best Klemens

RobbinBouwmeester commented 9 months ago

@KlemensFroehlich, not sure... Hmmm, now that I think of it... I first ran a test with MQ to check that nothing else was broken. As a second file I uploaded the file you provided. Maybe that explains it?

The easiest way to allow for these things would be to change the .toml; that would mean there are multiple entries in the dropdown menu for FragPipe, though. This could be a good solution if fixing the column names is difficult. If it is very easy to fix the column names in MSFragger, I would prefer fixing the column names there.
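
One alternative to fixed column names is to match columns by pattern. A minimal sketch, assuming the `<condition>_<replicate> Intensity` naming seen in the headers quoted in this thread; the function `group_intensity_columns` is hypothetical and this is not how ProteoBench parses its input:

```python
import re

# Assumed naming scheme "<condition>_<replicate> Intensity", taken from the
# column headers quoted in this thread; this is not ProteoBench's parser.
INTENSITY_PATTERN = re.compile(
    r"^(?P<condition>[A-Za-z]+)_(?P<replicate>\d+) Intensity$"
)


def group_intensity_columns(columns):
    """Map each condition to its intensity columns, ordered by replicate number."""
    grouped = {}
    for name in columns:
        match = INTENSITY_PATTERN.match(name)
        if match:
            grouped.setdefault(match["condition"], []).append(
                (int(match["replicate"]), name)
            )
    return {
        condition: [name for _, name in sorted(pairs)]
        for condition, pairs in grouped.items()
    }


columns = [
    "Peptide Sequence", "A_1 Spectral Count",
    "A_1 Intensity", "A_2 Intensity", "A_3 Intensity",
    "B_4 Intensity", "B_5 Intensity", "B_6 Intensity",
]
print(group_intensity_columns(columns))
```

With this approach both the 1,2,3,1,2,3 and the 1,2,3,4,5,6 replicate numbering end up grouped under conditions A and B, at the cost of a looser check on what the file contains.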

wolski commented 9 months ago

@KlemensFroehlich

I fully agree: "The most natural way for me to process MS1 quant files in FragPipe would be the "By file name" assignment of the Experiment name. That is at least what I always do." Actually, I would do the same when using MaxQuant (I can't remember what the option to use file names as sample names is called there). It makes setting up and running the software much easier.

Can you please send me a FragPipe result generated with this setting? @RobbinBouwmeester Potentially, we can change the default and deprecate the others.

RobbinBouwmeester commented 9 months ago

@KlemensFroehlich, after further testing I cannot replicate my earlier results. I think I simply selected the wrong file when it did work for me. Indeed, the file that you uploaded does not work.

@wolski Yes, I agree with you and Klemens, we can easily change the .toml. That would mean it looks more like, e.g., the maxquant.toml. Is it easy to describe the procedure of "By file name" assignment to users?

mlocardpaulet commented 9 months ago

Good stuff. For the instructions, you can find it here (https://proteobench.readthedocs.io/en/latest/modules/3-DDA-Quantification-ion-level/) in the "FragPipe" section. You can either send me what to add/change, or you can directly change the corresponding file (https://github.com/Proteobench/ProteoBench/blob/main/docs/modules/3-DDA-Quantification-ion-level.md) :)

KlemensFroehlich commented 9 months ago

@wolski https://drive.switch.ch/index.php/s/zYwPh9HNCd2MrUZ This is with all the standard parameters (20 ppm initial search etc) of FragPipe but the format should be correct. Let me know if I can do anything else.

@RobbinBouwmeester as for the users:

We could just replace step 2: "Assign experiments in the workflow tab corresponding with the corresponding experimental condition(“A”, “B”)." with: "Following import of raw files (or mzML?), assign experiments 'by File Name', right above the list of raw files."

btw: Do you also want to provide the converted mzML (or mzXML, htrms, dia, ...) files for users? I don't know whether this can be a source of variance. If the procedure by which some programs convert raw files to a specific input format changes, this might be a confounder. But I honestly think this SHOULD not be the case :D

Best, Klemens

mlocardpaulet commented 9 months ago

> btw: Do you also want to provide the converted mzML (or mzXML, htrms, dia, ...) files for users? I don't know whether this can be a source of variance. If the procedure by which some programs convert raw files to a specific input format changes, this might be a confounder. But I honestly think this SHOULD not be the case :D

We discussed this and we will ignore it for this module. If we wanted to benchmark this step, we would need a dedicated module.

brvpuyve commented 9 months ago

Good catch @wolski!!

mlocardpaulet commented 8 months ago

I have amended the instructions in the documentation in PR #188