Nesvilab / FragPipe

A cross-platform proteomics data analysis suite
http://fragpipe.nesvilab.org

Advise on file labeling (IonQuant + MBR) before MSFragger Close Search using FragPipe #363

Closed BenSamy2020 closed 3 years ago

BenSamy2020 commented 3 years ago

Greetings,

Currently, I am searching patient-derived serum sample data using FragPipe (15.1 Build 4). I would like to perform MS1 quantification. Briefly, my experimental design is as follows: from a single patient, blood was collected across 3 days (Day 0, 1, 2). The collected blood was processed, and peptides were analyzed with 3 technical replicates for each collection day. I have a total of 12 .raw files to be searched.

My file labelling layout is as follows:

Path Experiment Replicate
Patient_1_Day_0_Technical_Replicate_1 Exp_1 1
Patient_1_Day_0_Technical_Replicate_2 Exp_1 2
Patient_1_Day_0_Technical_Replicate_3 Exp_1 3
Patient_1_Day_1_Technical_Replicate_1 Exp_2 1
Patient_1_Day_1_Technical_Replicate_2 Exp_2 2
Patient_1_Day_1_Technical_Replicate_3 Exp_2 3
Patient_1_Day_2_Technical_Replicate_1 Exp_3 1
Patient_1_Day_2_Technical_Replicate_2 Exp_3 2
Patient_1_Day_2_Technical_Replicate_3 Exp_3 3

Is this file labelling appropriate for my study design and requirements?

Regards, Ben

fcyu commented 3 years ago

Hi Ben,

Your layout is OK. But if you want to use MSstats afterward, you might need to number the replicates consecutively across experiments (1, 2, 3, 4, 5, ...) rather than restarting at 1 within each experiment.

Here (https://github.com/Nesvilab/MSFragger/blob/master/tutorial_fragpipe.md#multi-experiment-report) is an explanation about it. And here (https://github.com/Nesvilab/FragPipe/issues/183) is the discussion.

Best,

Fengchao
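One reading of the numbering advice above can be sketched in a few lines of Python. This is only an illustration, not part of FragPipe: the function name `renumber_replicates` and the tuple layout are hypothetical, and the file names are taken from the table earlier in this thread. Each distinct (Experiment, Replicate) pair gets its own consecutive number, so runs that already share a pair keep sharing one.

```python
def renumber_replicates(rows):
    """Give every distinct (experiment, replicate) pair a unique consecutive number.

    rows: list of (path, experiment, replicate) tuples in manifest order.
    Returns a new list with the replicate field replaced.
    """
    ids = {}
    renumbered = []
    for path, experiment, replicate in rows:
        key = (experiment, replicate)
        if key not in ids:
            ids[key] = len(ids) + 1  # next unused replicate number
        renumbered.append((path, experiment, ids[key]))
    return renumbered


# The 9-row layout from the first post: Exp_1..Exp_3, replicates 1..3 each.
manifest = [
    (f"Patient_1_Day_{day}_Technical_Replicate_{rep}", f"Exp_{day + 1}", rep)
    for day in range(3)
    for rep in (1, 2, 3)
]

renumbered = renumber_replicates(manifest)
for row in renumbered:
    print(*row, sep="\t")
```

With this layout the replicate column runs from 1 to 9 instead of restarting at 1 for each experiment.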

BenSamy2020 commented 3 years ago

Greetings FengChao,

Thank you for your prompt reply. I will take your suggestion into account. Apart from the file labelling format above, I was also interested in trying another one. The format is as follows:

Path Experiment Replicate
Patient_1_Day_0_Technical_Replicate_1 Exp 1
Patient_1_Day_0_Technical_Replicate_2 Exp 2
Patient_1_Day_0_Technical_Replicate_3 Exp 3
Patient_1_Day_1_Technical_Replicate_1 Exp 4
Patient_1_Day_1_Technical_Replicate_2 Exp 5
Patient_1_Day_1_Technical_Replicate_3 Exp 6
Patient_1_Day_2_Technical_Replicate_1 Exp 7
Patient_1_Day_2_Technical_Replicate_2 Exp 8
Patient_1_Day_2_Technical_Replicate_3 Exp 9

When I compared the proteome depth of the two labelling formats, the second option (given in this message) gave a higher protein identification count. Can I use this labelling format to search my files and perform MS1 quantification? (Since FragPipe reports the abundance of quantified proteins across the different files, I can still perform differential protein abundance analysis.)

Regards, Ben

fcyu commented 3 years ago

Hi Ben,

Yes, you can. But I don't understand why it resulted in more identified proteins. It should not affect the identification results, since both formats put each run into a separate folder.

Best,

Fengchao

BenSamy2020 commented 3 years ago

Hi FengChao,

Apologies, after looking through my log, I realized that I was searching 2 different sets of files, hence the difference.

Regards, Ben

BenSamy2020 commented 3 years ago

Greetings @fcyu,

My experimental design for a different study is slightly more complicated, and I am having trouble deciding on the labelling. Currently, I have 2 cell lines, 3 biological replicates for each cell line, and 3 fractions for each biological replicate. My file labelling is as follows (I will be using MSstats for downstream analysis; please advise me if the labelling is appropriate):

Path Experiment Replicate
SUM159PT_B1_Fraction_1 SUM159PT_B1 1
SUM159PT_B1_Fraction_2 SUM159PT_B1 1
SUM159PT_B1_Fraction_3 SUM159PT_B1 1
SUM159PT_B2_Fraction_1 SUM159PT_B2 2
SUM159PT_B2_Fraction_2 SUM159PT_B2 2
SUM159PT_B2_Fraction_3 SUM159PT_B2 2
SUM159PT_B3_Fraction_1 SUM159PT_B3 3
SUM159PT_B3_Fraction_2 SUM159PT_B3 3
SUM159PT_B3_Fraction_3 SUM159PT_B3 3
HS578T_B1_Fraction_1 HS578T_B1 4
HS578T_B1_Fraction_2 HS578T_B1 4
HS578T_B1_Fraction_3 HS578T_B1 4
HS578T_B2_Fraction_1 HS578T_B2 5
HS578T_B2_Fraction_2 HS578T_B2 5
HS578T_B2_Fraction_3 HS578T_B2 5
HS578T_B3_Fraction_1 HS578T_B3 6
HS578T_B3_Fraction_2 HS578T_B3 6
HS578T_B3_Fraction_3 HS578T_B3 6

Thank you, Ben

fcyu commented 3 years ago

Hi Ben,

Your labelling looks good.

Best,

Fengchao

BenSamy2020 commented 3 years ago

Thank you brother.

BenSamy2020 commented 3 years ago

Greetings @fcyu,

I have successfully performed the search (Default_MBR). Currently, I am trying to process the FragPipe MSstats output file using MSstats. I consistently get the following error:

"Error in dataProcess(raw, logTrans = 10) : MSstats suspects that there are fractionations and potentially technical replicates too. Please add Fraction column in the input."

I am not sure, but I suspect that the MSstats file is lacking a Fraction column. I have attached my MSstats file below for your reference. Could you please advise me on how I should proceed?

Regards, Ben. MSstats.zip

fcyu commented 3 years ago

Hmm, interesting. It looks like there is a "hidden" column not documented in either https://www.bioconductor.org/packages/release/bioc/manuals/MSstats/man/MSstats.pdf or https://www.bioconductor.org/packages/release/bioc/vignettes/MSstats/inst/doc/MSstats.html.

After some digging, I found a related issue (https://github.com/RobertsLab/resources/issues/516#issuecomment-449133673). I tested adding a Fraction column with raw$Fraction <- 1, and it works (but you need to specify different numbers if your samples have different fractions). For now, could you add that column yourself? We will fix the output in the next release.

Best,

Fengchao

BenSamy2020 commented 3 years ago

Greetings @fcyu,

Thank you for your rapid reply. Unfortunately, Excel consistently hangs when I make the changes, due to the sheer amount of data in the CSV file.

Would you be able to tell me approximately when the next release will be out? (I am sorry for being so pushy.) I have collaborators waiting for the analyzed results.

Regards Ben.

fcyu commented 3 years ago

Hi Ben,

You can easily add it in R with a command like raw$Fraction <- 1. If your data has different fractions, it will be a little more complicated, but I am sure that it is feasible.

Best,

Fengchao
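For a CSV that is too large to edit in Excel, the Fraction column could also be added by a small script that streams the file row by row. The sketch below is a hedged example, not FragPipe or MSstats code. It assumes the run name sits in a column called `Run` and that file names follow the `_Fraction_N` pattern used in this thread. The function name and the three-column sample are made up for illustration, and rows without a match fall back to fraction 1.

```python
import csv
import io
import re

# Assumed naming scheme, taken from the file names in this thread.
FRACTION_PATTERN = re.compile(r"_Fraction_(\d+)")


def add_fraction_column(src, dst, run_column="Run"):
    """Copy CSV rows from src to dst, appending a Fraction column.

    The fraction number is parsed from the run name; runs without a
    recognizable fraction suffix are labeled fraction 1.
    """
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames) + ["Fraction"]
    writer = csv.DictWriter(dst, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:  # one row at a time, so the whole file is never in memory
        match = FRACTION_PATTERN.search(row[run_column])
        row["Fraction"] = match.group(1) if match else "1"
        writer.writerow(row)


# Tiny in-memory sample standing in for the real MSstats.csv.
sample = io.StringIO(
    "ProteinName,Run,Intensity\n"
    "P1,SUM159PT_B1_Fraction_1,100\n"
    "P1,SUM159PT_B1_Fraction_2,200\n"
)
result = io.StringIO()
add_fraction_column(sample, result)
print(result.getvalue())
```

For a real file, the two `io.StringIO` objects would be replaced by files opened with `open(..., newline="")`, one for reading and one for writing.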

BenSamy2020 commented 3 years ago

Greetings @fcyu,

For the representative file labelling table that I sent earlier in this thread (repeated below for your reference), the group comparison matrix gets very tricky. I am currently comparing 10 cell lines, each with 3 biological replicates, and each biological replicate has 3 fractions. (I managed to incorporate a Fraction column in the MSstats.csv file; refer to the MSstats QC plot below for the study file labels.)

QCPlot.zip

Path Experiment Replicate
SUM159PT_B1_Fraction_1 SUM159PT_B1 1
SUM159PT_B1_Fraction_2 SUM159PT_B1 1
SUM159PT_B1_Fraction_3 SUM159PT_B1 1
SUM159PT_B2_Fraction_1 SUM159PT_B2 2
SUM159PT_B2_Fraction_2 SUM159PT_B2 2
SUM159PT_B2_Fraction_3 SUM159PT_B2 2
SUM159PT_B3_Fraction_1 SUM159PT_B3 3
SUM159PT_B3_Fraction_2 SUM159PT_B3 3
SUM159PT_B3_Fraction_3 SUM159PT_B3 3
HS578T_B1_Fraction_1 HS578T_B1 4
HS578T_B1_Fraction_2 HS578T_B1 4
HS578T_B1_Fraction_3 HS578T_B1 4
HS578T_B2_Fraction_1 HS578T_B2 5
HS578T_B2_Fraction_2 HS578T_B2 5
HS578T_B2_Fraction_3 HS578T_B2 5
HS578T_B3_Fraction_1 HS578T_B3 6
HS578T_B3_Fraction_2 HS578T_B3 6
HS578T_B3_Fraction_3 HS578T_B3 6

For example, to compare 2 cell lines I will have to write: comparison1 <- matrix(c(0.333,0.333,0.333,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-0.333,-0.333,-0.333), nrow = 1)

Collectively, analysis with MSstats gets complicated and throws multiple errors. To overcome these issues, I relabeled (removed _B1, _B2, _B3) my FragPipe workflow tab file to:

Path Experiment Replicate
SUM159PT_B1_Fraction_1 SUM159PT 1
SUM159PT_B1_Fraction_2 SUM159PT 1
SUM159PT_B1_Fraction_3 SUM159PT 1
SUM159PT_B2_Fraction_1 SUM159PT 2
SUM159PT_B2_Fraction_2 SUM159PT 2
SUM159PT_B2_Fraction_3 SUM159PT 2
SUM159PT_B3_Fraction_1 SUM159PT 3
SUM159PT_B3_Fraction_2 SUM159PT 3
SUM159PT_B3_Fraction_3 SUM159PT 3
HS578T_B1_Fraction_1 HS578T 4
HS578T_B1_Fraction_2 HS578T 4
HS578T_B1_Fraction_3 HS578T 4
HS578T_B2_Fraction_1 HS578T 5
HS578T_B2_Fraction_2 HS578T 5
HS578T_B2_Fraction_3 HS578T 5
HS578T_B3_Fraction_1 HS578T 6
HS578T_B3_Fraction_2 HS578T 6
HS578T_B3_Fraction_3 HS578T 6

Based on your experience, do you think the labelling above is appropriate for MSstats and its downstream analysis? Lastly, I would like to apologize: I understand that MSstats is not a tool you manage. It would be nice if I could get your input on how I can proceed with the MSstats analysis.

Regards Ben

anesvi commented 3 years ago

Please email the MSstats team; I am sure they will be happy to advise you on your experimental design.

fcyu commented 3 years ago

Hi Ben,

I am not sure about this one. As Alexey pointed out, it would be better to ask the MSstats team.

Best,

Fengchao

bluemooninvestor commented 3 weeks ago

Greetings @fcyu,

My experimental design for a different study is slightly more complicated, and I am having trouble deciding on the labelling. Currently, I have 2 cell lines, 3 biological replicates for each cell line, and 3 fractions for each biological replicate. My file labelling is as follows (I will be using MSstats for downstream analysis; please advise me if the labelling is appropriate):

Path Experiment Replicate
SUM159PT_B1_Fraction_1 SUM159PT_B1 1
SUM159PT_B1_Fraction_2 SUM159PT_B1 1
SUM159PT_B1_Fraction_3 SUM159PT_B1 1
SUM159PT_B2_Fraction_1 SUM159PT_B2 2
SUM159PT_B2_Fraction_2 SUM159PT_B2 2
SUM159PT_B2_Fraction_3 SUM159PT_B2 2
SUM159PT_B3_Fraction_1 SUM159PT_B3 3
SUM159PT_B3_Fraction_2 SUM159PT_B3 3
SUM159PT_B3_Fraction_3 SUM159PT_B3 3
HS578T_B1_Fraction_1 HS578T_B1 4
HS578T_B1_Fraction_2 HS578T_B1 4
HS578T_B1_Fraction_3 HS578T_B1 4
HS578T_B2_Fraction_1 HS578T_B2 5
HS578T_B2_Fraction_2 HS578T_B2 5
HS578T_B2_Fraction_3 HS578T_B2 5
HS578T_B3_Fraction_1 HS578T_B3 6
HS578T_B3_Fraction_2 HS578T_B3 6
HS578T_B3_Fraction_3 HS578T_B3 6

Thank you, Ben

For the quoted text, I think the Experiment column should have the same value for all biological replicates of a cell line (for use in MSstats): Experiment (FragPipe) = Condition (MSstats).
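To illustrate why this matters for the comparison matrix: if each cell line is a single condition, a pairwise contrast needs only one entry per cell line, with +1 for one and -1 for the other. The Python sketch below just does that arithmetic and is not MSstats code. The helper name `contrast_row` is hypothetical, and the alphabetical ordering of condition levels is an assumption that should be checked against the level order in your own MSstats output.

```python
def contrast_row(conditions, numerator, denominator):
    """Build one contrast row with +1 for numerator and -1 for denominator.

    conditions: iterable of condition labels, one per run or sample.
    Returns (levels, row), where levels is the assumed column order.
    """
    levels = sorted(set(conditions))  # assumption: columns follow sorted level order
    row = [0.0] * len(levels)
    row[levels.index(numerator)] = 1.0
    row[levels.index(denominator)] = -1.0
    return levels, row


# With Experiment = cell line, the two-cell-line design has two conditions.
levels, row = contrast_row(["SUM159PT", "HS578T"], "SUM159PT", "HS578T")
print(levels)
print(row)
```

With 10 cell lines the same helper would return a row of 10 entries, rather than one entry per biological replicate as in the 30-entry vector quoted earlier in the thread.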