Open KlemensFroehlich opened 2 years ago
We will support MSstatsTMT in the future. Stay tuned!
BTW, for the LFQ data, we recommend using IonQuant in the MS1 Quant tab and not enabling "generate msstats files" in the validation tab. IonQuant will always generate a MSstats.tsv with LFQ intensities from all experiments.
Best,
Fengchao
Thanks Fengchao for the answer. Looking forward to using fragpipe for everything, including TMT in the future :)
Best, Klemens
Dear FragPipe team,
I was wondering what the current status of MSstatsTMT support is. I instructed FragPipe 19.0 to generate msstats.csv files in the context of a TMT10plex workflow, but I am struggling to identify the intended way of importing the data into MSstatsTMT. My naive guess was that
> MSstatsTMT::PhilosophertoMSstatsTMTFormat(path = paste0(path,"TMT10plex_T1_2"), folder = TRUE, annotation = paste0(path,"/combined_annotation.tsv"))
INFO [2023-03-14 17:04:24] ** Raw data from Philosopher imported successfully.
Error in annotation[["Channel"]] : subscript out of bounds
> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lattice_0.20-45 BiocParallel_1.28.3 TPP2D_1.10.0 dplyr_1.0.8 readr_2.1.2
loaded via a namespace (and not attached):
[1] tidyr_1.2.0 bit64_4.0.5 vroom_1.5.7 splines_4.1.2 foreach_1.5.2 gtools_3.9.2 assertthat_0.2.1 yaml_2.3.5
[9] ggrepel_0.9.1 numDeriv_2016.8-1.1 backports_1.4.1 pillar_1.7.0 glue_1.6.2 limma_3.50.1 digest_0.6.29 checkmate_2.0.0
[17] minqa_1.2.4 colorspace_2.0-3 preprocessCore_1.56.0 htmltools_0.5.2 Matrix_1.4-1 pkgconfig_2.0.3 MSstatsTMT_2.2.7 purrr_0.3.4
[25] scales_1.1.1 openxlsx_4.2.5 tzdb_0.3.0 lme4_1.1-28 tibble_3.1.6 generics_0.1.2 farver_2.1.0 ggplot2_3.3.5
[33] ellipsis_0.3.2 withr_2.5.0 cli_3.2.0 survival_3.3-1 magrittr_2.0.3 crayon_1.5.1 evaluate_0.15 fansi_1.0.3
[41] doParallel_1.0.17 nlme_3.1-157 MASS_7.3-56 log4r_0.4.2 gplots_3.1.1 tools_4.1.2 data.table_1.14.2 hms_1.1.1
[49] lifecycle_1.0.1 stringr_1.4.0 munsell_0.5.0 zip_2.2.0 MSstats_4.2.0 compiler_4.1.2 caTools_1.18.2 rlang_1.0.2
[57] grid_4.1.2 RCurl_1.98-1.6 nloptr_2.0.0 iterators_1.0.14 rstudioapi_0.13 marray_1.72.0 bitops_1.0-7 labeling_0.4.2
[65] rmarkdown_2.13 boot_1.3-28 lmerTest_3.1-3 gtable_0.3.0 codetools_0.2-18 DBI_1.1.2 R6_2.5.1 knitr_1.38
[73] fastmap_1.1.0 bit_4.0.4 utf8_1.2.2 MSstatsConvert_1.4.1 KernSmooth_2.23-20 stringi_1.7.6 parallel_4.1.2 Rcpp_1.0.8.3
[81] vctrs_0.4.0 tidyselect_1.1.2 xfun_0.30
would be the right way. But it seems like the annotation file is not structured in the expected way. I also noticed that the combined_annotation.tsv has been renamed in the latest FragPipe release. Is this file intended for MSstats import at all? Or does one need to construct the annotation file manually according to the MSstatsTMT package vignette (but PhilosophertoMSstatsTMTFormat() is the right import function to use)?
> library(readr)
> msstats_T1_2 <- read_csv("~/Downloads/WU286728/TMT10plex_T1_2/msstats.csv")
Rows: 170933 Columns: 23
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Spectrum.Name, Spectrum.File, Peptide.Sequence, Modified.Peptide.Sequence, Gene, Protein.Accessions
dbl (15): Charge, Calculated.MZ, PeptideProphet.Probability, Intensity, Purity, Channel 126, Channel 127N, Channel 127C, Channel 128N, Channel 128C, Channel 129N, Channel 129C, ...
lgl (2): Is.Unique, Modifications
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
> View(msstats_T1_2)
> msstats_T1_2
# A tibble: 170,933 × 23
Spectrum.Name Spectrum.File Peptide.Sequence Modified.Peptid… Charge Calculated.MZ PeptideProphet.… Intensity Is.Unique Gene Protein.Accessi… Modifications Purity `Channel 126`
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <chr> <chr> <lgl> <dbl> <dbl>
1 20230220_002_… 20230220_002… SHHEDRAGHGHSADS… n[230]SHHEDRAGH… 3 691. 0.822 20767. FALSE FLG sp|P20930|FILA_… NA 1 0
2 20230220_002_… 20230220_002… RRVEHHDHAVVSGR NA 4 414. 0.806 100653. FALSE AIFM1 sp|O95831|AIFM1… NA 0.84 0
3 20230220_002_… 20230220_002… RVEHHDHAVVSGR NA 4 375. 0.999 6101583 FALSE AIFM1 sp|O95831|AIFM1… NA 0.95 0
4 20230220_002_… 20230220_002… HGSGLGHSSSHGQHG… n[230]HGSGLGHSS… 5 424. 1 412864. FALSE HRNR sp|Q86YZ3|HORN_… NA 0.98 0
5 20230220_002_… 20230220_002… HEECSRPHNGR n[230]HEECSRPHN… 4 403. 0.913 491961. TRUE THOC6 sp|Q86W42|THOC6… NA 0.95 932.
6 20230220_002_… 20230220_002… HGGEDGRNNSGAPHR n[230]HGGEDGRNN… 4 448. 0.777 438030. FALSE ACBD5 tr|A0A7I2V2Y9|A… NA 0.88 0
7 20230220_002_… 20230220_002… NTPSQHSHSIQHSPER NA 3 615. 1 785370. FALSE BCLA… sp|Q9NYF8|BCLF1… NA 0.73 0
8 20230220_002_… 20230220_002… NTPSQHSHSIQHSPER NA 4 461. 1.00 2382806 FALSE BCLA… sp|Q9NYF8|BCLF1… NA 0.69 0
9 20230220_002_… 20230220_002… SHHKDHSDSESTSSD… n[230]SHHKDHSDS… 5 506. 0.998 96035. FALSE KDM6A sp|O15550|KDM6A… NA 0.83 0
10 20230220_002_… 20230220_002… GNCNRGENDCR n[230]GNCNRGEND… 3 528. 1 597927. FALSE MBNL1 sp|Q9NR56|MBNL1… NA 0.86 6457.
# … with 170,923 more rows, and 9 more variables: `Channel 127N` <dbl>, `Channel 127C` <dbl>, `Channel 128N` <dbl>, `Channel 128C` <dbl>, `Channel 129N` <dbl>,
# `Channel 129C` <dbl>, `Channel 130N` <dbl>, `Channel 130C` <dbl>, `Channel 131N` <dbl>
>
Thanks a lot for your help, Tobi
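One possible reading of the `subscript out of bounds` error, offered as an assumption rather than a confirmed diagnosis: the call above passes `annotation` as a file path (a character string), and in base R indexing a plain character vector with `[["Channel"]]` fails with this message. The sketch below uses a hypothetical helper, `check_annotation()`, that tests whether an object is a data frame carrying the seven columns shown in the annotation table later in this thread. Whether MSstatsTMT requires exactly this set of columns is not confirmed here.

```r
# Columns taken from the annotation table shown later in this thread;
# treating them as the required set is an assumption.
required_cols <- c("Run", "Fraction", "TechRepMixture", "Mixture",
                   "Channel", "BioReplicate", "Condition")

# Hypothetical helper (not part of MSstatsTMT): returns a character vector
# of problems, which is empty when the annotation looks usable.
check_annotation <- function(annotation) {
  if (!is.data.frame(annotation)) {
    return("annotation is not a data frame (was a file path passed instead?)")
  }
  missing_cols <- setdiff(required_cols, colnames(annotation))
  if (length(missing_cols) > 0) {
    return(paste("missing column:", missing_cols))
  }
  character(0)
}

# A file path is just a character vector, so [["Channel"]] cannot work on it.
path_like <- "some/folder/combined_annotation.tsv"
path_error <- tryCatch(path_like[["Channel"]],
                       error = function(e) conditionMessage(e))

print(check_annotation(path_like))
print(path_error)
```

If the check reports that a path was passed, reading the file into a data frame first (for example with `read.delim()`) and passing that object is one thing to try.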
Hi Tobi,
We have a version that supports MSstatsTMT better. We have a tutorial about it: https://docs.google.com/document/d/1TqO9WDI3k_1FTOI1dQYV4D4nf7C9TX7Xl9AzHxYNe84/edit
But it seems like the annotation file is not structured in the expected way. I also noticed that the combined_annotation.tsv has been renamed in the latest FragPipe release. Is this file intended for MSstats import at all? Or does one need to construct the annotation file manually according to the MSstatsTMT package vignette (but PhilosophertoMSstatsTMTFormat() is the right import function to use)?
The combined_annotation.tsv is for FragPipe-Analyst. The pre-released version has another annotation file for it.
Could you please take a look and try the pre-released version?
Thanks,
Fengchao
Hi @fcyu
thanks for sharing the tutorial. I had a look and generated a corresponding MSstatsTMT_annotation.csv for my local data. It looks like:
> Dataset_44878_item_
# A tibble: 10 × 7
Run Fraction TechRepMixture Mixture Channel BioReplicate Condition
<chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
1 20230220_002_S449267_TMT10plex_T1_2_1_rep.raw 1 1 1 126 S449267_126 37_5
2 20230220_002_S449267_TMT10plex_T1_2_1_rep.raw 1 1 1 127N S449267_127N 37_1
3 20230220_002_S449267_TMT10plex_T1_2_1_rep.raw 1 1 1 127C S449267_127C 37_0.134
4 20230220_002_S449267_TMT10plex_T1_2_1_rep.raw 1 1 1 128N S449267_128N 37_0.02
5 20230220_002_S449267_TMT10plex_T1_2_1_rep.raw 1 1 1 128C S449267_128C 37_0
6 20230220_002_S449267_TMT10plex_T1_2_1_rep.raw 1 1 1 129N S449267_129N 39.3_5
7 20230220_002_S449267_TMT10plex_T1_2_1_rep.raw 1 1 1 129C S449267_129C 39.3_1
8 20230220_002_S449267_TMT10plex_T1_2_1_rep.raw 1 1 1 130N S449267_130N 39.3_0.134
9 20230220_002_S449267_TMT10plex_T1_2_1_rep.raw 1 1 1 130C S449267_130C 39.3_0.02
10 20230220_002_S449267_TMT10plex_T1_2_1_rep.raw 1 1 1 131N S449267_131N 39.3_0
But when I execute the import function I get:
> test <- MSstatsTMT::PhilosophertoMSstatsTMTFormat(path = paste0(path,"TMT10plex_T1_2"), folder = TRUE, annotation = Dataset_44878_item_)
INFO [2023-03-15 08:51:02] ** Raw data from Philosopher imported successfully.
INFO [2023-03-15 08:51:03] ** Using provided annotation.
INFO [2023-03-15 08:51:03] ** Run and Channel labels were standardized to remove symbols such as '.' or '%'.
INFO [2023-03-15 08:51:03] ** The following options are used:
- Features will be defined by the columns: PeptideSequence, PrecursorCharge
- Shared peptides will be removed.
- Proteins with single feature will not be removed.
- Features with less than 3 measurements within each run will be removed.
INFO [2023-03-15 08:51:03] ** Rows with values not greater than 0.6 in Purity are removed
INFO [2023-03-15 08:51:03] ** Rows with values not greater than 0.7 in PeptideProphetProbability are removed
INFO [2023-03-15 08:51:03] ** Sequences containing Oxidation are removed.
INFO [2023-03-15 08:51:03] ** Features with all missing measurements across channels within each run are removed.
INFO [2023-03-15 08:51:04] ** Shared peptides are removed.
INFO [2023-03-15 08:51:04] ** Features with one or two measurements across channels within each run are removed.
INFO [2023-03-15 08:51:17] ** PSMs have been aggregated to peptide ions.
INFO [2023-03-15 08:51:18] ** Run annotation merged with quantification data.
WARN [2023-03-15 08:51:18] ** Condition in the input file must match condition in annotation.
INFO [2023-03-15 08:51:19] ** Features with one or two measurements across channels within each run are removed.
INFO [2023-03-15 08:51:19] ** Fractionation handled.
INFO [2023-03-15 08:51:20] ** Updated quantification data to make balanced design. Missing values are marked by NA
INFO [2023-03-15 08:51:20] ** Finished preprocessing. The dataset is ready to be processed by the proteinSummarization function.
> head(test)
ProteinName PeptideSequence Charge PSM Mixture TechRepMixture Run Channel BioReplicate Condition
1 sp|Q9Y4H2|IRS2_HUMAN AAAAAAAAVPSAGPAGPAPTSAAGR 3 AAAAAAAAVPSAGPAGPAPTSAAGR_3 <NA> <NA> 20230220_012_S449277_TMT10plex_T1_2_11_rep 126 <NA> <NA>
2 sp|Q9Y4H2|IRS2_HUMAN AAAAAAAAVPSAGPAGPAPTSAAGR 3 AAAAAAAAVPSAGPAGPAPTSAAGR_3 <NA> <NA> 20230220_012_S449277_TMT10plex_T1_2_11_rep 127C <NA> <NA>
3 sp|Q9Y4H2|IRS2_HUMAN AAAAAAAAVPSAGPAGPAPTSAAGR 3 AAAAAAAAVPSAGPAGPAPTSAAGR_3 <NA> <NA> 20230220_012_S449277_TMT10plex_T1_2_11_rep 127N <NA> <NA>
4 sp|Q9Y4H2|IRS2_HUMAN AAAAAAAAVPSAGPAGPAPTSAAGR 3 AAAAAAAAVPSAGPAGPAPTSAAGR_3 <NA> <NA> 20230220_012_S449277_TMT10plex_T1_2_11_rep 128C <NA> <NA>
5 sp|Q9Y4H2|IRS2_HUMAN AAAAAAAAVPSAGPAGPAPTSAAGR 3 AAAAAAAAVPSAGPAGPAPTSAAGR_3 <NA> <NA> 20230220_012_S449277_TMT10plex_T1_2_11_rep 128N <NA> <NA>
6 sp|Q9Y4H2|IRS2_HUMAN AAAAAAAAVPSAGPAGPAPTSAAGR 3 AAAAAAAAVPSAGPAGPAPTSAAGR_3 <NA> <NA> 20230220_012_S449277_TMT10plex_T1_2_11_rep 129C <NA> <NA>
Intensity
1 33370.71
2 36490.16
3 31099.41
4 34305.99
5 36577.45
6 32279.51
It warns that "Condition in the input file must match condition in annotation." and only puts missing values. But I can't see any condition in the input data (the MSstats.csv file).
Does the pre-release version change the content of MSstats.csv? I can't easily change to a pre-release, since the data was processed by our scripted production pipeline at the core facility.
Best, Tobi
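The `<NA>` values above are what a merge produces when a run in the quantification data has no matching row in the annotation. In the output shown, the run in `head(test)` is `20230220_012_..._11_rep`, while the annotation table lists rows for `20230220_002_..._1_rep.raw` only. A minimal base-R sketch for listing such unmatched runs follows. The helper `unmatched_runs()` is hypothetical, the comparison is by exact string, and the vectors are toy values modeled on the file names in this thread rather than real data.

```r
# Hypothetical helper: runs present in the PSM table but absent from the
# annotation, compared as exact strings.
unmatched_runs <- function(psm_runs, annotation_runs) {
  setdiff(unique(psm_runs), unique(annotation_runs))
}

# Toy values modeled on the file names in this thread.
psm_runs <- c("20230220_002_S449267_TMT10plex_T1_2_1_rep.raw",
              "20230220_012_S449277_TMT10plex_T1_2_11_rep.raw")
annotation_runs <- rep("20230220_002_S449267_TMT10plex_T1_2_1_rep.raw", 10)

missing_runs <- unmatched_runs(psm_runs, annotation_runs)
print(missing_runs)
```

With real data, `psm_runs` would come from the `Spectrum.File` column of msstats.csv and `annotation_runs` from the `Run` column of the annotation table.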
Hi Tobi, we generated the file by working closely with Devon from Olga Vitek's lab (MSstats). He tested it extensively. Can you email me directly so I can forward your email to him? Thanks, Alexey
From: Tobias Kockmann @.> Sent: Wednesday, March 15, 2023 10:10 AM To: Nesvilab/FragPipe @.> Cc: Subscribed @.***> Subject: Re: [Nesvilab/FragPipe] msstatsTMT (Issue #510)
Hi Tobi,
Does the pre-release version change the content of MSstats.csv? I can't easily change to a pre-release, since the data was processed by our scripted production pipeline at the core facility.
I believe we changed something in Philosopher, but I can't remember what exactly, since there is no changelog for the RC versions. I suggest re-processing some of your data with the pre-release version to make sure that it works for you.
Best,
Fengchao
Thanks for the kind offer @anesvi ! I sent the email.
Hi Tobi, yes you have to use the pre-release version. We changed the msstats files that philosopher writes so they are compatible with msstatsTMT. The publicly released version is not compatible.
I fear I can't use the pre-release at the moment. Then this issue needs to wait until we have an official release.
I see this exact issue when I keep the extension (.raw, .mzml) in the Run column of the annotation table (NAs in many columns), and when I remove the extension (in your case, it would be 20230220_002_S449267_TMT10plex_T1_2_1_rep.raw to 20230220_002_S449267_TMT10plex_T1_2_1_rep), the Condition, Mixture, BioReplicate, and TechRepMixture columns go from NA to the actual names. There may also be other reasons to wait until the new release, but try changing the Run column and see if that fixes this issue.
@clairesimpson95 This depends on the content of your input data table. In my case, Spectrum File also includes the extension (.raw). They just need to match.
Yep, the way the MSstatsTMT function is checking the match is a bit weird. I just imported my data as follows and it worked fine.
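library(readr); library(dplyr); library(stringr)
# fixed() matches the literal text ".mzML"; a bare "." would act as a regex wildcard
msstats_df <- read_delim("msstats.csv", delim = ",", escape_double = FALSE, trim_ws = TRUE) %>% mutate(Spectrum.File = str_remove(Spectrum.File, fixed(".mzML")))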
So I would say, as long as you remove the .raw or .mzML in FragPipe's msstats.csv output rather than in the annotation file, it should work.
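The point made in the replies above is that the run names only need to agree between the two tables. One way to guarantee that, sketched here in base R, is to apply the same normalization to both sides before importing. The helper `strip_run_extension()` is hypothetical, it only handles the `.raw` and `.mzML` extensions mentioned in this thread, and `sample_a` is a made-up file name.

```r
# Hypothetical helper: drop a trailing .raw or .mzML (any letter case).
# The pattern is anchored at the end, so the same text elsewhere in a
# file name is left alone.
strip_run_extension <- function(x) {
  sub("\\.(raw|mzml)$", "", x, ignore.case = TRUE)
}

# Toy values: one name from this thread, one made up.
spectrum_files <- c("20230220_002_S449267_TMT10plex_T1_2_1_rep.raw",
                    "sample_a.mzML")
annotation_runs <- c("20230220_002_S449267_TMT10plex_T1_2_1_rep",
                     "sample_a")

normalized <- strip_run_extension(spectrum_files)
all_matched <- all(normalized %in% strip_run_extension(annotation_runs))
print(all_matched)
```

Normalizing both columns avoids having to remember which of the two files carried the extension.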
Hi, I am trying to use FragPipe for TMT 6-plex and can't generate files compatible with MSstats. I was wondering if I am doing anything wrong or if FragPipe/Philosopher has some sort of problem with TMT 6-plex (as I've seen in other posts).
In more detail: I am running FragPipe v21, MSFragger v4.0, IonQuant v1.10.12, and Philosopher v5.1.0. I have 3 TMT-6plex experiments, and I converted them from RAW to mzML. I tried two different approaches:
1) Followed the tutorial for multiple plexes on the website, which is basically the same as the Docs linked above.
We have a version that support MSstatsTMT better. We have a tutorial about it: https://docs.google.com/document/d/1TqO9WDI3k_1FTOI1dQYV4D4nf7C9TX7Xl9AzHxYNe84/edit
In short:
All went well, but the only msstats.csv I found was the one generated in the output folder, and it does not contain information per channel. The quantification seems to be pooled per TMT-plex. I also checked the "tmt-report" folder, but the data is already summarized to proteins, so it won't be compatible with MSstats. I also checked each TMT folder output but didn't find any msstats.csv in them.
When checking the MSstatsTMT HTML tutorial, little information is provided as to which files to use; the only information is to use "PhilosophertoMSstatsTMTFormat()", which leads me to the other test below.
2) Using "Philosopher" as the Intensity Extraction Tool.
I tried using Philosopher, which should be compatible with MSstatsTMT. However, "Philosopher Abacus" crashed, so I followed the recommendation in #1324, which is to disable "Generate reports" and "Generate MSstats files". It then finished the search, but I am still missing MSstatsTMT-compatible msstats.csv files.
Am I missing something? Or is TMT-6plex output not supported for MSstatsTMT?
I can provide the RAW (or MzML) if you need it.
For the current version, you should use Philosopher as the intensity extraction tool to generate the MSstatsTMT-compatible msstats.csv. In the future, we will make it more robust to support both Philosopher and IonQuant.
I tried using Philosopher, which should be compatible with MSstatsTMT. However, "Philosopher Abacus" crashed, so I followed the recommendation on https://github.com/Nesvilab/FragPipe/issues/1324, which is to disable "Generate reports" and "Generate MSstats files". It then finished the search, but I am still missing msstats.csv compatible files.
You need to enable "generate reports" and "generate MSstats files" to generate the TMT msstats.csv. Could you share the log from the run in which Abacus crashed?
Thanks,
Fengchao
Hi Fengchao,
Thank you for such a speedy response.
I am attaching the log file. Please, let me know if you need anything else! log_2024-02-18_13-22-23.txt
Best, Luiz
Hi Luiz,
Thanks for the log file. It looks like Abacus does not support TMT 6:
ERRO[13:22:23] unsupported number of labels
I am afraid you have to wait for the future release.
Best,
Fengchao
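For readers who want to find such a message in their own log, here is a minimal base-R sketch. It assumes error lines start with the `ERRO[` prefix seen in the excerpt above; other error formats that may occur in a FragPipe log are not covered. The helper `find_error_lines()` is hypothetical, and apart from the `ERRO` line the toy log lines are made up.

```r
# Hypothetical helper: keep lines that start with the "ERRO[" prefix seen
# in the excerpt above. Other error formats are not covered.
find_error_lines <- function(log_lines) {
  grep("^ERRO\\[", log_lines, value = TRUE)
}

# Toy log; only the ERRO line is taken from this thread.
log_lines <- c("INFO[13:22:20] starting",
               "ERRO[13:22:23] unsupported number of labels",
               "done")

error_lines <- find_error_lines(log_lines)
print(error_lines)
```

For a real log, `log_lines` would come from `readLines()` on the log file.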
I see, that's ok, at least now I know I am not doing something wrong on my end.
I won't be able to use Fragpipe to generate an output of TMT 6-plex compatible with MSstatsTMT, but assuming I would use another tool for data analysis, and that I would want information at the peptide level, do you recommend switching extraction to IonQuant or keep with Philosopher but disable both the "generate reports" and "generate MSstats files"?
If you use the TMT-Integrator reports in the tmt-report folder, there is not much difference between IonQuant and Philosopher, except that IonQuant is faster and supports the raw file format.
And yes, you need to disable "generate reports" and "generate MSstats files".
Best,
Fengchao
Hi Fengchao,
I was reading the log file and noticed that DIA-NN gets triggered even though I loaded DDA files and did not enable "Spectral library generation" or "Quant (DIA)". Is there any reason for it?
I have two additional questions unrelated to MSstatsTMT. I am listing them below, but I can move them to another issue if that works better for you.
1) I read the "Clip N-term M" description in the MSFragger wiki, but it is still unclear to me. Is it removing all N-terminal methionine during in silico generation of the peptides? Is there any specific reason why I should uncheck it?
2) When FragPipe runs out of memory, is there any way to estimate the required RAM? E.g., I have a dataset where I split the data into 25, which did not fix the problem. Only reducing the max peptide size from 50 to 25 solved the issue, and I only found that by trial and error.
Best, Luiz
Hi Luiz,
I was reading the log file and noticed that DIA-NN gets triggered even though I loaded DDA files and did not enable "Spectral library generation" or "Quant (DIA)". Is there any reason for it?
I guess what you were looking at was MSBooster using the DIA-NN spectral prediction module to predict spectra and calculate identification scores. It is not about DIA.
I read the "Clip N-term M" description in MSfragger wiki, but it is still unclear to me. Is it removing all n-terminal methionine during in silico generation of the peptides? Is there any specific reason why I should uncheck it?
It considers both: with and without the N-terminal M. It is because of the biological process that most N-terminal M is clipped in vivo.
When Fragpipe runs out of memory, is there any way to estimate the required RAM? E.g., I have a dataset where I split the data into 25, which did not fix the problem. Only reducing max peptide size from 50 to 25 that solved the issue, but I could only solve it with trial and error.
Unfortunately, no. One trick is that you need to set the mass calibration to "None" if your search space is very big, because the first search of the mass calibration does not split the database.
Best,
Fengchao
Question: For label-free data, fragpipe offers to export MSSTATS compatible output ( which is really awesome ). Do you think you could also support this for MSstatsTMT ? The input file format is different and requires a lot more info than currently can be specified in fragpipe. So while you can currently set VALIDATION -> GENERATE MSSTATS FILES to TRUE while doing a TMT analysis, it does not generate an msstats output that is compatible with msstats(TMT). Alternatively it would be nice to see in fragpipe that the msstats output can only be generated for non-TMT data.
and off topic: TMT18 plex support would be awesome!
Best Klemens