Closed chhh closed 2 months ago
Here's a good example: https://github.com/MannLabs/directlfq?tab=readme-ov-file#generic-input-format
Dear Dimitri,
Here you can find a get-started example: https://fgcz.github.io/prolfqua/articles/CreatingConfigurations.html
I will link this vignette from our README.md since, indeed, this is essential information.
I would greatly appreciate your feedback on how we can improve the vignette. Your input will be invaluable in this process. I am also happy to provide more detailed examples ASAP for your specific input.
@wolski did we not have a vignette on "analyzing a generic protein matrix" once? -> this could be helpful here as well..
Yes https://fgcz.github.io/prolfqua/articles/CreatingConfigurations.html section:
Creating a configuration for a file in wide data format
Although I think that for @chhh the section "Creating a configuration for a file in long data format" is the relevant one, since there we start from peptide-level data in long format:
The relevant code is:
atable <- prolfqua::AnalysisTableAnnotation$new()
atable$fileName = "sample"
atable$workIntensity = "abundance"
atable$hierarchy[["proteinID"]] <- "proteinID"
atable$hierarchy[["peptideID"]] <- "peptideID"
atable$factors[["group"]] <- "group"
config <- prolfqua::AnalysisConfiguration$new(atable)
analysis_data <- prolfqua::setup_analysis(dataLongFormat, config)
lfqdata <- prolfqua::LFQData$new(analysis_data, config)
The required columns in the input data.frame are proteinID, peptideID, group, abundance, and sample.
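To make that shape concrete, here is a toy long-format data.frame with exactly those columns (all values are invented for illustration); an object like this is what prolfqua::setup_analysis() above consumes:

```r
# Toy long-format input: one row per (protein, peptide, sample).
# All values are invented for illustration only.
dataLongFormat <- data.frame(
  proteinID = rep(c("P1", "P2"), each = 4),
  peptideID = rep(c("pepA", "pepB", "pepC", "pepD"), each = 2),
  sample    = rep(c("s1", "s2"), times = 4),
  group     = rep(c("Control", "Treated"), times = 4),
  abundance = c(10.2, 11.1, 9.8, 10.5, 20.3, 21.0, 19.7, 22.1)
)
head(dataLongFormat)
# This data.frame can then be passed to
# prolfqua::setup_analysis(dataLongFormat, config)
# together with the configuration shown above.
```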
We are currently moving the methods for importing data from MaxQuant, DIA-NN, and FragPipe outputs out of prolfqua to the prolfquapp package: https://github.com/prolfqua/prolfquapp. Maybe you find some of them useful.
@jjGG, we can have a look at: https://sage-docs.vercel.app/docs/configuration/quantification
Adding an adapter to their quant output in prolfquapp
should be easy. I just need a sage output (@chhh, can you share one with us, please?)
out format description
@wolski @jjGG Thank you guys, that link is very helpful: https://fgcz.github.io/prolfqua/articles/CreatingConfigurations.html
It does assume that protein grouping has already been done, though. Many tools report peptide-level quant; it would be nice to provide instructions for that use case as well.
As for a Sage example - here's one: for-prolfqua.zip. I only searched a single small file so it would be small enough to attach here.
Thank you very much for sharing the output of Sage.
We also start most of the analysis from the peptide or precursor level abundances.
The second example "Creating a configuration for a file in long data format" on https://fgcz.github.io/prolfqua/articles/CreatingConfigurations.html starts from peptide data, although in a long format.
How to roll up peptide data to protein level data is discussed here: https://fgcz.github.io/prolfqua/reference/LFQDataAggregator.html
In short, the LFQData object has a method get_Aggregator(). The LFQDataAggregator has 4 methods to roll up the data, e.g.:
LFQDataTransformer$intensity_array(.func = yourfunc, force = TRUE)
Once you have your peptide data in an lfqdata instance, the process is straightforward. You simply run:
ag <- lfqdata$get_Aggregator()
lfqProtData <- ag$medpolish()
lfqProtData will then contain the protein abundance estimates.
The aggregator also has a plot() method to generate the peptide/protein plots for all the proteins.
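A minimal sketch of the rollup-plus-plot flow described above, assuming an lfqdata object already built as in the configuration vignette (the exact structure returned by plot() may differ between prolfqua versions):

```r
# Sketch: roll peptide abundances up to protein level, then inspect the result.
# Assumes `lfqdata` is an existing LFQData instance holding peptide-level data.
ag <- lfqdata$get_Aggregator()
lfqProtData <- ag$medpolish()   # protein estimates via median polish
lfqProtData$hierarchy_counts()  # how many proteins were estimated
ag$plot()                       # peptide/protein plots for all proteins
```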
@wolski This is great, all of this should go to the front page README, I think :)
I will be giving it a try after ASMS - are you going?
@chhh
Here are examples of how to set a prolfqua LFQData object for either the "lfq.tsv" file or "results.sage.tsv"
# list the files in the archive
unzip("for-prolfqua.zip", list = TRUE)
# for lfq.tsv
library(tidyverse)
inputFile <- readr::read_tsv(unz("for-prolfqua.zip", filename="lfq.tsv"))
# transform to long
inputFileLong <- inputFile |> tidyr::pivot_longer(cols=ends_with(".raw.mzML"), values_to = "Intensity", names_to = "filename")
# create annotation
annotation <- data.frame(filename = "file_01.raw.mzML", name = "Sample_A", group = "Control")
inputData <- inner_join(annotation, inputFileLong, by = "filename")
# annotate columns
atable <- prolfqua::AnalysisTableAnnotation$new()
atable$hierarchy[["protein"]] = "proteins"
atable$hierarchy[["peptide"]] = c("peptide")
atable$ident_qValue <- "q_value"
atable$ident_Score <- "score"
atable$fileName <- "filename"
atable$sampleName <- "name"
atable$factors[["TreatmentGroup"]] <- "group"
atable$set_response("Intensity")
config <- prolfqua::AnalysisConfiguration$new(atable)
data <- prolfqua::setup_analysis(inputData, config)
lfqdata <- prolfqua::LFQData$new(data, config)
lfqdata$hierarchy_counts()
pl <- lfqdata$get_Plotter()
pl$intensity_distribution_density()
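In this example the annotation has a single row because only one file was searched. With several runs, the annotation table would have one row per raw file; a hypothetical sketch (file, sample, and group names invented):

```r
# Hypothetical annotation for a multi-run experiment: one row per raw file.
annotation <- data.frame(
  filename = c("file_01.raw.mzML", "file_02.raw.mzML",
               "file_03.raw.mzML", "file_04.raw.mzML"),
  name     = c("Sample_A", "Sample_B", "Sample_C", "Sample_D"),
  group    = c("Control", "Control", "Treated", "Treated")
)
annotation
```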
For "results.sage.tsv", you first need to aggregate the PSMs for each precursor. This example can also be adapted to FragPipe psm.tsv files.
inputFile <- readr::read_tsv(unz("for-prolfqua.zip", filename="results.sage.tsv"))
inputFile |> nrow()
inputFile |> select(filename, proteins, peptide, charge) |> distinct() |> nrow()
# roll up psm ms2_intensities
inputFile2 <- inputFile |> group_by(filename, proteins, peptide, charge, protein_q) |>
summarize(Intensity = sum(ms2_intensity, na.rm = TRUE), nr_children = n())
annotation <- data.frame(filename = "file_01.raw.mzML", name = "Sample_A", group = "Control")
inputData <- inner_join(annotation, inputFile2, by = "filename")
atable <- prolfqua::AnalysisTableAnnotation$new()
atable$hierarchy[["proteins_Id"]] = c("proteins")
atable$hierarchy[["peptides_Id"]] = c("proteins", "peptide")
atable$hierarchy[["precursor_Id"]] = c("proteins","peptide","charge")
atable$ident_qValue <- "protein_q"
atable$fileName <- "filename"
atable$sampleName <- "name"
atable$nr_children = "nr_children"
atable$factors[["TreatmentGroup"]] <- "group"
atable$set_response("Intensity")
config <- prolfqua::AnalysisConfiguration$new(atable)
data <- prolfqua::setup_analysis(inputData, config)
lfqdata <- prolfqua::LFQData$new(data, config)
lfqdata$hierarchy_counts()
pl <- lfqdata$get_Plotter()
pl$intensity_distribution_density()
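Continuing from the lfqdata object above (now holding precursor-level abundances in a protein/peptide/precursor hierarchy), the rollup to protein level uses the same aggregator pattern discussed earlier; a sketch:

```r
# Sketch: roll precursor abundances up to protein estimates,
# using the aggregator pattern shown earlier in the thread.
ag <- lfqdata$get_Aggregator()
lfqProtData <- ag$medpolish()
lfqProtData$hierarchy_counts()
```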
Is your feature request related to a problem? Please describe. I'm frustrated when I look at cryptic R examples where complex files are materialized out of thin air. If my inputs don't exactly match the provided example, I guess the only way is to resort to reading the source code to figure out which inputs are needed.
Describe the solution you'd like An example of how to prepare data compatible with the package from scratch, starting with something simple, e.g.: I have N1 LC-MS LFQ files from condition A and N2 files from condition B, and I did a search with some search engine that reports quant information at the peptide level (e.g. Sage, which does not have an input adapter yet but can report quant info). What do I do next? Which columns in the long table format are required?
Describe alternatives you've considered Not using the package.
Additional context It's a basic getting-started.