chhh commented 2 months ago

Is your feature request related to a problem? Please describe. I'm frustrated by cryptic R examples in which complex files materialize out of thin air. If my inputs don't exactly match the provided example, the only option seems to be reading the source code to figure out which inputs are needed.

Describe the solution you'd like An example of how to prepare data compatible with the package from scratch, starting with something simple. For example: I have N1 LC-MS LFQ files from condition A and N2 files from condition B, and I ran a search with a search engine that reports quant information at the peptide level (e.g. Sage, which can report quant info but does not have an input adapter yet). What do I do next? Which columns are required in the long table format?

Describe alternatives you've considered Not using the package.

Additional context It's a basic getting-started.

chhh commented 2 months ago

Here's a good example: https://github.com/MannLabs/directlfq?tab=readme-ov-file#generic-input-format

wolski commented 2 months ago

Dear Dimitri,

Here you can find a get-started example: https://fgcz.github.io/prolfqua/articles/CreatingConfigurations.html

I will link this vignette from our README.md since, indeed, this is essential information.

I would greatly appreciate your feedback on how we can improve the vignette. Your input will be invaluable in this process. I am also happy to provide more detailed examples ASAP for your specific input.

jjGG commented 2 months ago

@wolski did we not have a vignette on "analyzing a generic protein matrix" once? -> this could be helpful here as well..

wolski commented 2 months ago

> @wolski did we not have a vignette on "analyzing a generic protein matrix" once? -> this could be helpful here as well..

Yes, it is the section "Creating a configuration for a file in wide data format" of https://fgcz.github.io/prolfqua/articles/CreatingConfigurations.html.

However, I think that for @chhh the section "Creating a configuration for a file in long data format" is more relevant, since there we start from peptide-level data in a long format.

The relevant code is:

# map the columns of the long-format table to the prolfqua annotation
atable <- prolfqua::AnalysisTableAnnotation$new()
atable$fileName <- "sample"            # column identifying the sample/file
atable$workIntensity <- "abundance"    # column holding the intensities
atable$hierarchy[["proteinID"]] <- "proteinID"
atable$hierarchy[["peptideID"]] <- "peptideID"
atable$factors[["group"]] <- "group"   # column holding the condition
config <- prolfqua::AnalysisConfiguration$new(atable)
analysis_data <- prolfqua::setup_analysis(dataLongFormat, config)
lfqdata <- prolfqua::LFQData$new(analysis_data, config)

The required columns in the input data.frame are proteinID, peptideID, group, abundance, and sample.
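As a minimal sketch of what such a table could look like when built by hand, the following uses base R only. The protein and peptide identifiers, sample names, and abundances are invented for illustration; only the five column names come from the list above.

```r
# Hypothetical toy design: 2 samples per group, 2 proteins, 3 peptides.
# All identifiers below are invented for illustration.
peptides <- data.frame(
  proteinID = c("P1", "P1", "P2"),
  peptideID = c("PEPTIDEA", "PEPTIDEB", "PEPTIDEC"),
  stringsAsFactors = FALSE
)
samples <- data.frame(
  sample = c("A_1", "A_2", "B_1", "B_2"),
  group  = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Cartesian product: one row per peptide per sample (long format).
dataLongFormat <- merge(peptides, samples, by = NULL)

# Simulated abundances; in practice these come from your search engine output.
set.seed(1)
dataLongFormat$abundance <- round(2^rnorm(nrow(dataLongFormat), mean = 20, sd = 1))

# Check that every required column is present before handing the table on.
required <- c("proteinID", "peptideID", "group", "abundance", "sample")
stopifnot(all(required %in% colnames(dataLongFormat)))
```

A table of this shape is what the `dataLongFormat` argument in the snippet above is expected to be.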

We are currently moving the methods for importing MaxQuant, DIA-NN, and FragPipe outputs out of prolfqua and into the prolfquapp package: https://github.com/prolfqua/prolfquapp. You may find some of them useful.

wolski commented 2 months ago

@jjGG, we can have a look at: https://sage-docs.vercel.app/docs/configuration/quantification

Adding an adapter for their quant output to prolfquapp should be easy. I just need a Sage output (@chhh, can you share one with us, please?).

Output format description: https://sage-docs.vercel.app/docs/results/lfq

chhh commented 1 month ago

@wolski @jjGG Thank you guys, that link is very helpful: https://fgcz.github.io/prolfqua/articles/CreatingConfigurations.html

It does assume that protein grouping is already done, though. Many tools report peptide-level quant, so it would be nice to provide instructions for that use case as well.

As for a Sage example, here's one: for-prolfqua.zip. I only searched a single small file to make it attachable here.

wolski commented 1 month ago

Thank you very much for sharing the Sage output.

We also start most of our analyses from peptide- or precursor-level abundances.

The second example "Creating a configuration for a file in long data format" on https://fgcz.github.io/prolfqua/articles/CreatingConfigurations.html starts from peptide data, although in a long format.

How to roll up peptide data to protein level data is discussed here: https://fgcz.github.io/prolfqua/reference/LFQDataAggregator.html

In short, the LFQData object has a method get_Aggregator(). The LFQDataAggregator has four methods to roll up the data:

medpolish (Tukey's median polish, which is what MSstats does)
lmrob (what msqrob does, but you should log-transform first; you can use LFQDataTransformer$intensity_array(.func = yourfunc, force = TRUE))
mean_topN
sum_topN

Once you have your peptide data in an lfqdata instance, the process is straightforward:

ag <- lfqdata$get_Aggregator()
lfqProtData <- ag$medpolish()

lfqProtData will then hold the protein abundance estimates. The aggregator also has a plot() method to generate the peptide/protein plots for all the proteins.
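To give an intuition for what a median polish rollup computes, here is a small sketch using `stats::medpolish` from base R on a single protein. This illustrates the general technique and is not prolfqua's implementation; the log2 intensities are invented, and taking the overall effect plus the column effects as the per-sample protein estimate is an assumption about how such a rollup is typically read out.

```r
# Toy log2 peptide intensities for one protein:
# rows = peptides, columns = samples. Values are invented for illustration.
m <- matrix(
  c(20, 21, 22,
    18, 19, 20,
    22, 23, 24),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("pep1", "pep2", "pep3"), c("s1", "s2", "s3"))
)

# Fit the additive model: intensity = overall + peptide effect + sample effect + residual.
fit <- stats::medpolish(m, trace.iter = FALSE)

# Per-sample protein estimate: overall level plus the sample (column) effect.
protein_estimate <- fit$overall + fit$col
print(protein_estimate)
```

Because the peptide (row) effects are separated out, a peptide that ionizes poorly shifts its own row effect rather than dragging down the protein estimate.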

chhh commented 1 month ago

@wolski This is great, all of this should go to the front page README, I think :)

I will be giving it a try after ASMS - are you going?

wolski commented 1 month ago

@chhh

Here are examples of how to set up a prolfqua LFQData object from either the "lfq.tsv" file or the "results.sage.tsv" file.

lfq.tsv example

# list the files in the archive
unzip("for-prolfqua.zip", list = TRUE)
# for lfq.tsv
library(tidyverse)
inputFile <- readr::read_tsv(unz("for-prolfqua.zip", filename = "lfq.tsv"))
# transform to long format: one row per peptide and file
inputFileLong <- inputFile |> tidyr::pivot_longer(cols = ends_with(".raw.mzML"), values_to = "Intensity", names_to = "filename")

# create annotation
annotation <- data.frame(filename = "file_01.raw.mzML", name = "Sample_A", group = "Control")

inputData <- inner_join(annotation, inputFileLong, by = "filename")

# annotate columns
atable <- prolfqua::AnalysisTableAnnotation$new()
atable$hierarchy[["protein"]] = "proteins"
atable$hierarchy[["peptide"]] = c("peptide")
atable$ident_qValue <- "q_value"
atable$ident_Score <- "score"
atable$fileName <- "filename"
atable$sampleName <- "name"
atable$factors[["TreatmentGroup"]] <- "group"
atable$set_response("Intensity")

config <- prolfqua::AnalysisConfiguration$new(atable)
data <- prolfqua::setup_analysis(inputData, config)

lfqdata <- prolfqua::LFQData$new(data, config)

lfqdata$hierarchy_counts()
pl <- lfqdata$get_Plotter()
pl$intensity_distribution_density()
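The annotation above covers a single file. For the case raised in the issue (N1 files from condition A, N2 files from condition B), the annotation table could be built along these lines. This is a sketch with hypothetical file names; the column names `filename`, `name`, and `group` follow the example above.

```r
# Hypothetical file names; replace them with the intensity column names of your lfq.tsv.
files_A <- c("A_01.raw.mzML", "A_02.raw.mzML", "A_03.raw.mzML")
files_B <- c("B_01.raw.mzML", "B_02.raw.mzML")

annotation <- data.frame(
  filename = c(files_A, files_B),
  group = rep(c("A", "B"), times = c(length(files_A), length(files_B))),
  stringsAsFactors = FALSE
)

# Derive a short sample name by stripping the file extension.
annotation$name <- sub("\\.raw\\.mzML$", "", annotation$filename)

# Each file must appear exactly once, otherwise the join duplicates rows.
stopifnot(anyDuplicated(annotation$filename) == 0)
```

Joining this table to the long-format data by `filename` then attaches the group label to every peptide measurement.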

results.sage.tsv example

Here, you first need to aggregate the PSMs of each precursor. This example can also be adapted to FragPipe psm.tsv files.

inputFile <- readr::read_tsv(unz("for-prolfqua.zip", filename = "results.sage.tsv"))
# number of PSMs
inputFile |> nrow()

# number of distinct precursors
inputFile |> select(filename, proteins, peptide, charge) |> distinct() |> nrow()
# roll up PSM ms2_intensities to precursor level
inputFile2 <- inputFile |>
    group_by(filename, proteins, peptide, charge, protein_q) |>
    summarize(Intensity = sum(ms2_intensity, na.rm = TRUE), nr_children = n(), .groups = "drop")
annotation <- data.frame(filename = "file_01.raw.mzML", name = "Sample_A", group = "Control")
inputData <- inner_join(annotation, inputFile2, by = "filename")
atable <- prolfqua::AnalysisTableAnnotation$new()
atable$hierarchy[["proteins_Id"]] = c("proteins")
atable$hierarchy[["peptides_Id"]] = c("proteins", "peptide")
atable$hierarchy[["precursor_Id"]] = c("proteins","peptide","charge")

atable$ident_qValue <- "protein_q"
atable$fileName <- "filename"
atable$sampleName <- "name"
atable$nr_children = "nr_children"
atable$factors[["TreatmentGroup"]] <- "group"
atable$set_response("Intensity")
config <- prolfqua::AnalysisConfiguration$new(atable)

data <- prolfqua::setup_analysis(inputData, config)
lfqdata <- prolfqua::LFQData$new(data, config)

lfqdata$hierarchy_counts()
pl <- lfqdata$get_Plotter()
pl$intensity_distribution_density()
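The PSM-to-precursor rollup can be checked on a toy table before running it on real output. The sketch below uses invented values and only the column names that appear in the example above; it is meant to show what the aggregation step produces, not to reproduce Sage output.

```r
library(dplyr)

# Invented PSM table: two PSMs for one precursor (one with a missing intensity)
# and a single PSM for a second precursor.
psm <- data.frame(
  filename = "file_01.raw.mzML",
  proteins = c("P1", "P1", "P1"),
  peptide = c("PEPTIDEA", "PEPTIDEA", "PEPTIDEB"),
  charge = c(2L, 2L, 3L),
  ms2_intensity = c(100, NA, 50),
  stringsAsFactors = FALSE
)

# Sum the PSM intensities per precursor and count the contributing PSMs.
# Note: the argument is spelled na.rm; a misspelled name is not an error,
# it is silently treated as one more value to add to the sum.
precursor <- psm |>
  group_by(filename, proteins, peptide, charge) |>
  summarize(
    Intensity = sum(ms2_intensity, na.rm = TRUE),
    nr_children = n(),
    .groups = "drop"
  )
print(precursor)
```

The `nr_children` column records how many PSMs went into each precursor intensity, which is the column assigned to `atable$nr_children` in the configuration above.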

fgcz / prolfqua

Please provide an example how to create input data by hand #78
