ftwkoopmans / msdap

MS-DAP: downstream analysis pipeline for quantitative proteomics
GNU General Public License v3.0
31 stars 7 forks source link

Encyclopedia proteingroups etc. #1

Closed szabzola closed 2 years ago

szabzola commented 3 years ago

Hi! Thanks for this nice work. If I import Encyclopedia results (.elib) in the protein results I have highly overlapping proteingroups (the same protein occurs in several groups). Encyclopedia may do protein inference at sample level (in .mzml.elib), and it may be only combined in quantitative reports? I tried to import quant report elib, but did not work (at leat 2 elibs are required). DIA-NN results seem to work OK, from that point of view, I am looking at the code, and exported result dump, to have a list of peptides used for DE. I can make pdf report of peptide profiles, according to description, but I would need a list, or annotation eg. in peptides.tsv. Can you suggest an easy way, of getting this, and may be filtering precursors/peptides quantitatively (eg. covariance, correlation) and not just based on number identifications? I have trouble with data tables/tibbles used... Best regards, looking forward to publication to be cited... Zoltan Szabo

ftwkoopmans commented 3 years ago

If I import Encyclopedia results (.elib) in the protein results I have highly overlapping proteingroups (the same protein occurs in several groups). Encyclopedia may do protein inference at sample level (in .mzml.elib), and it may be only combined in quantitative reports?

Yes maybe, I’m not a regular EncyclopeDIA user. The current MS-DAP implementation extracts peptide-level data, and peptide-to-proteingroup mappings, from each .elib file. If EncyclopeDIA indeed performs protein inference per sample, and doesn’t retroactively update these .elib files after project-wide protein inference, that would be a problem. I will process a few example files from the EncyclopeDIA Wiki later this week, run them through the MS-DAP import function, specifically look for this issue and see if/how we can amend.

However, if you could share your EncyclopeDIA output files (including the quantitative reports) and point me to some peptides/proteins that you observed as problematic, it’s a lot easier for me to figure out what is going on exactly.

but I would need a list, or annotation eg. in peptides.tsv. Can you suggest an easy way, of getting this, and may be filtering precursors/peptides quantitatively (eg. covariance, correlation) and not just based on number identifications? I have trouble with data tables/tibbles used...

If I understand well, you want to import some dataset (e.g. from DIA-NN) using MS-DAP and then apply your own R code to the peptide*sample matrix? Here’s an example of converting the MS-DAP peptide table from long-format to wide-format. It includes columns with protein-group IDs, “protein_id” and peptide IDs, followed by a numeric column with log2 intensity values for every sample;

library(msdap)
dataset = import_dataset_diann(filename = "c:/path/diann-report.tsv", use_normalized_intensities = T)
peptides_in_wide_format = dataset$peptides %>% pivot_wider(id_cols = c(protein_id, peptide_id), names_from = sample_id, values_from = intensity)

When applied to the LFQbench dataset (https://www.ebi.ac.uk/pride/archive/projects/PXD002952), this peptide_in_wide_format tibble would look like this (from your description, I assume this is the data/format you are looking for);

# A tibble: 41,896 x 8
   protein_id                                      peptide_id                      HYE124_TTOF6600_64var_lgillet_I150211_008 HYE124_TTOF6600_64var_lgillet_I150211_009 HYE124_TTOF6600_64var_lgillet_I150211_010 HYE124_TTOF6600_64var_lgillet_I150211_011 HYE124_TTOF6600_64var_lgillet_I150211_012 HYE124_TTOF6600_64var_lgillet_I150211_0~
   <chr>                                           <chr>                                                               <dbl>                                     <dbl>                                     <dbl>                                     <dbl>                                     <dbl>                                    <dbl>
 1 1/sp|P37108|SRP14_HUMAN                         AAAAAAAAAPAAAATAPTTAATTAATAAQ_3                                      9.27                                      9.07                                      9.08                                      9.01                                      9.02                                     9.06
 2 1/sp|P36578|RL4_HUMAN                           AAAAAAALQAK_2                                                       10.7                                      10.7                                      10.7                                      10.6                                      10.6                                     10.6 
 3 1/sp|P0A8D6|YMDB_ECOLI                          AAAAEIAVK_2                                                         NA                                         6.15                                     NA                                         6.09                                     NA                                        6.02
 4 2/tr|C8ZDU4|C8ZDU4_YEAS8/tr|C8Z8S1|C8Z8S1_YEAS8 AAAALAGGKK_2                                                         4.30                                      2.13                                      4.60                                      2.46                                      3.43                                     2.73
 5 1/sp|P00959|SYM_ECOLI                           AAAAPVTGPLADDPIQETITFDDFAK_3                                         5.42                                      6.68                                      5.47                                      6.57                                      5.35                                     6.45
 6 1/tr|C8ZBI7|C8ZBI7_YEAS8                        AAADALSDLEIKDSK_3                                                   12.4                                      11.2                                      12.2                                      11.2                                      12.1                                     11.2 
 7 1/tr|C8ZBI7|C8ZBI7_YEAS8                        AAADALSDLEIK_2                                                      11.0                                       9.93                                     10.9                                       9.82                                     10.9                                      9.81
 8 1/sp|P25738|MSYB_ECOLI                          AAADEWDER_2                                                          7.84                                      9.71                                      7.76                                      9.69                                      7.70                                     9.64
 9 1/sp|P00452|RIR1_ECOLI                          AAADLISR_2                                                          NA                                         6.98                                     NA                                         7.02                                      4.14                                     7.12
10 1/sp|P0AEG4|DSBA_ECOLI                          AAADVQLR_2                                                           7.53                                      9.31                                      7.36                                      9.30                                      7.39                                     9.19
# ... with 41,886 more rows

In case you want to run the MS-DAP pipeline (e.g. differential expression analysis with DEqMS or MSqRob, generate QC report, etc) with only the subset of peptides that pass your custom filtering routine, you could try something like this: 1) import dataset into MS-DAP 2) extract peptide data table from MS-DAP “dataset” object, then apply your own filtering routine to arrive at a set of peptide IDs that should be used 3) in the MS-DAP “dataset” object, delete all peptides that do not match your filtering criteria (output of step 2) 4) proceed with MS-DAP code as per usual

example R code snippet;

library(msdap)
dataset = import_dataset_diann(filename = "c:/path/diann-report.tsv", use_normalized_intensities = T)
dataset = import_sample_metadata(dataset, filename = "c:/path/sample_metadata.xlsx")

# convert to wide format, same as example above
peptides_in_wide_format = dataset$peptides %>% pivot_wider(id_cols = c(protein_id, peptide_id), names_from = sample_id, values_from = intensity)

# your filtering should go here.
# This example just returns the first 1000 peptides to illustrate the expected format/values within this code snippet
peptide_id__pass_filter = head(peptides_in_wide_format$peptide_id, 1000)

# subset the peptides in the MS-DAP dataset object; remove everything that is not in your accepted list
dataset$peptides = dataset$peptides %>% filter(peptide_id %in% peptide_id__pass_filter)

dataset = setup_contrasts(...)
dataset = analysis_quickstart(...)
ftwkoopmans commented 2 years ago

I've implemented a new EncyclopeDIA import function that uses the 'Quant Report' and tested it on the LFQbench dataset. Please try it and let me know whether this worked well for your dataset / use-case;

  1. copy/paste below function into the R console or your R script
  2. dataset = import_dataset_encyclopedia_v2_test(file_quant_report = "C:/path/to/quantreport.elib") (this should be all it takes to load your EncyclopeDIA quant report into MS-DAP)
  3. Because the 'Quant Report' doesn't include peptide confidence scores per sample (only 1 confidence score per peptide over all samples), you can optionally import those from the individual sample .elib files generated by EncyclopeDIA by providing the path to your samples / .elibs with the path_elib_rawfiles parameter
#' Import a label-free proteomics dataset from EncyclopeDIA
#'
#' This function imports peptide intensities, retention times and peptide-to-protein mappings from a 'Quant Report' generated by EncyclopeDIA.
#'
#' Because this 'Quant Report' does not contain confidence scores per peptide per individual sample, these can optionally be imported from the individual .elib files generated by EncyclopeDIA while processing the individual samples / raw files.
#' This will enable MS-DAP to discriminate between match-between-runs hits and peptides that were detected/identified in a sample, and consequently use this information in peptide filtering (for instance by removing peptides that are mostly based on MBR hits).
#'
#' @param file_quant_report the EncyclopeDIA 'Quant Report' .elib file
#' @param path_elib_rawfiles optional. the directory that contains the EncyclopeDIA .elib files matching each sample (mzML/dia input file to EncyclopeDIA)
#' @param confidence_threshold confidence score threshold at which a peptide is considered 'identified' (target value must be lesser than or equals)
#' @param return_decoys logical indicating whether to return decoy peptides. <not available for EncyclopeDIA at this point>
#' @importFrom RSQLite SQLite
#' @importFrom DBI dbIsValid dbExistsTable dbConnect dbListTables dbGetQuery dbDisconnect
#' @importFrom data.table as.data.table
#' @export
import_dataset_encyclopedia_v2_test = function(file_quant_report, path_elib_rawfiles = NA, confidence_threshold = 0.01, return_decoys = FALSE) {
  reset_log()
  append_log("reading EncyclopeDIA elib ...", type = "info")

  # input validation
  check_parameter_is_string(file_quant_report)
  if (!file.exists(file_quant_report)) {
    append_log(paste("Cannot find EncyclopeDIA 'Quant Report' .elib file at", file_quant_report), type = "error")
  }

  ### extract individual peptide*sample confidence scores from .elib files matching every input sample
  peptide_confidence = peptide_confidence_unique_sample_id = NULL
  # this step is optional, check if a valid parameter was provided
  if(length(path_elib_rawfiles) == 1 && !is.na(path_elib_rawfiles) && is.character(path_elib_rawfiles)) {
    # find files with elib extension
    files = dir(path_elib_rawfiles, "\\.elib$", full.names = T, ignore.case = T)

    # input validation
    if (length(files) < 2) {
      append_log(paste("There must be at least 2 .elib files at 'path_elib_rawfiles' parameter (skip this optional  parameter if files are unavailable):", path_elib_rawfiles), type = "error")
    }

    for(filename in files) {
      append_log(paste("parsing EncyclopeDIA elib file:", filename, "..."), type = "progress")

      # connect to database  +  catch errors
      db = tryCatch(DBI::dbConnect(RSQLite::SQLite(), filename, synchronous = NULL), error=function(x) NULL)
      if(length(db) == 0 || !DBI::dbIsValid(db) || tryCatch({DBI::dbExistsTable(db,"test"); FALSE}, error=function(x){TRUE})) {
        append_log(paste("Input file is not an SQLite database (which the EncyclopeDIA elib file should be):", filename), type = "error")
      }

      # extract data from SQLite database
      p = as_tibble(tryCatch(DBI::dbGetQuery(db, 'SELECT SourceFile AS sample_id, PeptideModSeq AS sequence_modified, PeptideSeq AS sequence_plain, PrecursorCharge AS charge, QValue AS confidence, IsDecoy as isdecoy FROM peptidescores WHERE isdecoy = 0', error=function(x) NULL)))

      # check against empty result
      if (nrow(p) == 0) {
        append_log(paste("Empty SQLite database table 'peptidescores':", filename), type = "error")
      }

      # check against multiple sample_id (e.g.elib summarizes over multiple raw files)
      if(n_distinct(p$sample_id) != 1) {
        append_log(paste("Skipping input that contains summarized data (eg. a EncyclopeDIA 'chromatogram library' or 'Quant Report'):", filename), type = "warning")
      } else {

        # append to results
        peptide_confidence = bind_rows(peptide_confidence, p)

        # check against duplicates; user has multiple .elib files for one mzML input file so we don't know which holds the appropriate confidence scores  (perhaps the input path/directory holds results from multiple EncyclopeDIA runs with different parameters)
        idx_sample_id = match(p$sample_id[1], peptide_confidence_unique_sample_id$sample_id)
        if(!is.na(idx_sample_id)) {
          append_log(paste(filename, "is a redundant .elib file; contains peptide data for sample", p$sample_id[1], "that was already parsed before in", peptide_confidence_unique_sample_id$filename[idx_sample_id]), type = "error")
        }
        peptide_confidence_unique_sample_id = bind_rows(peptide_confidence_unique_sample_id, tibble(sample_id=p$sample_id[1], filename=filename))
      }
      rm(p)

      # close DB connection
      DBI::dbDisconnect(db)
    }
  } else {
    append_log("Input file path with EncyclopeDIA .elib files of individual samples was not provided, will infer confidence scores at peptide-level (from 'Quant Report') instead", type = "warning")
  }

  ### quant report
  append_log(paste("parsing EncyclopeDIA 'Quant Report' elib file:", file_quant_report, "..."), type = "progress")

  # connect to database  +  catch errors
  db = tryCatch(DBI::dbConnect(RSQLite::SQLite(), file_quant_report, synchronous = NULL), error=function(x) NULL)
  if(length(db) == 0 || !DBI::dbIsValid(db) || tryCatch({DBI::dbExistsTable(db,"test"); FALSE}, error=function(x){TRUE})) {
    append_log(paste("Input file is not an SQLite database (which the EncyclopeDIA elib file should be):", file_quant_report), type = "error")
  }

  # peptide*sample intensities
  peptides = as_tibble(tryCatch(DBI::dbGetQuery(db, 'SELECT SourceFile AS sample_id, PeptideModSeq AS sequence_modified, PeptideSeq AS sequence_plain, PrecursorCharge AS charge, RTInSecondsCenter AS rt, TotalIntensity AS intensity FROM peptidequants', error=function(x) NULL)))

  # check against empty result
  if (nrow(peptides) == 0) {
    append_log(paste("Empty SQLite database table 'peptidequants':", file_quant_report), type = "error")
  }

  # if there are confidence scores from individual .elib files, use those
  if(length(peptide_confidence) > 0) {
    # check against missing files
    sid_missing = setdiff(unique(peptides$sample_id), peptide_confidence$sample_id)
    if(length(sid_missing) > 0) {
      append_log(paste("Samples present in the 'Quant Report' .elib are missing from the set of individual sample .elib files in input directory:", paste(sid_missing, collapse=", ")), type = "error")
    }

    # check against redundant entries, don't trust DB table won't change in the future
    peptide_confidence = peptide_confidence %>% arrange(confidence) %>% distinct(sample_id, sequence_modified, charge, .keep_all = T)
    # join peptide confidences with main peptide quant results
    peptides = peptides %>% left_join(peptide_confidence %>% select(sample_id, sequence_modified, charge, confidence), by=c("sample_id", "sequence_modified", "charge"))
    # MBR hits won't have a confidence value, set to 1
    peptides$confidence[!is.finite(peptides$confidence)] = 1

  } else {
    # we have to make due with confidence scores from this .elib; join at peptide-level NOT at peptide*sample level (because such data is unavailable in a 'Quant Report')
    pepscores = as_tibble(tryCatch(DBI::dbGetQuery(db, 'SELECT PeptideModSeq AS sequence_modified, PrecursorCharge AS charge, QValue AS confidence FROM peptidescores', error=function(x) NULL)))
    # check against redundant entries, don't trust DB table won't change in the future
    pepscores = pepscores %>% arrange(confidence) %>% distinct(sequence_modified, charge, .keep_all = T)
    # join peptide confidences with main peptide quant results
    peptides = peptides %>% left_join(pepscores, by=c("sequence_modified", "charge"))
    # to be safe, just assume missing confidence values are actually "great confidence" because here we use peptide-level confidence values and every peptide should be present in the input table  (so most likely a technical error or fail assumption in our parsing of the .elib)
    peptides$confidence[!is.finite(peptides$confidence)] = 0
  }

  ### pep2prot
  pep2prot = as_tibble(tryCatch(DBI::dbGetQuery(db, 'SELECT PeptideSeq AS sequence_plain, ProteinAccession AS protein_id, isDecoy AS isdecoy FROM peptidetoprotein WHERE isdecoy = 0', error=function(x) NULL)))

  # check against empty result
  if (nrow(pep2prot) == 0) {
    append_log(paste("Empty SQLite database table 'peptidetoprotein':", file_quant_report), type = "error")
  }

  # stable sorting of protein_id to prevent different protein-groups with same proteins but in different order (e.g. one might end up with protein-group A;B;C, A;C;B and B;C;A). this implementation does maintain order as provided by upstream software at least (e.g. "expected" leading protein)
  uprot_ordered = pep2prot %>% distinct(protein_id) %>% pull()

  pep2prot = pep2prot %>%
    distinct_all() %>%
    arrange(match(protein_id, uprot_ordered)) %>%
    group_by(sequence_plain) %>%
    summarise(protein_id = paste0(protein_id, collapse=";") )

  # add protein_id and decoy flag to peptides tibble
  peptides = left_join(peptides, pep2prot, by="sequence_plain")

  peptides = peptides %>% mutate(
    # convert peptide RT from seconds to minutes
    rt = rt / 60,
    # decoys removed upsteam (they're not in quant reports anyway, so we ignore @ parsing code)
    isdecoy = F)

  ### to MS-DAP dataset
  # ! re-use our generic data table parser to ensure input data standardization and maintain consistency among all data parsing routines
  # functions include; enforce data types (eg; isdecoy as boolean), format data (eg; strip extensions from filenames) and remove invalid rows (eg; intensity=0)
  # also note that, at current default settings, this function selects the 'best' peptide_id for each sequence_modified (eg; if the same sequence was found with multiple charges)
  ds = import_dataset_in_long_format(x = peptides,
                                     attributes_required = list(sample_id = "sample_id",
                                                                protein_id = "protein_id",
                                                                sequence_plain = "sequence_plain",
                                                                sequence_modified = "sequence_modified",
                                                                charge = "charge",
                                                                rt = "rt",
                                                                isdecoy = "isdecoy",
                                                                intensity = "intensity",
                                                                confidence = "confidence"),
                                     confidence_threshold = confidence_threshold,
                                     return_decoys = return_decoys,
                                     do_plot = F) # data for Cscore histograms is not available in EncyclopeDIA, so disable plotting

  ds$acquisition_mode = "dia"
  return(ds)
}
szabzola commented 2 years ago

Thanks a lot for the quick and thorough answers. Sorry, I did not have time to collect and send my files, and you gave solutions faster... I am testing your suggestions, will get back with conclusions. Thanks again