
mzRAPP

The goal of mzRAPP is to allow reliability assessment of non-targeted data pre-processing (NPP; XCMS, XCMS3, MetaboanalystR 3.0, SLAW, XCMS-online, MZmine 2, MZmine 3, MS-DIAL, OpenMS, El-MAVEN, ..) in the realm of liquid chromatography high-resolution mass spectrometry (LC-HRMS). mzRAPP's approach is based on the increasing popularity of merging non-targeted with targeted metabolomics, meaning that both types of data evaluation are often performed on the same dataset. Following this assumption, mzRAPP can utilize user-provided information on a set of molecules (the more molecules the better) with known retention behavior. mzRAPP extracts and validates chromatographic peaks, for which boundaries are provided, for all (enviPat-predicted) isotopologues of those target molecules directly from mzML files. The resulting benchmark dataset is used to extract different performance metrics for NPP performed on the same mzML files. An overview of mzRAPP's capabilities is given in this < 3 min youtube video, which was originally recorded for the Metabolomics 2020 conference.

Installation

First, install the most recent version of

  1. R
  2. R Studio and
  3. Rtools (only necessary for the generation of reports, not the use of mzRAPP).

In the case of Rtools, make sure you also follow the additional setup instructions described on the webpage.

Afterward, install mzRAPP by pasting the following code into the R Studio console and pressing enter. Sometimes you are asked whether you want to update packages before installing mzRAPP. In that case, select the option “None”, since updating at this point often leads to problems with the installation. If you want to update packages, do so before or after installing mzRAPP.

if("devtools" %in% rownames(installed.packages()) == FALSE) {install.packages("devtools")}
Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS="true") #this is necessary because of a problem with the mzR package
devtools::install_github("YasinEl/mzRAPP", dependencies = TRUE, build_vignettes = TRUE)

Afterwards you can run mzRAPP using:

library(mzRAPP)
callmzRAPP()

or use mzRAPP without the shiny interface as described below.

Use examples

If you want to see examples of how to use mzRAPP and interpret its results, you can work through the use cases accessible via the code below. More general explanations are provided in the chapter ‘Generation and interpretation of NPP performance metrics’ of this readme.

library(mzRAPP)
vignette("Vignette_mzRAPP_Example_workflow")

Citation

Please cite the following publication if you use mzRAPP in your workflow:
Yasin El Abiead, Maximilian Milford, Reza M Salek, Gunda Koellensperger, mzRAPP: a tool for reliability assessment of data pre-processing in non-targeted metabolomics, Bioinformatics, 2021, btab231, https://doi.org/10.1093/bioinformatics/btab231

Benchmark dataset generation

To get started, the user has to provide mzRAPP with a list of molecules of known molecular composition together with retention time (RT) boundaries. mzRAPP generates extracted ion chromatograms for all isotopologues predictable for those molecules and applies the provided RT boundaries. Only molecules for which the theoretically most abundant isotopologue and at least one additional isotopologue are found are considered for the final benchmark. Isotopologue peaks with an area or height that is more than 30% off the predicted value, or with a Pearson correlation coefficient < 0.85 (as compared to the highest isotopologue), are removed. Since start and end times have to be provided for each compound, it is advisable to set those boundaries using a tool for manual peak curation from which peak boundaries can be exported; one example is Skyline. Boundaries can be provided per compound, or per file and compound, as described below.
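The validation logic behind this filtering can be pictured with a minimal R sketch. The table and column names below (iso_abundance_observed, iso_abundance_predicted, cor_to_highest_iso) are hypothetical and only illustrate the rules; the actual checks happen inside mzRAPP (e.g. in find_bench_peaks).

library(data.table)

validate_isotopologues <- function(peaks) {
  #relative deviation between observed and predicted isotopologue abundance
  peaks[, rel_bias := abs(iso_abundance_observed - iso_abundance_predicted) /
          iso_abundance_predicted]
  #keep isotopologue peaks within 30% of the prediction and with a
  #peak-shape correlation of at least 0.85 against the highest isotopologue
  peaks <- peaks[rel_bias <= 0.30 & cor_to_highest_iso >= 0.85]
  #keep only molecules for which at least two isotopologues survive
  #(a simplification of: most abundant isotopologue plus at least one more)
  peaks[, if (.N >= 2) .SD, by = molecule]
}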


Select mzML files

In order to generate a benchmark you need to provide your centroided mzML files. Conversion of vendor formats to mzML, as well as centroiding, can be done with ProteoWizard's MSConvert.


Select sample-group file

This csv file (click here for an example) should contain two columns:

sample_name: names of all mzML files from which peaks should be extracted (with or without file extension (.mzML))
sample_group: group labels of the respective samples (e.g. treated, untreated,..). If there is only one group this still needs to be filled out.
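As an illustration, such a file could be written from R as follows (sample and group names are placeholders):

library(data.table)

#example sample-group table; sample and group names are placeholders
grps <- data.table(
  sample_name  = c("sample_01.mzML", "sample_02.mzML", "sample_03.mzML"),
  sample_group = c("treated", "treated", "untreated")
)

data.table::fwrite(grps, "sample_group_file.csv")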

Select target file

This csv file should contain information on the target molecules/peaks (click here for an example) and include the following columns:

molecule: names of target molecules (should be unique identifiers)
SumForm_c: Molecular composition of the neutral molecule (e.g. C10H15N5O10P2). Please make sure there is never a 0 after an element symbol (as behind the N in C12H8N0S2).
main_adduct: One main adduct has to be defined for each molecule (e.g. M+H). If the main_adduct is not detected, other adducts won't be accepted either. Therefore it makes sense to select the most trusted adduct (generally M+H or M-H) as the main adduct. All adducts enabled in the enviPat package are allowed:

library(enviPat)
#> 
#>  
#>  Welcome to enviPat version 2.4 
#>  Check www.envipat.eawag.ch for an interactive online version
data(adducts)
adducts$Name
#>  [1] "M+H"            "M+NH4"          "M+Na"           "M+K"           
#>  [5] "M+"             "M-H"            "M-2H"           "M-3H"          
#>  [9] "M+FA-H"         "M+Hac-H"        "M-"             "M+3H"          
#> [13] "M+2H+Na"        "M+H+2Na"        "M+3Na"          "M+2H"          
#> [17] "M+H+NH4"        "M+H+Na"         "M+H+K"          "M+ACN+2H"      
#> [21] "M+2Na"          "M+2ACN+2H"      "M+3ACN+2H"      "M+CH3OH+H"     
#> [25] "M+ACN+H"        "M+2Na-H"        "M+IsoProp+H"    "M+ACN+Na"      
#> [29] "M+2K-H"         "M+DMSO+H"       "M+2ACN+H"       "M+IsoProp+Na+H"
#> [33] "2M+H"           "2M+NH4"         "2M+Na"          "2M+3H2O+2H"    
#> [37] "2M+K"           "2M+ACN+H"       "2M+ACN+Na"      "M-H2O-H"       
#> [41] "M+Na-2H"        "M+Cl"           "M+K-2H"         "M+Br"          
#> [45] "M+TFA-H"        "2M-H"           "2M+FA-H"        "2M+Hac-H"      
#> [49] "3M-H"

user.rtmin: Start time of peak (seconds). If possible, mzRAPP will narrow the peak boundaries to intersect with the extracted ion chromatogram at 5% of the maximum peak height. It is also worth noting that peaks for which user.rtmin/user.rtmax are provided will still be rejected if the isotopic information does not fit.
user.rtmax: End time of peak (seconds).
adduct_c: (optional) If this column is added, there has to be at least one row per molecule where adduct_c is the same as the selected main_adduct. After that, additional rows with different adducts can be added. Please note that adducts which should be screened for all molecules can be selected more easily using the adduct-selection box (see below). Therefore it only makes sense to use the adduct_c column if you want some adducts to be screened only for specific molecules.
StartTime.EIC: (optional) Start time for chromatograms extracted for this molecule (seconds). Peaks are only detected from this time on. If not given StartTime.EIC and EndTime.EIC are calculated from user.rtmin and user.rtmax.
EndTime.EIC: (optional) End time for chromatograms extracted for this molecule (seconds). Peaks are only detected up to this time.
FileName: (optional) Name of the sample file, with or without file extension. Using this allows applying different values (like user.rtmin/user.rtmax) to different files.
Additional columns: (optional) It is possible to add additional columns. Those will be kept for the final benchmark dataset.
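As an illustration, a minimal target file covering the mandatory columns could be written from R as follows (molecules, formulas, adducts, and retention times are placeholders):

library(data.table)

#example target table with the mandatory columns; values are placeholders
targets <- data.table(
  molecule    = c("ATP", "Glucose"),
  SumForm_c   = c("C10H16N5O13P3", "C6H12O6"),
  main_adduct = c("M-H", "M-H"),
  user.rtmin  = c(305, 122),  #peak start times in seconds
  user.rtmax  = c(335, 150)   #peak end times in seconds
)

data.table::fwrite(targets, "target_file.csv")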

Select instrument and resolution

This is necessary in order to apply the correct mass resolution for any given m/z value when isotopologues are predicted for different molecular formulas. All instruments enabled via the enviPat package can be selected from the enviPat resolution list. For other instruments a custom resolution list has to be uploaded as a .csv file. This .csv file has to consist of two columns:
R: Resolution value at half height of a mass peak
m/z: m/z value for the corresponding resolution
Resolution values for at least 10 equally distributed m/z values are recommended.
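As an illustration, such a custom resolution list could be written from R as follows (the resolution values below are placeholders, not real instrument specifications):

library(data.table)

#custom resolution list: resolution (R) at half height for a set of m/z values;
#the values below are placeholders, not real instrument specifications
res_table <- data.table(
  R     = c(140000, 120000, 105000, 95000, 87000, 80000, 74000, 69000, 65000, 61000),
  `m/z` = c(100, 200, 300, 400, 500, 600, 700, 800, 900, 1000)
)

data.table::fwrite(res_table, "custom_resolution_list.csv")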

Select additional adducts

You can select any number of additional adducts to screen for. These adducts will only be screened for a given molecule if the selected main_adduct has been found. Please note that (as for the main_adduct) adducts will only be added if at least two isotopologues of the respective adduct can be detected.

Setting parameters

In the next step, a few parameters have to be set:
Lowest iso. to be considered [%]: Lowest relative isotopologue abundance to be considered for each molecule (>= 0.05).
Min. # of scans per peak: Minimum number of points for a chromatographic peak to be considered as such.
mz precision [ppm]: Maximum spread of mass peaks in the mz dimension to be still considered part of the same chromatogram.
mz accuracy [ppm]: Maximum difference between the accurate mz of two ion traces to be considered to be originating from the same ion.
Processing plan: How should the benchmark generation be done? Either sequential (only using one core; often slow but does not use much RAM) or multiprocess (using multiple cores; faster but needs more RAM). A possible R-script equivalent is sketched below.
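When working without the shiny interface (see the R script further below), the processing plan presumably corresponds to a plan of the future package, which mzRAPP imports; a sketch under that assumption:

library(future)

#sequential processing (one core; slower but low RAM usage)
plan("sequential")

#or parallel processing across multiple cores (faster but needs more RAM);
#"multisession" is used here as the current multi-core plan of the future
#package - whether it maps exactly to the "multiprocess" option is an assumption
plan("multisession")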

Starting benchmark generation

Benchmark generation can be started using the blue Start button. The time needed for the generation depends on the number of mzML files, the number of target compounds, and, of course, the available computational resources. Typically this process takes minutes to hours. Afterward, the generated benchmark dataset is automatically exported to the working directory as a csv file.

How to check the benchmark

An overview of different benchmark key data is provided in the “View Benchmark” panel. The plots can be used to inspect different qualities of the dataset. For information on the calculated peak variables please check ?mzRAPP::find_bench_peaks. Also, please note that a molecule not being detected does not necessarily mean that there is no peak, but that mzRAPP was not able to validate it. This can happen since at least two isotopologues fulfilling strict criteria regarding abundance, peak shape correlation, and the number of points per peak are required for a given molecule to be retained in the benchmark. To get a better overview of picked peaks, two csv files as well as instructions for their use in Skyline can be exported. Those can (but do not have to) be used to generate a mirror image of the benchmark dataset in the free software Skyline.

When the benchmark is satisfactory it can be used for reliability assessment of non-targeted data pre-processing as explained in the following chapter.

Generate a benchmark via R-script

#load packages
library(mzRAPP)
library(enviPat)

#load the input tables described above (file paths are placeholders)
targets <- data.table::fread("target_file.csv") #table with information on target molecules as described above
grps <- data.table::fread("sample_group_file.csv") #table with the group assignment of each provided mzML file as described above
files <- list.files("path/to/mzML_files", pattern = "\\.mzML$", full.names = TRUE) #vector of mzML file names (including paths)

#load resolution data & select instrument/resolution
data("resolution_list")
res_list <- resolution_list[["OTFusion,QExactiveHF_120000@200"]]

#calculate theoretical mz values and abundances for all isotopologues at the given mass resolution using enviPat
Th_isos <- get_mz_table(targets,
                        instrumentRes = res_list
                        )

#find regions of interest (ROIs) for the theoretical isotopologues
rois <- get_ROIs(files = files,
                 Target.table = Th_isos,
                 PrecisionMZtol = 5, #mass precision of the used mass spectrometer [ppm]
                 AccurateMZtol = 5 #mass accuracy of the used mass spectrometer [ppm]
                 )

#extract/evaluate peaks
PCal <- find_bench_peaks(files = files,
                         Grps = grps,
                         CompCol_all = rois,
                         Min.PointsperPeak = 8, #minimum number of points expected from a given peak
                         max.mz.diff_ppm = 5 #mass accuracy of the used mass spectrometer [ppm]
                         )

#save the resulting benchmark to a csv file
data.table::fwrite(PCal, file = "Peak_list.csv", row.names = FALSE)

Reliability assessment of non-targeted data pre-processing

Reliability assessment of NPP can be set up in the panel “Setup NPP assessment”. First, the tool to be evaluated has to be selected. Afterward, the unaligned and aligned output files of the tool to be assessed have to be selected. How those files can be exported from the different tools is outlined below. For all of those outputs it should be noted that mzRAPP heavily depends on all isotopologues being reported in the output feature tables. For that reason, they must not be filtered out for mzRAPP to work properly.

Exporting NPP outputs from different tools

XCMS (R-version):

#unaligned file:
data.table::fwrite(xcms::peaks(xcmsSet_object), "unaligned_file.csv")

#aligned file:
data.table::fwrite(xcms::peakTable(xcmsSet_object), "aligned_file.csv")


XCMS3 (R-version):

#unaligned file:
data.table::fwrite(xcms::chromPeaks(XCMSnExp_object), "unaligned_file.csv")

#aligned file:
feature_defs <- xcms::featureDefinitions(XCMSnExp_object)
feature_areas <- xcms::featureValues(XCMSnExp_object, value = "into")

df <- merge(feature_defs, feature_areas, by=0, all=TRUE)
df <- df[,!(names(df) %in% c("peakidx"))]

data.table::fwrite(df, "aligned_file.csv")


XCMS online:
When starting a run on XCMS online, make sure that retention times are always given in seconds. After processing, download all results from XCMS online via the button “Download Results”. Afterwards, extract all results from the zipped folder.
unaligned file: select the xcms3xset.Rda file
aligned file: select the same xcms3xset.Rda file

MS-DIAL:
unaligned files: Export -> Peak list result -> [Add all files] -> [set Export format to txt]
aligned file: When performing the alignment make sure to activate the isotope tracking option in the alignment step (for most cases selecting 13C and 15N as labeling elements will be adequate). Afterwards export via: Export -> Alignment result -> [check Raw data matrix Area] -> [set Export format to msp]

MZmine 2:
unaligned files: [select all files generated in the chromatogram deconvolution step] -> Feature list methods -> Export/Import -> Export to CSV file -> [set Filename including pattern/curly brackets (e.g. blabla_{}_blabla.csv)] -> [check “Peak name”, “Peak height”, “Peak area”, “Peak RT start”, “Peak RT end”, “Peak RT”, “Peak m/z”, “Peak m/z min” and “Peak m/z max”] -> [set Filter rows to ALL]
aligned file: [select file after alignment step] -> Feature list methods -> Export/Import -> Export to CSV file -> [additional to checks set for unaligned files check “Export row retention time” and “Export row m/z”]

MZmine 3:
unaligned files: [select all files generated in the chromatogram deconvolution step] -> Feature list methods -> Export feature list -> CSV (legacy MZmine 2) -> [set Filename including pattern/curly brackets (e.g. blabla_{}_blabla.csv)] -> [check “Feature name”, “Peak height”, “Peak area”, “Feature RT start”, “Feature RT end”, “Feature RT”, “Feature m/z”, “Feature m/z min” and “Feature m/z max”] -> [set Filter rows to ALL]
aligned file: [select file after alignment step] -> Feature list methods -> Export feature list -> CSV (legacy MZmine 2) -> [additional to checks set for unaligned files check “Export row retention time” and “Export row m/z”]

El-MAVEN:
unaligned file: [click the “Export csv” button in the “Peak Table”-panel] -> Export all groups -> [select “Peaks Detailed Format Comma Delimited (.csv)”]
aligned file: [click the “Export csv” button in the “Peak Table”-panel] -> Export all groups -> [select “Groups Summary Matrix Format Comma Delimited (.csv)”]

OpenMS:
When running the FeatureFinderMetabo algorithm, make sure to set local_rt_range as well as local_mz_range to 0. You will have to check ‘Show advanced parameters’ to make those parameters visible. Also set report_convex_hulls to true.
unaligned file: [Connect a TextExporter node with separator set to ‘,’ directly to the FeatureFinderMetabo node] -> [connect TextExporter to Output Folder]
aligned file: [Connect a TextExporter node with separator set to ‘,’ directly to the FeatureLinkerUnlabeledQT node] -> [connect TextExporter to Output Folder]

PatRoon:
patRoon is not supported directly, but its results can still be used since patRoon can generate xcmsSet objects internally. These have to be loaded into mzRAPP as “XCMS” output, which has to be stated as such in the “Setup NPP assessment” tab.

xcmsSet_object <- patRoon::getXCMSSet(patRoon_features_object)

#unaligned file:
data.table::fwrite(xcms::peaks(xcmsSet_object), "unaligned_file.csv")

#aligned file:
data.table::fwrite(xcms::peakTable(xcmsSet_object), "aligned_file.csv")


MetaboanalystR 3.0:
MetaboanalystR 3.0 can be used to optimize XCMS parameters. It also allows subsequent data pre-processing within the Metaboanalyst framework, leading to a slightly different output format compared to XCMS. After applying the ‘FormatPeakList’ function from MetaboanalystR, the unaligned and aligned outputs for mzRAPP can be exported as follows:

#unaligned file:
UnalignedOP <- data.table::as.data.table(FormatPeakList_output@peakpicking[["chromPeaks"]])
data.table::fwrite(UnalignedOP, "unaligned_file.csv")

#aligned file:
AlignedOP <- data.table::as.data.table(FormatPeakList_output@dataSet)
data.table::fwrite(AlignedOP, "aligned_file.csv")


SLAW:
SLAW automatically exports all necessary files.
unaligned file: SLAW-output folder -> folder_named_like_used_algorithm (e.g. CENTWAVE) -> csv files in folder peaktables
aligned file: SLAW-output folder -> datamatrices -> csv-file named datamatrix_xxxxxxx
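Since SLAW writes one peak table per file, the unaligned csv files could be collected into a vector of paths for the R-script workflow described below (folder names are placeholders following the structure above; treating SLAW like the multiple-file case mentioned for MS-DIAL is an assumption):

#collect all unaligned SLAW peak tables into a vector of paths; the folder
#names follow the SLAW output structure described above and the algorithm
#folder ("CENTWAVE") is just an example
unaligned_files <- list.files(
  file.path("SLAW-output", "CENTWAVE", "peaktables"),
  pattern = "\\.csv$",
  full.names = TRUE
)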

any other tool:
It is also possible to use the outputs of other tools. However, they have to be adapted so that they resemble the output of one of the tools listed above. If this is not possible in your case but important to you, please contact me at yasin.el.abiead@univie.ac.at.

Selecting a benchmark dataset and starting an assessment

Next, the benchmark file has to be selected. If a benchmark has been created during this shiny session (the benchmark is still visible in the panel benchmark overview) the switch button “Use generated benchmark” can be clicked as an alternative.

After performing those steps the assessment can be started via the blue “Start assessment” button.

Perform reliability assessment via R-Script

#select the output format of which tool you would like to read in (exportable from the different tools as described above)
#options: XCMS, XCMS3, Metaboanalyst, SLAW, El-Maven, OpenMS, MS-DIAL, MZmine 3, or MZmine 2
algo = "XCMS"

#load benchmark csv file
benchmark <- 
  check_benchmark_input(
    file = "Path_to_benchmark_csv_file.csv",
    algo = algo
    )

#load non-targeted output
NPP_output <- 
  check_nonTargeted_input(
    ug_table_path = "Path_to_unaligned_Output_csv_file.csv/.txt", #in case of multiple files (e.g. for MS-DIAL) supply a vector of paths
    g_table_path = "Path_to_aligned_Output_csv_file.csv/.txt", 
    algo = algo,
    options_table = benchmark$options_table
    )

#compare benchmark with non-targeted output
comparison <- 
  compare_peaks(
    b_table = benchmark$b_table,
    ug_table = NPP_output$ug_table,
    g_table = NPP_output$g_table,
    algo = algo
    )

#generate a list including different statistics of the comparison
comp_stat <- derive_performance_metrics(comparison)

#generate different sunburst plots for overview
plot_sunburst_peaks(comp_stat, comparison)
plot_sunburst_peakQuality(comp_stat, comparison)
plot_sunburst_alignment(comp_stat)

Matching between BM and NPP output (background)

Before any NPP performance metrics can be generated, mzRAPP matches the non-targeted data pre-processing (NPP) output files against the benchmark (BM). How this is done is explained in the following.

Comparison of benchmark with non-targeted output

Matching of benchmark peaks with NT peaks before alignment:
Each benchmark peak (BP) is reported with the smallest and highest mz value contributing to the chromatographic peak. To be considered as a possible match for a BP, the mz value of an NT peak (NP) has to fall within these two values. Matching rules considering retention time (RT) are depicted in Figure 1a. An NP has to cover the whole core of a BP (referred to as “peak core” in the following) while having an RT within the borders (RTmin, RTmax) of the BP. If only a part of the peak core is covered by an NP with its RT within the borders of the BP, the NP is counted as a split peak. NPs which do not overlap with the core of a BP are not considered. In some cases, there can be more than one match between a BP and NPs (Figure 1d). In those cases, BPs corresponding to the same molecule but to other isotopologues (IT) are considered in order to choose the NP leading to the smallest relative IT-ratio bias (calculated via reported peak abundances) as compared to the predicted IT ratio (Figure 1c). When more than two different ITs are detected, the ratio of each IT to the highest IT reported via NPP is considered.
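A minimal sketch of this selection rule (function and argument names are hypothetical, not mzRAPP internals):

#pick the NP candidate whose isotopologue ratio (relative to the highest
#isotopologue of the same molecule) deviates least from the predicted ratio
pick_best_candidate <- function(candidate_areas,   #NPP areas of the candidate NPs
                                highest_iso_area,  #NPP area of the highest isotopologue
                                predicted_ratio) { #predicted IT ratio
  observed_ratios <- candidate_areas / highest_iso_area
  rel_bias <- abs(observed_ratios - predicted_ratio) / predicted_ratio
  which.min(rel_bias)  #index of the candidate with the smallest relative bias
}

#usage example with made-up numbers
pick_best_candidate(candidate_areas = c(1.2e5, 0.8e5),
                    highest_iso_area = 1.0e6,
                    predicted_ratio = 0.109)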

Figure 1 | Matching between BM and NPP output


Matching of benchmark features with non-targeted features:
For an NT feature (NF) to be considered as a match for a benchmark feature (BF), its reported mz and RT values have to lie between the lowest/highest benchmark peak mzmin/mzmax and RTmin/RTmax of the considered benchmark feature, respectively (Figure 1b). Moreover, NFs containing NPs to which BPs have been matched before alignment are also considered, even when those NFs were reported with RT and mz values not fitting the BM information. In the case of multiple matches for the same BF, the same strategy as for NPs is applied (Figure 1c and d). However, in the case of features, the mean area is calculated over all samples for which a BP is present.

Matching of non-targeted peaks to non-targeted features:
To count alignment errors (explained below) it is necessary to trace non-targeted peaks (NPs) reported before alignment to the non-targeted features (NFs) of the aligned file. This is done by matching the exact area reported in the unaligned file to the same area in the aligned file. This generally works since areas usually do not change during the alignment process. In the case of multiple area matches, the match pointing towards the aligned feature with the most other matches (in other files) is selected. If a peak cannot be found in the aligned file but was detected in the unaligned file, it is considered as lost during the alignment process, which is reported in one of the performance metrics (see below). This implies that when areas are deliberately changed after the peak picking step (other than adding missing peaks via some version of a fill-gaps algorithm), the number of peaks lost during alignment is no longer reliable. However, other alignment error counts (detailed below) could still be reliable.
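A minimal sketch of the area-matching idea (table and column names are hypothetical):

library(data.table)

#trace unaligned peaks into aligned features via their exact reported area;
#'unaligned' and 'aligned' are hypothetical data.tables with columns
#sample, area and (for the aligned table) feature_id
trace_peaks_into_features <- function(unaligned, aligned) {
  #a peak is considered recovered after alignment if its exact area
  #reappears in the aligned table for the same sample
  merge(unaligned, aligned,
        by = c("sample", "area"),
        all.x = TRUE) #peaks without a match are treated as lost during alignment
}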

Generation and interpretation of NPP performance metrics

Reliability assessment results can be inspected in the panel “View NPP assessment”. In the shiny user interface, the different performance metrics are displayed at the top of the panel. All metrics are given with an empirical confidence interval (alpha = 0.95) in percent, which is supposed to be representative of the entirety of (unknown) peaks in the provided raw data (mzML files). It is calculated by summing up the individual metrics (described below) per molecule and then bootstrapping over molecules (R = 1000). Confidence interval limits below 0 are rounded up to 0.
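The bootstrap idea could be sketched as follows using the boot package (the per_molecule table and its columns are made up for illustration and do not reflect mzRAPP's internal data structures):

library(boot)

#made-up per-molecule sums purely for illustration
per_molecule <- data.frame(found = c(3, 5, 2, 4), total = c(4, 5, 3, 5))

#metric in percent, recomputed over a bootstrap resample of molecules
metric_fun <- function(data, idx) {
  d <- data[idx, ]
  100 * sum(d$found) / sum(d$total)
}

set.seed(1)
bt <- boot(per_molecule, statistic = metric_fun, R = 1000)
boot.ci(bt, conf = 0.95, type = "perc") #empirical (percentile) confidence interval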

Found/not found peaks

The number of benchmark peaks for which a match was found among the unaligned/aligned NPP results vs all peaks present in the benchmark (as shown in Figure 2). For explanations on how the matching of benchmark peaks with non-targeted peaks is performed please read the section “Matching between BM and NPP output (background)” above.

Interpretation:
This allows estimating the percentage of peaks found/not found via non-targeted data pre-processing after the peak picking step and after feature processing. It is worth noting that peaks not detected during peak detection might still be added via fill-gaps algorithms or through misalignment of split peaks (or other peaks). Peaks that were detected during peak detection but not recovered after feature processing could be explained by alignment errors (which would be visible in the number of alignment errors or the number of peaks lost during alignment, detailed below).

Displayed in results as:
Number of benchmark peaks with at least one NPP match / Number of benchmark peaks (confidence interval)

Figure 2 | Overview of different peak populations

Split peaks

The number of split peaks that have been found for all benchmark peaks. For a graphical explanation of a split peak please check Figure 1a. It is worth noting that there can be more than one split peak per benchmark peak.

Interpretation:
Split peaks account for peaks with very poorly set integration boundaries. A high number of split peaks can introduce problems into the alignment process as well as produce a high number of artificial noise features.

Displayed in results as:
Number of split peaks found over all benchmark peaks / (Number of benchmark peaks with at least one NPP match + Number of split peaks found over all benchmark peaks) (confidence interval)

Missing peaks/values classification

The classification of not-found peaks (as defined in Figure 2) into high and low missing values is done for each benchmark feature individually (as shown in Figure 3). Since we do not want to rely on the benchmark alignment being correct, missing values are only classified if the alignment of the benchmark agrees with the alignment performed by NPP in at least one isotopologue (as also described in Figure 5). The classification is based on the lowest benchmark peak in the respective feature that has been found by the non-targeted algorithm. All benchmark peaks in this feature with a benchmark area more than 1.5 times higher than that of the lowest benchmark peak found via the non-targeted approach are considered high; otherwise they are considered low. Additionally, if no peak has been found over the whole benchmark feature, all corresponding peaks are classified as lost peaks.
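A minimal sketch of this classification rule (function and argument names are hypothetical):

#classify missing peaks within one benchmark feature; 'bm_area' are the
#benchmark areas of the missed peaks and 'lowest_found_area' is the smallest
#benchmark area found by NPP in the same feature (NA if nothing was found)
classify_missing <- function(bm_area, lowest_found_area) {
  if (is.na(lowest_found_area)) {
    rep("lost", length(bm_area)) #nothing found in the whole feature
  } else {
    ifelse(bm_area > 1.5 * lowest_found_area, "high", "low")
  }
}

#usage example with made-up areas
classify_missing(bm_area = c(2e5, 8e4), lowest_found_area = 1e5)
#> [1] "high" "low"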

Interpretation:
It is important to know if missing peaks are generally due to their low intensities or if there is a significant risk of missing higher peaks. This is especially true when missing value imputation methods are used which are often based on one of those assumptions. A high number of high missing peaks at the peak picking step is common in features in which a majority of peaks express a poor signal to noise ratio. After alignment, this can also be explained by problems in the peak alignment (would be visible in the number of alignment errors as detailed below).

Displayed in results as:
Number of high missing values / (Number of high missing values + Number of low missing values) (confidence interval)

Figure 3 | Differentiation between classes of missing peaks/values

Peak abundance quality/degenerated IR

Isotopologue abundance ratios (IR) are calculated relative to the highest isotopologue of each molecule. If the relative bias of an IR calculated from NPP-reported abundances exceeds the tolerance (outlined in Figure 4), it is reflected in this variable.

Interpretation:
IRs are used to judge the quality of reported peak abundances. Since IRs can be predicted from molecular formulas and are confirmed during benchmark generation, they can be considered a reliable reference point for non-targeted data pre-processing assessment. A high number of degenerated IRs at the peak detection step could be explained by problems in setting integration boundaries or in the extraction of chromatograms (e.g., too wide or too narrow binning of mz values). After the feature processing step it could also be explained by alignment errors (which would be visible in the number of alignment errors, as detailed below) or by problems in the fill-gaps process.

Displayed in results as:
Number of isotopologue ratios exceeding the tolerance / Number of isotopologue ratios which can be calculated using reported NPP abundances (confidence interval)

Figure 4 | Calculation of NPP abundance bias tolerance

Alignment error counting

We use a form of error counting that does not rely on the correct alignment of the benchmark dataset itself. Figure 5 shows three isotopologues (IT) of the same benchmark molecule detected in 5 samples. The color-coding indicates the feature each peak has been assigned to by the NPP algorithm. Whenever there is an asymmetry in the assignment of the different ITs, the number of steps necessary to reverse that asymmetry is counted as errors. Counting benchmark divergences, on the other hand, assumes correct alignment of the benchmark dataset. By that nature, the number of errors is a subset of all benchmark alignment divergences. Finally, lost peaks correspond to peaks which were matched at the peak detection step but are no longer present after the alignment step. All of those counts are given in the output of the NPP assessment.

Interpretation:
A high number of alignment errors can be partly due to a high number of split peaks (discussed above) or to a too wide/narrow search width in the mz or RT dimension by the NPP tool. Lost peaks can often be attributed to NPP alignment settings which delete features if they are not populated by enough peaks, or to peak areas being adapted in the feature processing step, in which case the number of lost peaks becomes unreliable (however, we are currently not aware of an NPP tool doing that). It is worth noting that some NPP alignment algorithms lose peaks by aligning only one peak per time unit and file and deleting all other peaks falling into the same time unit (e.g., xcms’ group.density()).

Displayed in results as:
Number of one of three error-types / Number of benchmark peaks for which an NPP-match has been found (confidence interval)

Figure 5 | Counting alignment errors

References

mzRAPP is based on many other R packages. Specifically, it depends on (data.table, ggplot2, shinyjs and dplyr) and imports (shinybusy, shinydashboard, shiny, shinyWidgets, doFuture, plotly, tcltk, hutils, DT, tools, retistruct, xcms, multtest, enviPat, stats, parallel, doBy, shinycssloaders, BiocParallel, doParallel, MSnbase, DescTools, signal, intervals, future.apply, foreach, S4Vectors, V8, boot, future, bit64, htmltools, kableExtra, shinythemes and knitr) packages from CRAN and Bioconductor. Moreover, we implemented code from two Stack Overflow contributions (https://stackoverflow.com/questions/37169039/direct-link-to-tabitem-with-r-shiny-dashboard?rq=1 and https://stackoverflow.com/questions/57395424/how-to-format-data-for-plotly-sunburst-diagram).