fgcz / rawrr

Access Orbitrap data in R lang using C# mono assembly - bioconductor package
https://bioconductor.org/packages/rawrr/
54 stars 9 forks source link

Expanded readIndex table #48

Closed SelbachLab closed 2 years ago

SelbachLab commented 2 years ago

Hi, I worked a lot with RawDiag in the past and really love this package, many thx! For certain reasons I need to switch to rawrr and I miss the information you already got from the index table in rawDiag. Especially the ScanDescription is missing and for QC many of the other values were of great help. Do you think there is a way to expand the exported column set similar to the one from rawDiag? It would speed up my code tremendously. My current solution is to read each spectrum and get the information from there, but it's very time consuming. All the best Henrik

tobiasko commented 2 years ago

Hi Henrik,

nice to hear that our packages are useful! Could you please upload a small code chunk that showcases how you currently use rawDiag and how you envision to use rawrr code instead? The bigger picture is: At some point we need to clean up/redo rawDiag. Meaning it would anyway use rawrr under the hood for data access. The first step is of course understanding what users would like to retrieve and design some sensible functions and managed code. The table returned by rawrr::readIndex and the table returned by rawDiag are indeed similar, but we tried to keep the readIndex-derived table very minimalistic to draw as little unneeded information as possible.

Best, Tobi

SelbachLab commented 2 years ago

Hi Tobi, I basically use it for a small PRM analyser tool, scanning for intensities of mz listed in a library. We found that rawDIAG doesn't work on our windows server (problem with temporary files, kind of unsolved) I now decided to switch to rawrr. I tried two strategies:

The first would be to use the readChromatogram function to read intensities from masses listed in a library. I found that this is too slow compared to the second approach: Here I read chunks of ms2 spectra with readSpectrum and assemble the intensities for matching mz scans on my own. Now comes the tricky part: We need to prefilter out some scans to be more fast and specific. For instance, we filter out DDA component in mixed PRM acquisitions or preScans in SureQuant runs. Especially the latter doesn't directly work for now with rawrr, since I only get the ScanDescription from the readSpectrum function. I found for now a workaround, which is to filter after readSpectrum. I understand, that the index table should obtain only the most relevant information but the complete table as obtained from RawDiag was useful as well. If this is feasible at all, an option to get a comprehensive readOut might be useful (especially for us ;) ) for some users.

All the best Henrik

tobiasko commented 2 years ago

Hi Hendrik,

do I get you right? Your final goal is to extract fragment ion chromatograms?

Regarding "I found that this is too slow compared to the second approach" I have to say that something seems to be going very wrong here, since the power of readChromatogram is that your are directly using the managed code (#C API) of the RawFileReader library to drive the chromatogram extraction. There are NO intermediate steps done in R as retrieving scans and filtering peak lists (the API manages all of this!). But that is basically your claim: Doing that manually in R is faster than using the API. I am tempted to say: That is impossible. The RawFileReader API is very strong when it comes to scan filtering and chromatogram extraction.

Maybe it would make sense if you would share some raw files and code to better understand where the performance loss is coming from.

Best, Tobi

tobiasko commented 2 years ago

It is a bit unfortunate, because the html page is currently not rendered correctly, but still:

https://fgcz-ms.uzh.ch/~cpanse/rawR/test/functional_test.html

The R code that renders the html page is also executed on our server and does a lot of chromatogram extraction from raw files using rawrr.

@cpanse somehow the html page is not rendered correctly and the link to the code is not working. Could you please fix this?

SelbachLab commented 2 years ago

Dear Tobi, sry for not being precise enough. I agree that the readChromatogram is faster compared to readSpectrum but depended on the depth of the library (and probably also based on my implementation) I found that complex libaries benefit from simply reading the spectra and extract all information to all precursors in the library falling into the corresponding isolation and RT window. Anyhow our major problem at the moment is to get access to the ScanDescription. We need to analyse some SureQuant runs. These require preScans in the acquisition, which we like to exclude from the final analysis but the only way to find these scans is from the ScanDescription. With rawrr I found, that I can access them with the readSpectrum function. Any ideas? Meanwhile I'll upload a link to a rawfile. Your help is very appreciated!

SelbachLab commented 2 years ago

Hi Tobi, here's a link to a SureQuant Example rawfile: https://mdcsft.mdc-berlin.de/download.php/1743126505/Elmo_20210911_MVB_HS_prelim_T2.raw.zip

tobiasko commented 2 years ago

What do you mean by scanDescription? Do you mean the scanType attribute of the rawrrSpectrum object? Or is that some variable/column name that is returned by rawDiag::read.raw(file = rawfile, rawDiag = FALSE)

SelbachLab commented 2 years ago

The column name is $Scan Description:.

SelbachLab commented 2 years ago

It's returned by rawDiag::read.raw and by rawrr::readSpectrum.

tobiasko commented 2 years ago

Sorry that column does not exist:

> colnames(read.raw(file = rawfile, rawDiag = TRUE))
executing mono /Library/Frameworks/R.framework/Versions/4.1/Resources/library/rawDiag/exec/fgcz_raw.exe /Library/Frameworks/R.framework/Versions/4.1/Resources/library/rawrr/extdata/sample.raw qc ...
missing column name(s): MasterScanNumber, LMCorrection, ElapsedScanTimesec, transient, AGCMode, PrescanMode
MasterScanNumber calculated
renamed LMmZCorrectionppm to LMCorrection
renamed AGCPSMode to PrescanMode
 [1] "filename"           "scanNumber"         "StartTime"          "BasePeakMass"      
 [5] "BasePeakIntensity"  "TIC"                "ScanType"           "CycleNumber"       
 [9] "MSOrder"            "MassAnalyzer"       "PrecursorMass"      "AGC"               
[13] "ChargeState"        "IonInjectionTimems" "FTResolution"       "LMCorrection"      
[17] "PrescanMode"        "MasterScanNumber"   "ElapsedScanTimesec" "AGCMode"           
[21] "transient"         
SelbachLab commented 2 years ago

I see. I think, it shows up, if you set rawDiag = F

tobiasko commented 2 years ago

Not really...

> colnames(read.raw(file = rawfile, rawDiag = FALSE))
executing mono /Library/Frameworks/R.framework/Versions/4.1/Resources/library/rawDiag/exec/fgcz_raw.exe /Library/Frameworks/R.framework/Versions/4.1/Resources/library/rawrr/extdata/sample.raw qc ...
 [1] "filename"               "scanNumber"             "ScanEventNumber"       
 [4] "StartTime"              "BasePeakMass"           "BasePeakIntensity"     
 [7] "TIC"                    "ScanType"               "CycleNumber"           
[10] "Frequency"              "HighMass"               "IonizationMode"        
[13] "MSOrder"                "MassAnalyzer"           "Detector"              
[16] "Lock"                   "PrecursorMass"          "LastPrecursorMass"     
[19] "CollisionEnergy"        "IsolationWidth"         "MultipleInjection"     
[22] "MultiInjectInfo"        "AGC"                    "MicroScanCount"        
[25] "ScanSegment"            "ScanEvent"              "MasterIndex"           
[28] "ChargeState"            "MonoisotopicmZ"         "IonInjectionTimems"    
[31] "MaxIonTimems"           "FTResolution"           "MS2IsolationWidth"     
[34] "MS2IsolationOffset"     "AGCTarget"              "HCDEnergy"             
[37] "AnalyzerTemperature"    "MassCalibration"        "ConversionParameterB"  
[40] "ConversionParameterC"   "TemperatureCompppm"     "RFCompppm"             
[43] "SpaceChargeCompppm"     "ResolutionCompppm"      "NumberofLockMasses"    
[46] "LockMass1mZ"            "LockMass2mZ"            "LockMass3mZ"           
[49] "LMSearchWindowppm"      "LMSearchWindowmmu"      "NumberofLMFound"       
[52] "LastLockingsec"         "LMmZCorrectionppm"      "IonOpticsSettings"     
[55] "SLensRFLevel"           "SLensVoltageV"          "SkimmerVoltageV"       
[58] "InjectFlatapoleOffsetV" "BentFlatapoleDCV"       "MP2andMP3RFV"          
[61] "GateLensVoltageV"       "CTrapRFV"               "DiagnosticData"        
[64] "DynamicRTShiftmin"      "IntensCompFactor"       "ResDepIntens"          
[67] "CTCDNumF"               "CTCDComp"               "CTCDScScr"             
[70] "RawOvFtT"               "LCFWHMparameter"        "Rod"                   
[73] "PSInjTimems"            "AGCPSMode"              "AGCPSDiag"             
[76] "HCDEnergyeV"            "AGCFill"                "Injectiont0"           
[79] "t0FLP"                  "AccessId"               "AnalogInput1V"         
[82] "AnalogInput2V"         

but these variables change with the instrument model. Could you please post a code chunk that shows this using your raw file?

tobiasko commented 2 years ago
> colnames(rawDiag::read.raw(file = "/Users/tobiasko/Downloads/Elmo_20210911_MVB_HS_prelim_T2.raw", rawDiag = FALSE))
executing mono /Library/Frameworks/R.framework/Versions/4.1/Resources/library/rawDiag/exec/fgcz_raw.exe /Users/tobiasko/Downloads/Elmo_20210911_MVB_HS_prelim_T2.raw qc ...
 [1] "filename"                   "scanNumber"                 "ScanEventNumber"           
 [4] "StartTime"                  "BasePeakMass"               "BasePeakIntensity"         
 [7] "TIC"                        "ScanType"                   "CycleNumber"               
[10] "Frequency"                  "HighMass"                   "IonizationMode"            
[13] "MSOrder"                    "MassAnalyzer"               "Detector"                  
[16] "Lock"                       "PrecursorMass"              "LastPrecursorMass"         
[19] "CollisionEnergy"            "IsolationWidth"             "ScanDescription"           
[22] "MultipleInjection"          "MultiInjectInfo"            "AGC"                       
[25] "MicroScanCount"             "ScanSegment"                "ScanEvent"                 
[28] "MasterIndex"                "MasterScanNumber"           "ChargeState"               
[31] "MonoisotopicmZ"             "Errorinisotopicenvelopefit" "IonInjectionTimems"        
[34] "MaxIonTimems"               "FTResolution"               "MS2IsolationWidth"         
[37] "MS2IsolationOffset"         "AGCTarget"                  "HCDEnergy"                 
[40] "HCDEnergyV"                 "AnalyzerTemperature"        "MassCalibration"           
[43] "ConversionParameterB"       "ConversionParameterC"       "TemperatureCompppm"        
[46] "RFCompppm"                  "SpaceChargeCompppm"         "ResolutionCompppm"         
[49] "NumberofLockMasses"         "LockMass1mZ"                "LockMass2mZ"               
[52] "LockMass3mZ"                "LMSearchWindowppm"          "LMSearchWindowmmu"         
[55] "NumberofLMFound"            "LastLockingsec"             "LMmZCorrectionppm"         
[58] "IonOpticsSettings"          "FunnelRFLevel"              "DiagnosticData"            
[61] "ApplicationMode"            "MildTrappingMode"           "APD"                       
[64] "OTIntensCompFactor"         "ResDepIntens"               "QTransComp"                
[67] "PrOSANumF"                  "PrOSAComp"                  "PrOSAScScr"                
[70] "RawOvFtT"                   "DynamicRTShiftmin"          "LCFWHMparameter"           
[73] "PSInjTimems"                "AGCPSMode"                  "AGCPSDiag"                 
[76] "AGCTargetAdjust"            "AGCDiag1"                   "AGCDiag2"                  
[79] "HCDabsOffset"               "SourceCIDeV"                "AGCFill"                   
[82] "Injectiont0"                "t0FLP"                      "IsoParaR"                  
[85] "InjParaR"                   "AccessId"                   "AnalogInAV"                
[88] "AnalogInBV"                 "FAIMSAttached"              "FAIMSVoltageOn"            
[91] "FAIMSCV"                    "PhiSDM"                     "PhiSDMWindowsmZ"           
[94] "PhiSDMIterations"           "PhiSDMFlagging"             "PhiSDMProcessingTimems"    
SelbachLab commented 2 years ago

Sure:

>     colnames(RAW <- read.raw(file = r,rawDiag = F))
executing mono /Library/Frameworks/R.framework/Versions/4.0/Resources/library/rawDiag/exec/fgcz_raw.exe /Volumes/MAGIC_PAGIC/MDC/ValiTest/Elmo_20210911_MVB_HS_prelim_H2.raw qc ...
 [1] "filename"                   "scanNumber"                 "ScanEventNumber"            "StartTime"                  "BasePeakMass"              
 [6] "BasePeakIntensity"          "TIC"                        "ScanType"                   "CycleNumber"                "Frequency"                 
[11] "HighMass"                   "IonizationMode"             "MSOrder"                    "MassAnalyzer"               "Detector"                  
[16] "Lock"                       "PrecursorMass"              "LastPrecursorMass"          "CollisionEnergy"            "IsolationWidth"            
[21] "ScanDescription"            "MultipleInjection"          "MultiInjectInfo"            "AGC"                        "MicroScanCount"            
[26] "ScanSegment"                "ScanEvent"                  "MasterIndex"                "MasterScanNumber"           "ChargeState"               
[31] "MonoisotopicmZ"             "Errorinisotopicenvelopefit" "IonInjectionTimems"         "MaxIonTimems"               "FTResolution"              
[36] "MS2IsolationWidth"          "MS2IsolationOffset"         "AGCTarget"                  "HCDEnergy"                  "HCDEnergyV"                
[41] "AnalyzerTemperature"        "MassCalibration"            "ConversionParameterB"       "ConversionParameterC"       "TemperatureCompppm"        
[46] "RFCompppm"                  "SpaceChargeCompppm"         "ResolutionCompppm"          "NumberofLockMasses"         "LockMass1mZ"               
[51] "LockMass2mZ"                "LockMass3mZ"                "LMSearchWindowppm"          "LMSearchWindowmmu"          "NumberofLMFound"           
[56] "LastLockingsec"             "LMmZCorrectionppm"          "IonOpticsSettings"          "FunnelRFLevel"              "DiagnosticData"            
[61] "ApplicationMode"            "MildTrappingMode"           "APD"                        "OTIntensCompFactor"         "ResDepIntens"              
[66] "QTransComp"                 "PrOSANumF"                  "PrOSAComp"                  "PrOSAScScr"                 "RawOvFtT"                  
[71] "DynamicRTShiftmin"          "LCFWHMparameter"            "PSInjTimems"                "AGCPSMode"                  "AGCPSDiag"                 
[76] "AGCTargetAdjust"            "AGCDiag1"                   "AGCDiag2"                   "HCDabsOffset"               "SourceCIDeV"               
[81] "AGCFill"                    "Injectiont0"                "t0FLP"                      "IsoParaR"                   "InjParaR"                  
[86] "AccessId"                   "AnalogInAV"                 "AnalogInBV"                 "FAIMSAttached"              "FAIMSVoltageOn"            
[91] "FAIMSCV"                    "PhiSDM"                     "PhiSDMWindowsmZ"            "PhiSDMIterations"           "PhiSDMFlagging"            
[96] "PhiSDMProcessingTimems" 
SelbachLab commented 2 years ago

btw in RawDiag it's named ScanDescription.

tobiasko commented 2 years ago

ok! Now it get it. This is a scan attribute that is not present in my data. Could it be that this is some type of free text description that you added to the LC-MS method? Or is this written by every SureQuant method template on the Exploris 480?

> Elmo <- rawDiag::read.raw(file = "/Users/tobiasko/Downloads/Elmo_20210911_MVB_HS_prelim_T2.raw", rawDiag = FALSE)
executing mono /Library/Frameworks/R.framework/Versions/4.1/Resources/library/rawDiag/exec/fgcz_raw.exe /Users/tobiasko/Downloads/Elmo_20210911_MVB_HS_prelim_T2.raw qc ...
> levels(as.factor(Elmo$ScanDescription))
 [1] ""                "SQ_ENDO_K+8_1"   "SQ_ENDO_K+8_2"   "SQ_ENDO_K+8_3"   "SQ_ENDO_R+10_1" 
 [6] "SQ_ENDO_R+10_2"  "SQ_ENDO_R+10_3"  "SQ_IS_K+8_1"     "SQ_IS_K+8_2"     "SQ_IS_K+8_3"    
[11] "SQ_IS_R+10_1"    "SQ_IS_R+10_2"    "SQ_IS_R+10_3"    "SQ_IS_R+10_4"    "SQ_QUANT_K+8_1" 
[16] "SQ_QUANT_K+8_2"  "SQ_QUANT_K+8_3"  "SQ_QUANT_R+10_1" "SQ_QUANT_R+10_2" "SQ_QUANT_R+10_3"
tobiasko commented 2 years ago

You are trying to fetch this code here?

ScanDescription_SureQuant

tobiasko commented 2 years ago

hmmm... 🤔 ok! I think a have a workaround for you that could be used for extracting the information. It uses some functions the packages isn't exporting. 😚 Some black magic! 🤪 Two simple steps: 1) def. a custom accessor function (a function that returns a function 2) tabulate the return values

> S <- rawrr::readSpectrum(rawfile = "/Users/tobiasko/Downloads/Elmo_20210911_MVB_HS_prelim_T2.raw", scan = 1:1000)
> ScanDescription <- rawrr::makeAccessor(key = "Scan Description:", returnType = "character")
> df <- rawrr:::tabulateSpectrum(S[[1]], c("scanNumber", "ScanDescription"))
> l <- rawrr:::tabulateSpectrumSet(S, c("scanNumber", "ScanDescription"))

Have a look at the source code of tabulateSpectrumSet. You could make it even more versatile by using map_df instead of lapply.

tabulateSpectrumSet <- function (x, accNames){

  #if (!requireNamespace("purrr", quietly = TRUE)) {
  #  stop("Package \"pkg\" needed for this function to work. Please install it.",
  #       call. = FALSE)
  #}

  stopifnot(all(vapply(x, is.rawrrSpectrum, TRUE)))

  lapply(x, tabulateSpectrum, accNames)
  #purrr::map_df(x, tabulateSpectrum, accNames)

}

This will return a data.frame instead a list of data.frames

The map alternative would be something like:

mapSpectrumSet <- function (x, accNames){

if (!requireNamespace("purrr", quietly = TRUE)) {
  stop("Package \"pkg\" needed for this function to work. Please install it.",
       call. = FALSE)
}

stopifnot(all(vapply(x, is.rawrrSpectrum, TRUE)))

  #lapply(x, tabulateSpectrum, accNames)
purrr::map_df(x, rawrr:::tabulateSpectrum, accNames)

}

and get a data.frame with the stuff you need:

> df <- mapSpectrumSet(S, c("scanNumber", "ScanDescription"))
> head(df)
  scanNumber ScanDescription
1          1                
2          2                
3          3                
4          4                
5          5                
6          6    SQ_IS_R+10_2

But it of course it still creates a lot of overhead, because you need to fetch the complete scan data first! But this is totally generic and can be done with any key:value pair found in a raw file!

SelbachLab commented 2 years ago

Hi, this looks very useful! I'll give it a try and tell you if it works! Black magic is definitely my thing 😄 Many thx (also for your patience, I should have explained my problem a bit better 🙈).

tobiasko commented 2 years ago

Welcome! Just let me know how it goes! But scan Description: is def. something we might consider as return value for rawrr::readIndex. We would need to check how expensive the access is or make it optional by having a parameter.

SelbachLab commented 2 years ago

Hi, just tested your code but realised, that I already have a similar approach with Spectra[[1]]$Scan Description:`` 😉 . Would be there an advantage when using your approach? Anyhow I'll cross fingers for making implementation of ScanDescription into rawrr::readIndex happening🥳 . Have nice evening and many thx!

tobiasko commented 2 years ago

The answer is

> class(S[[1]])
[1] "rawrrSpectrum"

rawrr is implemented using S3 object-oriented programming (OOP). That implies you should use methods to set and get instance attributes:

> attributes(S[[1]])
$names
 [1] "scan"                            "basePeak"                       
 [3] "TIC"                             "massRange"                      
 [5] "scanType"                        "rtinseconds"                    
 [7] "pepmass"                         "centroidStream"                 
 [9] "HasCentroidStream"               "centroid.mZ"                    
[11] "centroid.intensity"              "title"                          
[13] "charge"                          "monoisotopicMz"                 
[15] "mZ"                              "intensity"                      
[17] "Scan Description:"               "Multiple Injection:"            
[19] "Multi Inject Info:"              "AGC:"                           
[21] "Micro Scan Count:"               "Scan Segment:"                  
[23] "Scan Event:"                     "Master Index:"                  
[25] "Master Scan Number:"             "Charge State:"                  
[27] "Monoisotopic M/Z:"               "Error in isotopic envelope fit:"
[29] "Ion Injection Time (ms):"        "Max. Ion Time (ms):"            
[31] "FT Resolution:"                  "MS2 Isolation Width:"           
[33] "MS2 Isolation Offset:"           "AGC Target:"                    
[35] "HCD Energy:"                     "HCD Energy V:"                  
[37] "Analyzer Temperature:"           "=== Mass Calibration: ===:"     
[39] "Conversion Parameter B:"         "Conversion Parameter C:"        
[41] "Temperature Comp. (ppm):"        "RF Comp. (ppm):"                
[43] "Space Charge Comp. (ppm):"       "Resolution Comp. (ppm):"        
[45] "Number of Lock Masses:"          "Lock Mass #1 (m/z):"            
[47] "Lock Mass #2 (m/z):"             "Lock Mass #3 (m/z):"            
[49] "LM Search Window (ppm):"         "LM Search Window (mmu):"        
[51] "Number of LM Found:"             "Last Locking (sec):"            
[53] "LM m/z-Correction (ppm):"        "=== Ion Optics Settings: ===:"  
[55] "Funnel RF Level:"                "====  Diagnostic Data:  ====:"  
[57] "Application Mode:"               "Mild Trapping Mode:"            
[59] "APD:"                            "OT Intens Comp Factor:"         
[61] "Res. Dep. Intens:"               "Q Trans Comp:"                  
[63] "PrOSA NumF:"                     "PrOSA Comp:"                    
[65] "PrOSA ScScr:"                    "RawOvFtT:"                      
[67] "Dynamic RT Shift (min):"         "LC FWHM parameter:"             
[69] "PS Inj. Time (ms):"              "AGC PS Mode:"                   
[71] "AGC PS Diag:"                    "AGC Target Adjust:"             
[73] "AGC Diag 1:"                     "AGC Diag 2:"                    
[75] "HCD abs. Offset:"                "Source CID eV:"                 
[77] "AGC Fill:"                       "Injection t0:"                  
[79] "t0 FLP:"                         "Iso Para R:"                    
[81] "Inj Para R:"                     "Access Id:"                     
[83] "Analog In A (V):"                "Analog In B (V):"               
[85] "FAIMS Attached:"                 "FAIMS Voltage On:"              
[87] "FAIMS CV:"                       "PhiSDM:"                        
[89] "PhiSDM Windows m/z:"             "PhiSDM Iterations:"             
[91] "PhiSDM Flagging:"                "PhiSDM Processing Time ms:"     

$class
[1] "rawrrSpectrum"

Using the $ operator is kind of a dirty hack! The clean way is defining a method. Why? "The user doesn’t need to worry about details of an object because they are encapsulated behind a standard interface." So stop thinking why Thermo puts white spaces and : in keys, or refuses to use boolean values. Instead build a method a deals with this this properly. A good example is the accessor function faimsVoltageOn (already part of the package):

> faimsVoltageOn
function (x) 
{
    stopifnot(is.rawrrSpectrum(x))
    if ("FAIMS Voltage On:" %in% names(x)) {
        y <- x$`FAIMS Voltage On:`
        lookup <- c(No = "FALSE", Yes = "TRUE")
        return(as.logical(lookup[y]))
    }
    else {
        stop("FAIMS Voltage On: is not available!")
    }
}

It offers proper error handling (since attributes are not always present, depends on instrument model) and maps no/yes (character) to TRUE/FALSE.

SelbachLab commented 2 years ago

Of course, OOP is a good reason to do so. I rather was asking about the readout, but I'm aware that it's a "dirty" hack 😜 . Many thx

tobiasko commented 2 years ago

@cpanse we should discuss if it makes sense to include the Scan Description: in the data frame returned by readIndex. I think it is a rather generic description that might be import to guided targeted extraction of scans. I think it is also written by tribrid instrument models (Fusion, Lumos, ...).