Refactor main ramclust.R function

hechth commented 2 years ago

The ramclust.R file contains a function covering the whole workflow, but the rc.*.R files actually contain the same functionality in multiple steps, which is more convenient to test and maintain.

[x] #30
[x] Replace the sections in ramclust.R with the respective sub-steps of the workflow
[ ] Implement unit tests for all functions
[x] Include a data-flow diagram and step-wise procedure in the documentation
[ ] Group lower-level functions into higher top-level functions

arpita-007 commented 1 year ago

Hi, I am using the flow you mentioned, but the function 'rc.ramclustr' is showing the following error:

RC_F <- rc.ramclustr(ramclustObj = RC_E, st = NULL,
                     sr = NULL, maxt = NULL, deepSplit = FALSE, blocksize = 2000,
                     mult = 5, hmax = NULL, collapse = TRUE,
                     minModuleSize = 2, linkage = "average",
                     cor.method = "pearson", rt.only.low.n = TRUE, fftempdir = NULL)
calculating ramclustR similarity: nblocks = 3
1 2 3
RAMClust feature similarity matrix calculated and stored:
RAMClust distances converted to distance object
fastcluster based clustering complete
dynamicTreeCut based pruning complete
RAMClust has condensed 2652 features into 444 spectra
collapsing feature into spectral signal intensities
Error in rc.ramclustr(ramclustObj = RC_E, st = NULL, sr = NULL, maxt = NULL, :
  this appears to be an older format ramclustR object and does not have a "phenoData" slot with sample names

If I use the function 'ramclustr', it is asking for xcms object. If I give xcms object, then it is telling me to do the filtering before clustering. Can you pleaseeeeee help me out!!!!! I am struggling a lot! Any help would be much appreciated.

Thank you!

cbroeckl commented 1 year ago

@arpita-007 I think this is an easy fix. It is asking you for phenotype data, which must be missing. You can add phenotype/experimental design data using the defineExperiment function, then feed the result into the rc.get.xcms.data() function through the ExpDes option.

pheno <- RAMClustR::defineExperiment()
RC <- RAMClustR::rc.get.xcms.data(ExpDes = pheno)
RC <- RAMClustR::rc.ramclustr(ramclustObj = RC)

arpita-007 commented 1 year ago

@cbroeckl Thank you so much for responding and for the guidance. Your suggestion worked. I could do the clustering after subtracting blank and normalization. But now I am getting an error in importing the msfinder.formulas.

import.msfinder.formulas(ramclustObj = RC_F, msp.dir = NULL)
Press 1 for .mat or 2 for .msp to continue 2
Error in do[[i]] : subscript out of bounds

import.msfinder.formulas(ramclustObj = RC_F, mat.dir = NULL, msp.dir = NULL)
Press 1 for .mat or 2 for .msp to continue 1
Error in do[[i]] : subscript out of bounds

import.msfinder.formulas(ramclustObj = RC_F)
Press 1 for .mat or 2 for .msp to continue 2
Error in do[[i]] : subscript out of bounds

import.msfinder.formulas(ramclustObj = RC_F, mat.dir = NULL, msp.dir = "C:/Users/DR Pallavi Lab/Documents/spectra/ms/spectra/msp")
Press 1 for .mat or 2 for .msp to continue 2
Error in do[[i]] : subscript out of bounds

Also, while exporting the data with the exportDataset() function, I am getting this:

exportDataset(ramclustObj = RC_G, which.data = "SpecAbund", label.by = "ann", appendFactors = TRUE)
Error in which(row.names(ramclustObj$ExpDes$design) == "fact1name"):(which(row.names(ramclustObj$ExpDes$design) == :
  argument of length 0

Thank you in advance!!

cbroeckl commented 1 year ago

Did you run MSFinder? You need to run this program manually using the exported .mat files as input, then run import.msfinder.formulas. If MSFinder ran, it should have written directories for each compound which contain the formula results that ramclustR imports. At this time there are no R-based tools which perform a comparable set of steps, so we are reliant on running external programs (MSFinder or Sirius are the ones I have used and currently have import functions for) for the actual MS/MS spectrum annotation.
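Since the import can only work once the external program has written its results, a quick pre-flight check on the output directory may save a confusing error. The sketch below is base R only; `has_results` is a hypothetical helper, not a RAMClustR function, and it deliberately assumes nothing about the file names MSFinder writes. It only tests that the directory exists and is not empty.

```r
# Hypothetical helper (not part of RAMClustR): TRUE only if `dir` exists
# and contains at least one file somewhere beneath it.
has_results <- function(dir) {
  if (is.null(dir) || !dir.exists(dir)) {
    return(FALSE)
  }
  length(list.files(dir, recursive = TRUE)) > 0
}

# Demonstration on a throwaway directory standing in for the MSFinder folder.
demo_dir <- tempfile("msfinder_demo_")
dir.create(demo_dir)
empty_before <- has_results(demo_dir)   # nothing written yet
file.create(file.path(demo_dir, "placeholder.txt"))
found_after <- has_results(demo_dir)    # a file is now present
```

Calling such a check on the directory you intend to pass as mat.dir or msp.dir, before calling import.msfinder.formulas, would at least distinguish "MSFinder has not been run" from other problems.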

arpita-007 commented 1 year ago

Thank you @cbroeckl!! I will do as you suggested.

arpita-007 commented 1 year ago

Hi @cbroeckl

I was using the same flow again for a different experiment and the same error appeared. I did as you suggested but it is not working.

pheno <- RAMClustR::defineExperiment()
RC <- RAMClustR::rc.get.xcms.data(xcmsObj = fill_GRP,
                                  taglocation = "pathGRP",
                                  MStag = NULL,
                                  MSMStag = NULL,
                                  ExpDes = pheno,
                                  mzdec = 3,
                                  ensure.no.na = TRUE)
RC_B <- rc.feature.replace.na(
  ramclustObj = RC,
  replace.int = 0.1,
  replace.noise = 0.1,
  replace.zero = TRUE)
replaced 445885 of 1032504 total feature values ( 43 % )
RC_C <- rc.feature.filter.blanks(ramclustObj = RC_B,
                                 qc.tag = c("QC", "sample.names.sample_group"),
                                 blank.tag = c("Blank", "sample.names.sample_group"),
                                 sn = 3, remove.blanks = TRUE)
41.1% of features move forward
df phenoData ma MSdata
Features which failed to demonstrate signal intensity of at least 3 fold greater in QC samples than in blanks were removed from the feature dataset. 25336 of 43021 features were removed.
RC_D <- rc.feature.normalize.tic(ramclustObj = RC_C)
RC_E <- rc.feature.filter.cv(ramclustObj = RC_D, qc.tag = c("QC", "sample.names.sample_group"),
                             max.cv = 0.3)
MSdata : 5477 passed the CV filter
Features were filtered based on their qc sample CV values. Only features with CV vaules less than or equal to 0.3 in MSdata set were retained. 12208 of 17685 features were removed.
RC_F <- RAMClustR::rc.ramclustr(ramclustObj = RC_E)
calculating ramclustR similarity: nblocks = 6
1 2 3 4 5 6
RAMClust feature similarity matrix calculated and stored:
RAMClust distances converted to distance object
fastcluster based clustering complete
dynamicTreeCut based pruning complete
RAMClust has condensed 5477 features into 851 spectra
collapsing feature into spectral signal intensities
Error in RAMClustR::rc.ramclustr(ramclustObj = RC_E) :
  this appears to be an older format ramclustR object and does not have a "phenoData" slot with sample names

I created an experiment design. You mentioned phenotype data. If I am not wrong, phenotype data and phenoData (shown in the error) are different. I am not sure what to do in this case.

Thank you
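For readers following the log above, the two filter messages (signal at least 3 fold greater in QC samples than in blanks, and QC CV less than or equal to 0.3) can be illustrated with a small base R sketch. This is a simplified reading of the printed messages on made-up numbers, not the code inside rc.feature.filter.blanks or rc.feature.filter.cv; the matrices `qc` and `blank` and the feature names are invented for the example.

```r
# Toy intensities: rows are injections, columns are features (made-up values).
qc <- rbind(c(100, 50, 10),
            c(110, 90, 10),
            c( 90, 10, 11))
blank <- rbind(c(10, 40,  9),
               c(12, 45, 10))
colnames(qc) <- colnames(blank) <- c("f1", "f2", "f3")

# Blank filter (assumed form): keep features whose mean QC signal is at
# least `sn` times the mean blank signal.
sn <- 3
keep_blank <- colMeans(qc) / colMeans(blank) >= sn

# CV filter (assumed form): keep features whose QC coefficient of variation
# (sd / mean) is at most `max_cv`.
max_cv <- 0.3
qc_cv <- apply(qc, 2, sd) / colMeans(qc)
keep_cv <- qc_cv <= max_cv

# Features surviving both filters.
kept <- names(which(keep_blank & keep_cv))
```

In this toy data only f1 passes both filters: f2 is too variable across QC injections and not clearly above the blanks, and f3 is stable but barely above blank level.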

cbroeckl commented 1 year ago

@arpita-007 - what does this show:

RC_F$ExpDes

RC_F$phenoData

fill_GRP@phenoData

the @phenoData slot from the xcms object should be brought to the RAMClustR object - this error suggests that this isn't happening, at least not in the way i anticipated.

arpita-007 commented 1 year ago

Then what can be done to bring the phenoData to the RAMClustR object?

cbroeckl commented 1 year ago

show me the output of these:

head(RC_F$ExpDes)

head(RC_F$phenoData)

head(fill_GRP@phenoData)

arpita-007 commented 1 year ago

RC_F is not yet created because of the error. Here is the RC_E:

head(RC_E$ExpDes)
$design
            Value         Description
Experiment  GRP           experiment name, no spaces
Species     Homo sapiens  species name
Sample      Serum         sample type
Contributor Arpita        individual and/or organizational affiliation
platform    LC-MS         GC-MS or LC-MS

$instrument
            value
chrominst   Dionex 3000
msinst      Orbitrap fusion
column      Acquity HSS T3
solvA       Water
solvB       Methanol
CE1         30 V
CE2
mstype      Orbi
msmode      Positive
ionization  ESI
colgas      Helium
msscanrange 50-1500 Da
conevolt    30 V
MSlevs      2

head(RC_E$phenoData)
   sample.names.sample_name sample.names.sample_group filenames
2  A2_QC1                   QC                        A2_QC1.mzML
4  A4_A_1                   Sample                    A4_A_1.mzML
5  A5_A_2                   Sample                    A5_A_2.mzML
7  A7_C_1                   Sample                    A7_C_1.mzML
8  A8_C_2                   Sample                    A8_C_2.mzML
10 B1_D_1                   Sample                    B1_D_1.mzML
   filepaths
2  C:\Users\Metabolomics\OneDrive\Desktop\Arpita_Mani\GR_raw data\GR_XCMS_pos\A2_QC1.mzML
4  C:\Users\Metabolomics\OneDrive\Desktop\Arpita_Mani\GR_raw data\GR_XCMS_pos\A4_A_1.mzML
5  C:\Users\Metabolomics\OneDrive\Desktop\Arpita_Mani\GR_raw data\GR_XCMS_pos\A5_A_2.mzML
7  C:\Users\Metabolomics\OneDrive\Desktop\Arpita_Mani\GR_raw data\GR_XCMS_pos\A7_C_1.mzML
8  C:\Users\Metabolomics\OneDrive\Desktop\Arpita_Mani\GR_raw data\GR_XCMS_pos\A8_C_2.mzML
10 C:\Users\Metabolomics\OneDrive\Desktop\Arpita_Mani\GR_raw data\GR_XCMS_pos\B1_D_1.mzML

head(fill_GRP@phenoData)
An object of class 'NAnnotatedDataFrame'
  rowNames: 1 2 ... 6 (6 total)
  varLabels: sample_name sample_group
  varMetadata: labelDescription
  Multiplexing: 1 - Single run

cbroeckl commented 1 year ago

what does this return?

is.null(RC_E$phenoData$sample.names)

cbroeckl commented 1 year ago

and this:

names(RC_E$phenoData)

arpita-007 commented 1 year ago

is.null(RC_E:$phenoData$sample.names)
Error: unexpected '$' in "is.null(RC_E:$"

arpita-007 commented 1 year ago

names(RC_E$phenoData)
[1] "sample.names.sample_name"  "sample.names.sample_group" "filenames"
[4] "filepaths"

arpita-007 commented 1 year ago

I tried this too:

is.null(RC_E:$phenoData$sample.names)
Error: unexpected '$' in "is.null(RC_E:$"
is.null(RC_E:$phenoData$sample.names.sample_name)
Error: unexpected '$' in "is.null(RC_E:$"

cbroeckl commented 1 year ago

I think the issue is that the first column of your RC_E$phenoData data frame is supposed to be 'sample.names', but for some reason it isn't. Try this:

names(RC_E$phenoData)[1] <- "sample.names"
RC_F <- RAMClustR::rc.ramclustr(ramclustObj = RC_E)
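The effect of that rename can be checked on a stand-in data frame without running the clustering. The sketch below uses base R only and a mock `pheno_df` with the column names reported earlier in this thread; it is not a RAMClustR object. It tests for the column by exact name with `%in%`, which avoids relying on how `$` handles partially matching names.

```r
# Mock phenoData with the column names reported in this thread.
pheno_df <- data.frame(
  sample.names.sample_name  = c("A2_QC1", "A4_A_1"),
  sample.names.sample_group = c("QC", "Sample"),
  filenames                 = c("A2_QC1.mzML", "A4_A_1.mzML"),
  stringsAsFactors = FALSE
)

# Exact-name test: is there a column called precisely "sample.names"?
has_sample_names <- function(df) "sample.names" %in% names(df)

before <- has_sample_names(pheno_df)

# Apply the rename only when the expected column is missing, so running it
# twice does not overwrite a correctly named column.
if (!has_sample_names(pheno_df)) {
  names(pheno_df)[1] <- "sample.names"
}

after <- has_sample_names(pheno_df)
```

Guarding the rename with the exact-name test keeps the workaround safe to leave in a script even after the underlying cause is fixed.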

arpita-007 commented 1 year ago

Resolved I guess!

names(RC_E$phenoData)[1] <- "sample.names"
RC_F <- RAMClustR::rc.ramclustr(ramclustObj = RC_E)
calculating ramclustR similarity: nblocks = 6
1 2 3 4 5 6
RAMClust feature similarity matrix calculated and stored:
RAMClust distances converted to distance object
fastcluster based clustering complete
dynamicTreeCut based pruning complete
RAMClust has condensed 5477 features into 854 spectra
collapsing feature into spectral signal intensities
RC_F

Call:
fastcluster::hclust(d = tmp.ramclustObj, method = linkage)

Cluster method   : average
Distance         : RAMClustR
Number of objects: 5477

cbroeckl commented 1 year ago

I am not sure why this happened. I will have to do some more homework, but this gets you moving forward.

arpita-007 commented 1 year ago

@cbroeckl Thanks a lot again :)

arpita-007 commented 1 year ago

Sorry to bother you again, but can you please tell me, for this call:

rc.get.xcms.data(xcmsObj = fill_GDMHCP, taglocation = "phenoData[,1]", MStag = NULL, MSMStag = NULL, ExpDes = pheno, mzdec = 4, ensure.no.na = FALSE)

what file should be given in MStag?

Thanks

hechth commented 1 year ago

@arpita-007 The MStag parameter is not a file. How do you indicate which files are MS1 and which are MS2? Or do you only use MS1 data?

arpita-007 commented 1 year ago

@hechth We do not have separate files for MS1 and MS2. We use single files for both. We do have MS2 data written in .mgf format by XCMS, though. Can we use that?

hechth commented 1 year ago

@arpita-007 The idea behind RAMClustR is to extract the MS1 and MS2 information from the files separately, run XCMS on each, and then align the feature tables in the peak alignment step, so that MS1 and MS2 are represented as different samples.

If you have MS2 data in mgf format from XCMS, can you check if the MS2 data is also contained in the XCMS object used in R?

cbroeckl commented 1 year ago

@arpita-007 - if you have only MS1, if i recall you can just leave it as NULL and the processing will proceed appropriately. RAMClustR doesn't currently deal with DDA-like MS/MS data.

arpita-007 commented 1 year ago

@hechth I could not locate the XCMS object containing the MS2 data. But as @cbroeckl suggested, I proceeded with MS1 only. Thanks to both of you for solving all my doubts and making it easier for me. Thank you :)

arpita-007 commented 1 year ago

Hi, can you please help me understand this error? I am getting this for one particular file only. I ran the same code for 3 files from different modes (RP pos, RP neg, HILIC pos), but I am seeing this error for my 4th file.

library(RAMClustR)

pheno <- RAMClustR::defineExperiment()
path2 <- file.path("E:/Placenta_final files/RAMClustR_clustering/PHCN_input_clustering_after corr.csv")
path2
[1] "E:/Placenta_final files/RAMClustR_clustering/PHCN_input_clustering_after corr.csv"
RC_PHCN <- ramclustR(ms = path2,
                     featdelim = "_",
                     st = 5,
                     ExpDes = pheno,
                     sampNameCol = 1)
organizing dataset
normalizing dataset
Calculating ramclustR similarity using 3 nblocks.
1 2 3
Error in ramclustObj[startv:stopv] <- column : replacement has length zero

cbroeckl commented 1 year ago

@arpita-007 - can you send me the file you are using as input? cbroeckl at colostate dot edu.

arpita-007 commented 1 year ago

The PHCN file gives the error, while PHCP was processed successfully with the same code.

PHCN_input_clustering_after corr.csv PHCP_input_clustering_after corr.csv

cbroeckl commented 1 year ago

@arpita-007 - i think this is a rare event coupled with imperfect code. the file that fails has exactly 2000 features, which happens to be what the default blocksize setting is. try setting the option in the ramclustr function: blocksize = 1200. i suspect it will run fine. let me know if this fixes it please!
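To see how a feature count that is an exact multiple of the block size could trip up block-wise processing, here is a minimal base R sketch. It is an illustration under assumptions, not the RAMClustR source: `block_ranges` is a hypothetical helper, and the "floor plus one" block count is only a guess at the kind of off-by-one involved. With 2000 features and blocksize = 2000 that count yields a second block with no features, which is the sort of situation that could lead to a "replacement has length zero" error.

```r
# Hypothetical helper: start/stop indices for `n_blocks` consecutive blocks.
block_ranges <- function(n_features, blocksize, n_blocks) {
  starts <- (seq_len(n_blocks) - 1) * blocksize + 1
  stops  <- pmin(starts + blocksize - 1, n_features)
  data.frame(start = starts, stop = stops,
             size = pmax(stops - starts + 1, 0))
}

n <- 2000

# Assumed naive count: one block too many when n is an exact multiple.
naive <- block_ranges(n, 2000, floor(n / 2000) + 1)

# Using ceiling() instead never produces an empty block.
safe <- block_ranges(n, 2000, ceiling(n / 2000))

# The suggested workaround: a blocksize that does not divide n evenly.
workaround <- block_ranges(n, 1200, floor(n / 1200) + 1)
```

Here `naive` contains a second block of size 0, `safe` has a single block of 2000 features, and `workaround` splits the features into blocks of 1200 and 800.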

arpita-007 commented 1 year ago

@cbroeckl Yes, it fixed the issue. Thank you.

hechth commented 1 year ago

@cbroeckl thanks for the proposed solution - we will implement a bugfix for that!

hechth commented 1 year ago

@arpita-007 and @cbroeckl I think we can close this issue, as most things have been addressed and resolved?

I created issues for the things which still have to be taken care of.

Most other things are addressed in the open PR #39

arpita-007 commented 1 year ago

@hechth Yes sure. Thank you!

cbroeckl / RAMClustR

Refactor main ramclust.R function #29