Closed: xiweifan closed this issue 7 months ago.
Hi Xiwei, Yes, I've rewritten the peak-picking method. A "Default" method is now available that bins the data using a native peak-picking function. Cardinal's peak-picking function keeps the m/z values in a sparse data frame until they are aligned, but this is not very RAM-friendly, since we want to perform the peak alignment within the selected region (rather than across the whole image) to reduce misalignment. Can you confirm whether it is the segmentation step that requires the extra-large RAM space? If so, you will need to force the segmentation to use top features for the pixel clustering. I will incorporate this function shortly and let you know when it is finished.
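The binning idea described above can be sketched in plain R. The function below is a hypothetical illustration of tolerance-based m/z binning, not HiTMaP's actual implementation: sorted m/z values are split into bins wherever the gap to the previous value exceeds a ppm tolerance.

```r
# Hypothetical sketch of ppm-tolerance m/z binning (not HiTMaP's code).
# Sorted m/z values are grouped into a new bin whenever the relative gap
# to the previous value exceeds the given ppm tolerance; each bin is then
# represented by its mean m/z.
bin_mz <- function(mz, tol_ppm = 5) {
  mz <- sort(mz)
  gap_ppm <- c(0, diff(mz) / head(mz, -1) * 1e6)
  bin_id <- cumsum(gap_ppm > tol_ppm)
  tapply(mz, bin_id, mean)
}

# Two pairs of near-identical peaks collapse into two binned features
bin_mz(c(1000.000, 1000.002, 2000.000, 2000.004), tol_ppm = 5)
```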
Hi George,
Thank you so much for your update! Indeed, the problem was caused by Cardinal and BiocParallel. The RAM was not enough when I tried to run the commands below on their own (1.6 GB ibd file, 1.2 TB RAM, 1 thread); I believe these commands only cover data pre-processing and proteomics annotation:
datafile=c("XXX/XXX.imzML")
wd="~/expdata/"
library(HiTMaP)
imaging_identification(Thread=1,
                       datafile=paste0(wd,datafile),
                       Digestion_site="trypsin",
                       Fastadatabase="uniprot-proteome_up000005640 human.fasta",
                       output_candidatelist=T,
                       use_previous_candidates=T,
                       preprocess=list(force_preprocess=TRUE,
                                       use_preprocessRDS=TRUE,
                                       smoothSignal=list(method="Disable"),
                                       reduceBaseline=list(method="Disable"),
                                       peakPick=list(method="adaptive"),
                                       peakAlign=list(tolerance=5, units="ppm"),
                                       normalize=list(method=c("rms","tic","reference")[1],mz=1)),
                       spectra_segments_per_file=2,
                       ppm=5,
                       FDR_cutoff=0.05,
                       IMS_analysis=T,
                       plot_cluster_image_grid=F)
However, as a software user, I found several ways to reduce the file size, including exporting the imzML file from the Bruker software with the peak-list option rather than the whole spectra. Limiting the area of the region of interest can also significantly reduce the RAM and processing pressure. I have successfully run several examples so far: a 300 MB ibd file required 346 GB RAM, and a 76 MB ibd file required 213 GB RAM. I hope this helps other users size their projects.
As an HPC user, I would appreciate it if the updated features could be included in the Docker image so that we can run the program from a Singularity image. Thank you so much, and congratulations again on your impressive work!
Thanks, Xiwei
Hi Xiwei, I've edited the title of this issue to make it clearer for other readers. I've made changes to improve the statistical-analysis performance on large MSI files. Before the PCA/SKM/SCC segmentation analysis, a subset of representative features will be generated through alignment and filtering, covering the major variance with a smaller size. I've built the Docker image with the latest packages and tagged it mashuoa/hitmap:largefile. Could you please let me know whether that helps with the memory issue and gets you through the pipeline?
Hi George,
Thank you so much for your timely update! I will have a test ASAP.
Meanwhile, I am wondering whether the image rendering function has a small bug. I can produce the individual cluster ion images, but I cannot open Rplots.pdf: the software says there is nothing inside.
Below is the code I used for image rendering on Windows, since this step does need a large amount of RAM; the previous results already look perfect:
library(HiTMaP)
datafile=c("XXX/XXX.imzML")
wd="C:/test/expdata/"
imaging_identification(datafile=paste0(wd,datafile),
                       Digestion_site="trypsin",
                       Fastadatabase="uniprot-proteome_up000005640 human.fasta",
                       preprocess=list(force_preprocess=FALSE),
                       spectra_segments_per_file=9,
                       use_previous_candidates=T,
                       ppm=10,
                       FDR_cutoff=0.05,
                       IMS_analysis=F,
                       mzrange=c(500,4000),
                       plot_cluster_image_grid=T,
                       img_brightness=250,
                       plot_cluster_image_overwrite=T,
                       cluster_rds_path="/XXX/preprocessed_imdata.RDS",
                       pixel_size_um=20,
                       Plot_score_abs_cutoff=-0.1,
                       remove_score_outlier=T,
                       Protein_desc_of_interest=c("GN=XXX","GN=XXX","GN=XXX"))
library(HiTMaP)
datafile=c("XXX/XXX.imzML")
wd="C:/test/expdata/"

library(magick)
p_cluster1<-image_read(paste0(wd,"/XXX","/Summary folder/cluster Ion images/unique/XXXX_cluster_imaging.png"))
print(p_cluster1)
image_write(p_cluster1, path = "C:/test/expdata/XXX/Summary folder/cluster Ion images/unique/p_cluster1.png", format = "png")
Maybe I missed some steps in the instructions for getting the illustrated plot? Could you show me how you summarised the protein image from the peptide clustering images? I am using the nat_comms version now; maybe the latest version handles this smoothly?
Thanks, Xiwei
I have also tried this function on HPC. Almost all the code is the same as the Windows version; I only changed the paths to accommodate the different system. The error log is here:
384 Cores detected, 4 threads will be used for computing
1 files were selected and will be used for Searching
uniprot-proteome_up000005640 human.fasta was selected as database. Candidates will be generated through Proteomics mode
Found enzyme: trypsin
Found rule: ""
Found customized rule: ""
Candidate list has been loaded.
Protein feature summary...
Protein feature summary...Done.
Peptide feature summary...
Peptide feature summary...Done.
cluster image rendering...
Insufficient mz features to find the outlier
1 Protein(s) found with annotations of interest: GN=XXX
1 Protein(s) found with annotations of interest: GN=XXX
1 Protein(s) found with annotations of interest: GN=XXX
cluster imdata loaded.
GPL Ghostscript 9.53.3: Could not open temporary file /data1/pbs.9781020.pbs/gs_deQwSh
GPL Ghostscript 9.53.3: Could not open the scratch file /data1/pbs.9781020.pbs/gs_deQwSh.
Unable to open the initial device, quitting.
Error in magick_image_readpath(path, density, depth, strip, defines) :
R: unable to open image `34554_header.png': No such file or directory @ error/blob.c/OpenBlob/2924
Calls: suppressMessages -> withCallingHandlers
Execution halted
Cluster image rendering Done: (Protein 1 annotation information)
GPL Ghostscript 9.53.3: **** Could not open temporary file /data1/pbs.9781020.pbs/gs_EBu9er
GPL Ghostscript 9.53.3: Could not open the scratch file /data1/pbs.9781020.pbs/gs_EBu9er.
**** Unable to open the initial device, quitting.
Error in dev.off() : ignoring SIGPIPE signal
Calls: suppressMessages -> withCallingHandlers
Execution halted
Cluster image rendering Done: (Protein 2 annotation information)
GPL Ghostscript 9.53.3: **** Could not open temporary file /data1/pbs.9781020.pbs/gs_StATmQ
GPL Ghostscript 9.53.3: Could not open the scratch file /data1/pbs.9781020.pbs/gs_StATmQ.
**** Unable to open the initial device, quitting.
Error in magick_image_readpath(path, density, depth, strip, defines) :
  R: unable to open image `42044_header.png': No such file or directory @ error/blob.c/OpenBlob/2924
Calls: suppressMessages -> withCallingHandlers
Execution halted
Cluster image rendering Done: (Protein 3 annotation information)
Workflow done.
Hi Xiwei, thanks for the information. Rplots.pdf is a temporary file produced by magick when writing to a png or bitmap device. The files with the suffix cluster_imaging.png are the protein cluster images. For the HPC run, have you done a cluster image rendering process on the example data file? Perhaps you should go to the R terminal on the HPC and test whether you can start a png()/bitmap() device in tempdir() and in the working directory.
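The device test suggested above can be run like this in an HPC R session; the helper function and file name below are placeholders for illustration. Note that bitmap() goes through Ghostscript, so it can fail where png() succeeds:

```r
# Check that a graphics device can write a file in a given directory.
# Returns TRUE if the png device opened, drew, and produced a file.
test_device <- function(dir) {
  f <- file.path(dir, "device_test.png")
  tryCatch({
    png(f)        # open a png device writing to `f`
    plot(1:10)    # draw something so the file is non-empty
    dev.off()
    file.exists(f)
  }, error = function(e) FALSE)
}

test_device(tempdir())
test_device(getwd())
```

The same check can be repeated with bitmap(file.path(dir, "device_test.png"), type = "png16m") in place of png() to exercise the Ghostscript path specifically.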
Hi George,
I just tested the large sample. I am wondering whether the "Default" option for peak picking is still unavailable, or whether the enhancement you mentioned earlier is enough for the large database?
Meanwhile, in the HPC R terminal, I can use png() in tempdir() and in the working directory, but not bitmap().
Kind regards, Xiwei
Here is the session from the HPC R terminal:
> tempdir()
[1] "/tmp/Rtmp831P9W"
> png()
> bitmap()
Error in bitmap() : 'file' is missing with no default
> wd
[1] "~/expdata/"
> png()
> bitmap()
Error in bitmap() : 'file' is missing with no default
> tempdir()
[1] "/tmp/Rtmp831P9W"
> bitmap()
Error in bitmap() : 'file' is missing with no default
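For what it's worth, the error above is not itself a Ghostscript failure: base R's bitmap() has no default for its file argument, so it must be called with a target path. A minimal call looks like this (out.png is a placeholder name):

```r
# bitmap() requires an explicit output file; rendering then goes
# through Ghostscript, which still needs a writable temporary directory.
bitmap("out.png", type = "png16m", res = 300)
plot(1:10)
dev.off()
```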
Hi Xiwei, thanks for the input. I think you should then try mashuoa/hitmap:largefile. In the latest package I've configured png as the default device format instead of bitmap (which calls the Ghostscript library and may introduce system-path problems). "Default" and "adaptive" are both compatible with the large-file PCA/SKM/SCC analysis.
Hi George,
I am sure you have tested it yourself; however, the "Default" option does not seem to work on the HPC using hitmap_largefile.sif.
The relevant code is here; the rest was tested previously:
preprocess=list(force_preprocess=TRUE,
use_preprocessRDS=TRUE,
smoothSignal=list(method="Disable"),
reduceBaseline=list(method="Disable"),
peakPick=list(method="Default"),
peakAlign=list(tolerance=5, units="ppm"),
normalize=list(method=c("rms","tic","reference")[1],mz=1)),
Below is the error log for hitmap_largefile.sif:
384 Cores detected, 50 threads will be used for computing
1 files were selected and will be used for Searching
uniprot-proteome_up000005640 human.fasta was selected as database. Candidates will be generated through Proteomics mode
Found enzyme: trypsin
Found rule: ""
Found customized rule: ""
Testing fasta sequances for degestion site: (KR)|((?<=W)K(?=P))|((?<=M)R(?=P))
Generated 78120 Proteins in total.
Computing exact masses...
Generating peptide formula...
Generating peptide formula with adducts: M+H
Calculating peptide mz with adducts: M+H
Candidate list has been exported.
uniprot-proteome_up000005640 human.fasta was selected as database
Spectrum intensity threshold: 0.50%
mz tolerance: 5 ppm
Segmentation method: spatialKMeans
Manual segmentation def file: None
Bypass spectrum generation: FALSE
Found rotation info
Preparing image data for statistical analysis: XXX.imzML
Preparing image data for statistical analysis: XXX.imzML
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'object' in selecting a method for function 'process':
  object 'Default' of mode 'function' was not found
Calls: imaging_identification ... peakPick.method2 -> match.fun -> get -> .handleSimpleError -> h
Execution halted
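The final error in this log comes from base R's match.fun(): the call stack shows the method string being resolved via match.fun -> get, which looks up a function by name and fails because no function called Default exists in that code path. A minimal reproduction of the same message:

```r
# match.fun() resolves a character name to a function object via
# get(..., mode = "function"); an unknown name raises exactly this error.
msg <- tryCatch(match.fun("Default"),
                error = function(e) conditionMessage(e))
print(msg)
# The message reports: object 'Default' of mode 'function' was not found
```

This suggests that, in the code path being hit, "Default" is being handed to the generic processing step as if it were a function name rather than being intercepted as a method keyword.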
Many thanks, Xiwei
Hi Xiwei, can you tell me whether you are using a tolerance greater than 25 ppm? If so, the pre-processing pipeline goes through the TOF data-analysis procedure, where the Default method is not available. And please use a thread number <= 3 for large data files.
You could try the "adaptive" peak-picking method on mid-resolution MS data (FWHM > 25 ppm, resolution < 30000); the Cardinal package has done a great job on such data. Please also use smoothSignal (default "gaussian"; you can simply remove this argument from the call) and reduceBaseline (default "locmin") to process the raw data before peak picking.
BTW: complete spectra are preferred in proteomics IMS studies. They preserve the original m/z information until the tissue-wide/regional m/z alignment, which improves annotation sensitivity and accuracy.
Cheers, George
Hi George,
Below is the R file for the large file; I changed the suffix to .txt so it could be attached here. Still, the program reports the same problem...
Error log:
384 Cores detected, 1 threads will be used for computing
1 files were selected and will be used for Searching
uniprot-proteome_up000005640 human.fasta was selected as database. Candidates will be generated through Proteomics mode
Found enzyme: trypsin
Found rule: ""
Found customized rule: ""
Testing fasta sequances for degestion site: (KR)|((?<=W)K(?=P))|((?<=M)R(?=P))
Generated 78120 Proteins in total.
Computing exact masses...
Generating peptide formula...
Generating peptide formula with adducts: M+H
Calculating peptide mz with adducts: M+H
Candidate list has been exported.
uniprot-proteome_up000005640 human.fasta was selected as database
Spectrum intensity threshold: 0.50%
mz tolerance: 5 ppm
Segmentation method: spatialKMeans
Manual segmentation def file: None
Bypass spectrum generation: FALSE
Found rotation info
Preparing image data for statistical analysis: g1 cartilage-total ion count.imzML
Preparing image data for statistical analysis: g1 cartilage-total ion count.imzML
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'object' in selecting a method for function 'process':
  object 'default' of mode 'function' was not found
Calls: imaging_identification ... peakPick.method2 -> match.fun -> get -> .handleSimpleError -> h
I am wondering if there is anything wrong with my code?
Thanks, Xiwei
Hi, Xiwei, Sorry for my slow reply. I've checked the docker and my local build they both work at the moment.
I think there might be two possible reasons for this error. To rule out an outdated package, reinstall the latest build from GitHub:
install.packages("remotes")
library(remotes)
Sys.setenv("R_REMOTES_NO_ERRORS_FROM_WARNINGS" = "true")
remotes::install_github("MASHUOA/HiTMaP",force=T)
Hi George,
Thank you for your kind reply! I gave it a try this afternoon, but the job is still in the queue; people just like to submit their jobs on Friday, LOL. I will submit a new PBS job and get back to you once it is done.
Many thanks, Xiwei
Hi George,
How are you doing?
Sorry for the late response. I just ran a 3-day experiment on the HPC (50 threads, 1900 GB RAM), but it failed because it exceeded the time limit. I am wondering why this happens?
Error log:
384 Cores detected, 50 threads will be used for computing
1 files were selected and will be used for Searching
uniprot-proteome_up000005640 human.fasta was selected as database. Candidates will be generated through Proteomics mode
Found enzyme: trypsin
Found rule: ""
Found customized rule: ""
Testing fasta sequances for degestion site: (KR)|((?<=W)K(?=P))|((?<=M)R(?=P))
Generated 78120 Proteins in total.
Computing exact masses...
Generating peptide formula...
Generating peptide formula with adducts: M+H
Calculating peptide mz with adducts: M+H
Candidate list has been exported.
uniprot-proteome_up000005640 human.fasta was selected as database
Spectrum intensity threshold: 0.50%
mz tolerance: 5 ppm
Segmentation method: spatialKMeans
Manual segmentation def file: None
Bypass spectrum generation: FALSE
Found rotation info
Loading raw image data for statistical analysis: XXX.imzML
Using image data: g1 cartilage-total ion count ID/preprocessed_imdata.RDS
Segmentation in progress...
preprocess$peakAlign$tolerance missing, use default tolerance in ppm 2.5
nrows changed from 871485 to 64354
=>> PBS: job killed: walltime 259300 exceeded limit 259200
progress log:
|======================================================================| 100%
|======================================================================| 100%
|======================================================================| 100%
|                                                                      |   0%
PBS Job 9795419.pbs
R code:
datafile=c("G1/XXX.imzML")
wd="~/expdata/"
imaging_identification(
datafile=paste0(wd,datafile),
threshold=0.005,
Thread=50,
ppm=5,
FDR_cutoff = 0.05,
pixel_size_um = 20,
Digestion_site="trypsin",
missedCleavages=0:1,
Fastadatabase="uniprot-proteome_up000005640 human.fasta",
adducts=c("M+H"),
Modifications=list(fixed=NULL,fixmod_position=NULL,variable=NULL,varmod_position=NULL),
Decoy_mode = "isotope",
use_previous_candidates=F,
output_candidatelist=T,
preprocess=list(force_preprocess=FALSE,
use_preprocessRDS=TRUE,
smoothSignal=list(method="gaussian"),
reduceBaseline=list(method="locmin"),
peakPick=list(method="Default"),
peakAlign=list(tolerance=2.5, units="ppm"),
normalize=list(method=c("rms","tic","reference")[1],mz=1)),
spectra_segments_per_file=2,
Segmentation="spatialKMeans",
Smooth_range=1,
Virtual_segmentation=FALSE,
Virtual_segmentation_rankfile=NULL,
score_method="SQRTP",
peptide_ID_filter=2,
Protein_feature_summary=TRUE,
Peptide_feature_summary=TRUE,
Region_feature_summary=TRUE,
plot_cluster_image_grid=TRUE,
ClusterID_colname="Protein",
componentID_colname="Peptide",
Protein_desc_of_interest=c("GN=XXX","GN=XXX","GN=XXX"),
Rotate_IMG=NULL
)
library(HiTMaP)
datafile=c("G1/XXX.imzML")
wd="~/expdata/"
imaging_identification(datafile=paste0(wd,datafile),
                       Digestion_site="trypsin",
                       Fastadatabase="uniprot-proteome_up000005640 human.fasta",
                       preprocess=list(force_preprocess=FALSE),
                       spectra_segments_per_file=9,
                       use_previous_candidates=T,
                       ppm=10,
                       FDR_cutoff=0.05,
                       IMS_analysis=F,
                       mzrange=c(500,4000),
                       plot_cluster_image_grid=T,
                       img_brightness=250,
                       plot_cluster_image_overwrite=T,
                       cluster_rds_path="/XXX ID/preprocessed_imdata.RDS",
                       pixel_size_um=20,
                       Plot_score_abs_cutoff=-0.1,
                       remove_score_outlier=T,
                       export_Header_table=T,
                       Protein_desc_of_interest=c("GN=XXX","GN=XXX","GN=XXX"))
PBS Pro email:
PBS Job Id: 9795419.pbs
Job Name: G1_largefile
Execution terminated
Exit_status=271
resources_used.cpupercent=116
resources_used.cput=84:00:11
resources_used.mem=1768205664kb
resources_used.ncpus=25
resources_used.vmem=61506612168kb
resources_used.walltime=72:01:50
Hi Xiwei,
According to the progress messages, you may have finished the PCA analysis. Can you find any new figures in the specific sample ID folder?
In the parameter settings, please change Thread to 8 or less. The segmentation repeatedly uses parallel processing, and we use socket-mode parallelism here, so the RAM is duplicated each time parallel processing is invoked. Meanwhile, I think you need to configure the preprocess as below:
preprocess=list(force_preprocess=FALSE,
                use_preprocessRDS=TRUE,
                smoothSignal=list(method="gaussian"),
                reduceBaseline=list(method="locmin"),
                peakPick=list(method="Default"),
                peakAlign=list(tolerance=5, units="ppm"),
                peakFilter=list(freq.min=0.15),
                normalize=list(method=c("rms","tic","reference")[1],mz=1))
Set peakAlign$tolerance=5 to absorb potential m/z drift within your dataset (I'm not sure whether that was a problem in your data, but since the peak alignment yields ~60K features, I believe some features are not well aligned). I've also added a new argument, peakFilter=list(freq.min=0.15); it is only applied to the data used for statistical analysis. Since you want to segment the images into 2 parts, this setting ensures that every feature considered in the segmentation is found in at least 85% of the pixels of the whole image, which makes the segmentation more effective and fast. You may need to update the package in your Singularity image again.
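The frequency filter described above can be illustrated in plain R. This is a hypothetical sketch of the idea, not HiTMaP's implementation: feature columns detected in fewer than the required fraction of pixels are dropped before segmentation.

```r
# Hypothetical sketch of frequency-based feature filtering (not HiTMaP's code).
# `imdata` is a pixels-by-features intensity matrix; features detected in
# fewer than `freq.min` of the pixels are removed before segmentation.
filter_features <- function(imdata, freq.min = 0.15) {
  detected_frac <- colMeans(imdata > 0)
  imdata[, detected_frac >= freq.min, drop = FALSE]
}

m <- cbind(common = c(1, 2, 3, 4), rare = c(0, 0, 0, 5))
ncol(filter_features(m, freq.min = 0.5))  # only the "common" feature survives
```

Shrinking the feature matrix this way reduces both the RAM footprint and the cost of the clustering step, at the price of discarding sparsely detected peaks.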
Hi George,
I am Xiwei. Thank you for your help; I have successfully run the program on one of my samples. However, it seems to struggle with another sample, which is a bit larger (the ibd file is about 1.6 GB; I used a 20 um pixel size). I tried 700 GB of memory with only 1 thread, but it still failed because the software reached the memory limit.
I noticed that you have updated another peak-picking method. I am wondering whether it would help solve this problem, and whether it is available in Singularity now? Or is there any other way to solve it? Maybe running every step in a separate job file would be better?
Many thanks, Xiwei