MASHUOA / HiTMaP

An R package: a High-resolution Informatics Toolbox for MALDI-Imaging Proteomics
GNU General Public License v3.0

Large MSI file compatibility issue #2

Closed xiweifan closed 7 months ago

xiweifan commented 3 years ago

Hi George,

I am Xiwei. Thank you for your help; I have successfully run the program on one of my samples. However, it struggles with another sample because the file is rather large (the .ibd file is about 1.6 GB, acquired at a 20 um pixel size). I tried 700 GB of memory with only 1 thread, but the run still failed because the software hit the memory limit.

I noticed that you have updated another peak picking method. Would it help with this problem? Is it available in the Singularity image yet? Or is there another way around it, perhaps running each step in a separate job file?

Many thanks, Xiwei

MASHUOA commented 3 years ago

Hi Xiwei, yes, I've rewritten the peak picking method. A "Default" method is now available that bins the data using a native peak picking function. The Cardinal peak picking function keeps the m/z values in a sparse data frame until they are aligned, but this is not very RAM friendly, because we perform the peak alignment within the selected region (rather than across the whole image) to reduce misalignment. Can you confirm whether it is the segmentation step that requires the extra-large RAM? If so, you will need to force the segmentation to use only the top features for the pixel clustering. I will incorporate this function shortly and let you know when it is finished.
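For later readers, a minimal sketch of how the native method could be selected, assuming the same preprocess list interface used elsewhere in this thread; all the other entries are placeholders taken from the thread, not tuned recommendations:

    # Sketch only: selecting the native "Default" peak picking method through the
    # preprocess list used elsewhere in this thread; other settings are placeholders.
    preprocess_default <- list(force_preprocess=TRUE,
                               use_preprocessRDS=TRUE,
                               smoothSignal=list(method="Disable"),
                               reduceBaseline=list(method="Disable"),
                               peakPick=list(method="Default"),
                               peakAlign=list(tolerance=5, units="ppm"),
                               normalize=list(method="rms", mz=1))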

xiweifan commented 3 years ago

Hi George,

Thank you so much for your update! Indeed, the problem was caused by Cardinal and BiocParallel. The RAM was not enough even when I ran the commands below on their own (1.6 GB .ibd file, 1.2 TB RAM and 1 thread); I believe this code only performs the data pre-processing and proteomics annotation:

datafile=c("XXX/XXX.imzML") wd="~/expdata/" library(HiTMaP) imaging_identification(Thread=1,datafile=paste0(wd,datafile),Digestion_site="trypsin", Fastadatabase="uniprot-proteome_up000005640 human.fasta",output_candidatelist=T,use_previous_candidates=T, preprocess=list(force_preprocess=TRUE, use_preprocessRDS=TRUE, smoothSignal=list(method="Disable"), reduceBaseline=list(method="Disable"), peakPick=list(method="adaptive"), peakAlign=list(tolerance=5, units="ppm"), normalize=list(method=c("rms","tic","reference")[1],mz=1)), spectra_segments_per_file=2,ppm=5,FDR_cutoff = 0.05,IMS_analysis=T, plot_cluster_image_grid=F)

However, as a user, I found several ways to reduce the file size, such as exporting the imzML from the Bruker software as a peak list rather than as whole spectra. Limiting the area of the region of interest also significantly reduces the RAM and processing load. I have successfully run several examples so far: a 300 MB .ibd file required 346 GB RAM, and a 76 MB .ibd file required 213 GB RAM. I hope this helps other users plan their projects.

As an HPC user, I would appreciate it if the updated features could be included in the Docker image so that we can run the program through a Singularity image. Thank you so much, and congratulations again on your impressive work!

Thanks, Xiwei

MASHUOA commented 3 years ago

Hi Xiwei, I've edited the title of this issue to make it clearer for other readers. I've made changes to improve the statistical analysis performance on large MSI files: before the PCA/SKM/SCC segmentation analysis, a subset of representative features is now generated through alignment and filtering, covering the major variance at a much smaller size. I've built the Docker image with the latest packages and tagged it mashuoa/hitmap:largefile. Could you please let me know whether that helps with the memory usage and lets you get through the pipeline?
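Purely as an illustration of the feature-subsetting idea described above (not the package's actual implementation): keep only the highest-variance m/z features of a pixels-by-features intensity matrix so the clustering runs on a much smaller object.

    # Illustration only, not HiTMaP code: reduce a pixels-by-features intensity
    # matrix to its highest-variance features before k-means segmentation.
    set.seed(1)
    intensity <- matrix(rexp(5000 * 200), nrow = 5000, ncol = 200)  # 5000 pixels, 200 m/z bins
    feature_var <- apply(intensity, 2, var)                         # variance of each feature
    top_features <- order(feature_var, decreasing = TRUE)[1:50]     # keep the top 50
    segments <- kmeans(intensity[, top_features], centers = 2)      # two spatial segments
    table(segments$cluster)                                         # pixels per segment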

xiweifan commented 3 years ago

Hi George,

Thank you so much for your timely update! I will have a test ASAP.

Meanwhile, I am wondering whether the image rendering function also has a small bug: I can only produce individual cluster ion images, and I cannot open Rplots.pdf because the software says there is nothing inside.

Below is the code I used for image rendering on Windows, since this step doesn't need a large amount of RAM; the previous results already look perfect:

    library(HiTMaP)
    datafile=c("XXX/XXX.imzML")
    wd="C:/test/expdata/"
    imaging_identification(datafile=paste0(wd,datafile),
                           Digestion_site="trypsin",
                           Fastadatabase="uniprot-proteome_up000005640 human.fasta",
                           preprocess=list(force_preprocess=FALSE),
                           spectra_segments_per_file=9,
                           use_previous_candidates=T,
                           ppm=10,
                           FDR_cutoff=0.05,
                           IMS_analysis=F,
                           mzrange=c(500,4000),
                           plot_cluster_image_grid=T,
                           img_brightness=250,
                           plot_cluster_image_overwrite=T,
                           cluster_rds_path="/XXX/preprocessed_imdata.RDS",
                           pixel_size_um=20,
                           Plot_score_abs_cutoff=-0.1,
                           remove_score_outlier=T,
                           Protein_desc_of_interest=c("GN=XXX","GN=XXX","GN=XXX"))

    library(HiTMaP)
    datafile=c("XXX/XXX.imzML")
    wd="C:/test/expdata/"

    library(magick)
    p_cluster1<-image_read(paste0(wd,"/XXX","/Summary folder/cluster Ion images/unique/XXXX_cluster_imaging.png"))
    print(p_cluster1)

    # A tibble: 1 x 5
      format width height colorspace matte filesize density
    1 PNG     1980   1308 sRGB       TRUE    302087 118x118

image_write(p_cluster1, path = "C:/test/expdata/XXX/Summary folder/cluster Ion images/unique/p_cluster1.png", format = "png")

Maybe I missed a step in the instructions for producing the illustrated plot? Could you show me how you summarise the protein image from the peptide cluster images? I am using the nat_comms version now; perhaps the latest version handles this more smoothly?

Thanks, Xiwei

xiweifan commented 3 years ago

I have also tried this function on the HPC. The code is almost identical to the Windows version; I only changed the paths to suit the different system. The error log is here:

    384 Cores detected, 4 threads will be used for computing
    1 files were selected and will be used for Searching
    uniprot-proteome_up000005640 human.fasta was selected as database.
    Candidates will be generated through Proteomics mode
    Found enzyme: trypsin
    Found rule: ""
    Found customized rule: ""
    Candidate list has been loaded.
    Protein feature summary...
    Protein feature summary...Done.
    Peptide feature summary...
    Peptide feature summary...Done.
    cluster image rendering...
    Insufficient mz features to find the outlier
    1 Protein(s) found with annotations of interest: GN=XXX
    1 Protein(s) found with annotations of interest: GN=XXX
    1 Protein(s) found with annotations of interest: GN=XXX
    cluster imdata loaded.
    GPL Ghostscript 9.53.3: Could not open temporary file /data1/pbs.9781020.pbs/gs_deQwSh
    GPL Ghostscript 9.53.3: Could not open the scratch file /data1/pbs.9781020.pbs/gs_deQwSh.
    Unable to open the initial device, quitting.
    Error in magick_image_readpath(path, density, depth, strip, defines) :
      R: unable to open image '34554_header.png': No such file or directory @ error/blob.c/OpenBlob/2924
    Calls: suppressMessages -> withCallingHandlers
    Execution halted
    Cluster image rendering Done: (Protein 1 annotation information)
    GPL Ghostscript 9.53.3: **** Could not open temporary file /data1/pbs.9781020.pbs/gs_EBu9er
    GPL Ghostscript 9.53.3: Could not open the scratch file /data1/pbs.9781020.pbs/gs_EBu9er.
    **** Unable to open the initial device, quitting.
    Error in dev.off() : ignoring SIGPIPE signal
    Calls: suppressMessages -> withCallingHandlers
    Execution halted
    Cluster image rendering Done: (Protein 2 annotation information)
    GPL Ghostscript 9.53.3: **** Could not open temporary file /data1/pbs.9781020.pbs/gs_StATmQ
    GPL Ghostscript 9.53.3: Could not open the scratch file /data1/pbs.9781020.pbs/gs_StATmQ.
    **** Unable to open the initial device, quitting.
    Error in magick_image_readpath(path, density, depth, strip, defines) :
      R: unable to open image '42044_header.png': No such file or directory @ error/blob.c/OpenBlob/2924
    Calls: suppressMessages -> withCallingHandlers
    Execution halted
    Cluster image rendering Done: (Protein 3 annotation information)
    Workflow done.

MASHUOA commented 3 years ago

Hi Xiwei, thanks for the information. Rplots.pdf is a temporary file produced by magick when writing to a png or bitmap device. The files with the suffix cluster_imaging.png are the protein cluster images. For the HPC run, have you completed a cluster image rendering on the example data file? Perhaps you should open an R terminal on the HPC and test whether you can start a png()/bitmap() device in tempdir() and in the working directory.
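A minimal test along those lines could look like this (the file names are placeholders, and bitmap() additionally requires a working Ghostscript installation):

    # Minimal graphics-device test for an HPC R session; file names are placeholders.
    setwd("~/expdata/")
    png(file.path(tempdir(), "device_test.png")); plot(1:10); dev.off()   # native png device in tempdir()
    png("device_test_wd.png"); plot(1:10); dev.off()                      # same test in the working directory
    # bitmap() goes through Ghostscript, so this also exercises the gs setup
    bitmap(file.path(tempdir(), "device_test_gs.png"), type = "png16m"); plot(1:10); dev.off()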

xiweifan commented 3 years ago

Hi George,

I just tested the large sample. I am wondering whether the "Default" peak picking option is still unavailable, or whether the enhancement you mentioned earlier is already enough for the large data file?

Meanwhile, in the HPC R terminal I can use png() in tempdir() and in the working directory, but not bitmap().

Kind regards, Xiwei

xiweifan commented 3 years ago

Here is the session from the HPC R terminal:

[1] "/tmp/Rtmp831P9W"

png() bitmap() Error in bitmap() : 'file' is missing with no default wd [1] "~/expdata/" png() bitmap() Error in bitmap() : 'file' is missing with no default tempdir() [1] "/tmp/Rtmp831P9W" bitmap() Error in bitmap() : 'file' is missing with no default

MASHUOA commented 3 years ago

Hi Xiwei, thanks for the input. In that case you should try mashuoa/hitmap:largefile. In the latest package I've made png the default device format instead of bitmap (which relies on the Ghostscript library and may introduce system path problems). Both "Default" and "adaptive" are compatible with the large-file PCA/SKM/SCC analysis.

xiweifan commented 3 years ago

Hi George,

I am sure you have tested it yourself; however, the "Default" option does not seem to work on the HPC using hitmap_largefile.sif.

The relevant code is here; the other parameters were tested previously:

#==============The pre-processing parameters

           preprocess=list(force_preprocess=TRUE,
                           use_preprocessRDS=TRUE,
                           smoothSignal=list(method="Disable"),
                           reduceBaseline=list(method="Disable"),
                           peakPick=list(method="Default"),
                           peakAlign=list(tolerance=5, units="ppm"),
                           normalize=list(method=c("rms","tic","reference")[1],mz=1)),

Below is the error log for hitmap_largefile.sif:

    384 Cores detected, 50 threads will be used for computing
    1 files were selected and will be used for Searching
    uniprot-proteome_up000005640 human.fasta was selected as database.
    Candidates will be generated through Proteomics mode
    Found enzyme: trypsin
    Found rule: ""
    Found customized rule: ""
    Testing fasta sequances for degestion site: (KR)|((?<=W)K(?=P))|((?<=M)R(?=P))
    Generated 78120 Proteins in total.
    Computing exact masses...
    Generating peptide formula...
    Generating peptide formula with adducts: M+H
    Calculating peptide mz with adducts: M+H
    Candidate list has been exported.
    uniprot-proteome_up000005640 human.fasta was selected as database
    Spectrum intensity threshold: 0.50%
    mz tolerance: 5 ppm
    Segmentation method: spatialKMeans
    Manual segmentation def file: None
    Bypass spectrum generation: FALSE
    Found rotation info
    Preparing image data for statistical analysis: XXX.imzML
    Preparing image data for statistical analysis: XXX.imzML
    Error in h(simpleError(msg, call)) :
      error in evaluating the argument 'object' in selecting a method for function 'process':
      object 'Default' of mode 'function' was not found
    Calls: imaging_identification ... peakPick.method2 -> match.fun -> get -> .handleSimpleError -> h
    Execution halted

Many thanks, Xiwei

MASHUOA commented 3 years ago

Hi Xiwei, can you tell me whether you are using a tolerance greater than 25 ppm? If so, the pre-processing pipeline will go through the TOF data analysis procedure, where the Default method is not available. Also, please use a thread number <= 3 for large data files.

You could try the "adaptive" peak picking method on mid-resolution MS data (FWHM > 25 ppm, resolution < 30000); the Cardinal package does a great job on such data. Please also use smoothSignal (default "gaussian"; you can simply remove this argument from the call) and reduceBaseline (default "locmin") to process the raw data before peak picking, as sketched below.
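A minimal sketch of these mid-resolution settings, assuming the same preprocess list interface used earlier in the thread; the tolerance value and the other entries are placeholders rather than tuned recommendations:

    # Sketch of the mid-resolution settings suggested above (adaptive peak picking
    # plus the default smoothing and baseline reduction); values are placeholders.
    preprocess_midres <- list(force_preprocess=TRUE,
                              use_preprocessRDS=TRUE,
                              smoothSignal=list(method="gaussian"),
                              reduceBaseline=list(method="locmin"),
                              peakPick=list(method="adaptive"),
                              peakAlign=list(tolerance=25, units="ppm"),
                              normalize=list(method="rms", mz=1))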

BTW: complete spectra are preferred in proteomics IMS studies, as they preserve the original m/z information until the tissue-wide/regional m/z alignment, which enhances the annotation sensitivity and accuracy.

Cheers, George

xiweifan commented 3 years ago

Hi George,

Below is the R file for the large sample; I changed the suffix to .txt so that it could be attached here. Still, the program reports the same problem...

XX_largefile.txt

Error log:

    384 Cores detected, 1 threads will be used for computing
    1 files were selected and will be used for Searching
    uniprot-proteome_up000005640 human.fasta was selected as database.
    Candidates will be generated through Proteomics mode
    Found enzyme: trypsin
    Found rule: ""
    Found customized rule: ""
    Testing fasta sequances for degestion site: (KR)|((?<=W)K(?=P))|((?<=M)R(?=P))
    Generated 78120 Proteins in total.
    Computing exact masses...
    Generating peptide formula...
    Generating peptide formula with adducts: M+H
    Calculating peptide mz with adducts: M+H
    Candidate list has been exported.
    uniprot-proteome_up000005640 human.fasta was selected as database
    Spectrum intensity threshold: 0.50%
    mz tolerance: 5 ppm
    Segmentation method: spatialKMeans
    Manual segmentation def file: None
    Bypass spectrum generation: FALSE
    Found rotation info
    Preparing image data for statistical analysis: g1 cartilage-total ion count.imzML
    Preparing image data for statistical analysis: g1 cartilage-total ion count.imzML
    Error in h(simpleError(msg, call)) :
      error in evaluating the argument 'object' in selecting a method for function 'process':
      object 'default' of mode 'function' was not found
    Calls: imaging_identification ... peakPick.method2 -> match.fun -> get -> .handleSimpleError -> h

I am wondering if there is anything wrong with my code?

Thanks, Xiwei

MASHUOA commented 3 years ago

Hi Xiwei, sorry for my slow reply. I've checked the Docker image and my local build, and they both work at the moment.

I think there are two possible reasons for this error:

  1. Singularity doesn't rebuild the image if no new layer has been added, so the updated HiTMaP package hasn't been retrieved from GitHub and installed. Could you please update the package from the Singularity R session with the code below:

         install.packages("remotes")
         library(remotes)
         Sys.setenv("R_REMOTES_NO_ERRORS_FROM_WARNINGS" = "true")
         remotes::install_github("MASHUOA/HiTMaP", force=T)

  2. Within Singularity, the parameter transfer may automatically convert the letters to lower case; I noticed that your config file uses "Default" while the error message says "default". I've added a regex to match this option in a case-insensitive way (see the illustrative sketch below). After the update, you may want to save the current Singularity container to a new image for later use. Hope that helps you get through the analysis. Best, George
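Purely to illustrate the kind of case-insensitive matching described in point 2 (not necessarily the exact code added to the package):

    # Illustration only: resolve the peak-picking option name regardless of case,
    # so "Default", "default" and "DEFAULT" all select the same method.
    resolve_peakpick_method <- function(method) {
      if (grepl("^default$", method, ignore.case = TRUE)) "Default" else method
    }
    resolve_peakpick_method("default")   # returns "Default"
    resolve_peakpick_method("adaptive")  # returns "adaptive" unchanged
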
xiweifan commented 3 years ago

Hi George,

Thank you for your kind reply! I gave it a try this afternoon, but the job is still in the queue; people just love submitting their jobs on Friday, LOL. I will submit a new PBS job and get back to you once it is done.

Many thanks, Xiwei

xiweifan commented 3 years ago

Hi George,

How are you doing?

Sorry for the late response. I just ran a 3-day job on the HPC (50 threads, 1900 GB RAM), but it failed because it exceeded the walltime limit. I am wondering why this happens?

Error log:

    384 Cores detected, 50 threads will be used for computing
    1 files were selected and will be used for Searching
    uniprot-proteome_up000005640 human.fasta was selected as database.
    Candidates will be generated through Proteomics mode
    Found enzyme: trypsin
    Found rule: ""
    Found customized rule: ""
    Testing fasta sequances for degestion site: (KR)|((?<=W)K(?=P))|((?<=M)R(?=P))
    Generated 78120 Proteins in total.
    Computing exact masses...
    Generating peptide formula...
    Generating peptide formula with adducts: M+H
    Calculating peptide mz with adducts: M+H
    Candidate list has been exported.
    uniprot-proteome_up000005640 human.fasta was selected as database
    Spectrum intensity threshold: 0.50%
    mz tolerance: 5 ppm
    Segmentation method: spatialKMeans
    Manual segmentation def file: None
    Bypass spectrum generation: FALSE
    Found rotation info
    Loading raw image data for statistical analysis: XXX.imzML
    Using image data: g1 cartilage-total ion count ID/preprocessed_imdata.RDS
    Segmentation in progress...
    preprocess$peakAlign$tolerance missing, use default tolerance in ppm 2.5
    nrows changed from 871485 to 64354
    =>> PBS: job killed: walltime 259300 exceeded limit 259200

progress log: |======================================================================| 100%

|======================================================================| 100%

|======================================================================| 100%

|                                                                      |   0%

PBS Job 9795419.pbs

R code:

#==============Set the project folder that contains the imzML, .ibd and fasta files
wd=paste0(file.path(path.package(package="HiTMaP")),"/data/")

#==============Set a series of imzML files to be processed
datafile=c("G1/XXX.imzML")
wd="~/expdata/"

imaging_identification(
#==============Choose the imzML raw data file(s) to process; make sure the fasta file is in the same folder
           datafile=paste0(wd,datafile),
           threshold=0.005,
           Thread=50,
           ppm=5,
           FDR_cutoff = 0.05,
           pixel_size_um = 20,
#==============Specify the digestion enzyme specificity
           Digestion_site="trypsin",
#==============Specify the range of missed cleavages
           missedCleavages=0:1,
#==============Set the target fasta file
           Fastadatabase="uniprot-proteome_up000005640 human.fasta",
#==============Set the possible adducts and fixed modifications
           adducts=c("M+H"),
           Modifications=list(fixed=NULL,fixmod_position=NULL,variable=NULL,varmod_position=NULL),
#==============The decoy mode: could be one of "adducts", "elements" or "isotope"
           Decoy_mode = "isotope",
           use_previous_candidates=F,
           output_candidatelist=T,
#==============The pre-processing parameters (I have a preprocessing file from a previous run, so I chose force_preprocess=FALSE)
           preprocess=list(force_preprocess=FALSE,
                           use_preprocessRDS=TRUE,
                           smoothSignal=list(method="gaussian"),
                           reduceBaseline=list(method="locmin"),
                           peakPick=list(method="Default"),
                           peakAlign=list(tolerance=2.5, units="ppm"),
                           normalize=list(method=c("rms","tic","reference")[1],mz=1)),
#==============Set the parameters for image segmentation
           spectra_segments_per_file=2,
           Segmentation="spatialKMeans",
           Smooth_range=1,
           Virtual_segmentation=FALSE,
           Virtual_segmentation_rankfile=NULL,
#==============Set the score method for hi-resolution isotopic pattern matching
           score_method="SQRTP",
           peptide_ID_filter=2,
#==============Summarise the protein and peptide features across the project; the results can be found in the summary folder
           Protein_feature_summary=TRUE,
           Peptide_feature_summary=TRUE,
           Region_feature_summary=TRUE,
#==============The parameters for cluster imaging. Specify the annotations of interest; the program will perform a case-insensitive search on the result file, extract the protein(s) of interest and plot them in the cluster imaging mode
           plot_cluster_image_grid=TRUE,
           ClusterID_colname="Protein",
           componentID_colname="Peptide",
           Protein_desc_of_interest=c("GN=XXX","GN=XXX","GN=XXX"),
           Rotate_IMG=NULL
           )

library(HiTMaP) datafile=c("G1/XXX.imzML") wd="~/expdata/" imaging_identification(datafile=paste0(wd,datafile),Digestion_site="trypsin", Fastadatabase="uniprot-proteome_up000005640 human.fasta", preprocess=list(force_preprocess=FALSE), spectra_segments_per_file=9,use_previous_candidates=T,ppm=10,FDR_cutoff = 0.05,IMS_analysis=F, mzrange = c(500,4000),plot_cluster_image_grid=T, img_brightness=250, plot_cluster_image_overwrite=T, cluster_rds_path = "/XXX ID/preprocessed_imdata.RDS", pixel_size_um = 20, Plot_score_abs_cutoff=-0.1, remove_score_outlier=T, export_Header_table=T, Protein_desc_of_interest=c("GN=XXX","GN=XXX","GN=XXX"))

PBS Pro email:

    PBS Job Id: 9795419.pbs
    Job Name:   G1_largefile
    Execution terminated
    Exit_status=271
    resources_used.cpupercent=116
    resources_used.cput=84:00:11
    resources_used.mem=1768205664kb
    resources_used.ncpus=25
    resources_used.vmem=61506612168kb
    resources_used.walltime=72:01:50

MASHUOA commented 3 years ago

Hi Xiwei, according to the progress messages, you may already have finished the PCA analysis. Can you find any new figures in the specific sample ID folder? In the parameter settings, please change Thread to 8 or less. The segmentation repeatedly uses parallel processing, and we are using socket mode here, so the memory is duplicated each time parallel processing is called. Meanwhile, I think you need to configure the preprocess argument as below:

           preprocess=list(force_preprocess=FALSE,
                           use_preprocessRDS=TRUE,
                           smoothSignal=list(method="gaussian"),
                           reduceBaseline=list(method="locmin"),
                           peakPick=list(method="Default"),
                           peakAlign=list(tolerance=5, units="ppm"),
                           peakFilter=list(freq.min=0.15),
                           normalize=list(method=c("rms","tic","reference")[1],mz=1))

Set peakAlign$tolerance=5 to absorb potential m/z drift within your dataset (I'm not sure whether that was a problem in your data, but since the peak alignment yields ~60K features, I believe some features are not well aligned). I've also added a new argument, peakFilter=list(freq.min=0.15), which is only applied to the data used for the statistical analysis. I see you want to segment the image into 2 parts, and this setting ensures that every feature considered in the segmentation is found in 85% of the pixels of the whole image, which makes the segmentation more effective and fast. You may need to update the package in your Singularity image again.