Irrationone / cellassign

Automated, probabilistic assignment of cell types in scRNA-seq data

Reproducing Paper #69

Open chanwkimlab opened 4 years ago

chanwkimlab commented 4 years ago

Hello! First, I'd like to thank you for publishing this useful cell assignment framework.

While porting the current r-TensorFlow implementation to PyTorch for my personal use, a few questions arose.


1. Inconsistency between the implementation and statements in the paper.

The methods section of the publication says that

a is initialized to 0, and delta, beta, a, and b are optimized during the M-step.

However, in the current implementation a is initialized to 1 (the unconstrained variable is initialized to zeros and then exponentiated, so a = exp(0) = 1), and b is a fixed vector, as in the following code.

# Spline variables
a <- tf$exp(tf$Variable(tf$zeros(shape = B, dtype = tf$float64)))  # a = exp(0) = 1 at initialization; trainable
b <- tf$exp(tf$constant(rep(-log(b_init), B), dtype = tf$float64)) # b = exp(-log(b_init)) = 1/b_init; a fixed constant, not trained

(https://github.com/Irrationone/cellassign/blob/master/R/inference-tensorflow.R#L114) Is this inconsistency negligible for the performance of the assignment?

2. The speed of CellAssign

I ran CellAssign on the HumanLiver dataset (8,444 cells × 63 markers, 12 categories), which was used in the publication, on my GPU workstation (6× RTX 2080) with the following default settings. It took 20 minutes to complete.

cellassign(exprs_obj = t(exp_data[rownames(marker_mat), ]),  # cells x markers expression matrix
           marker_gene_info = marker_mat,
           s = cell_size_factor_cluster,
           X = cbind(1, batch_onehot[, 1:4]))                # intercept + batch covariates

However, the strange thing is that when I check GPU utilization with the nvidia-smi command, the CellAssign process occupies 111 MB on every one of my GPUs while GPU utilization stays at 0; CPU utilization is high instead. I cannot see any valid reason for CellAssign to hold a small amount of memory on all of the GPUs.
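My understanding (an assumption on my part) is that TensorFlow claims a context on every visible GPU by default, which would explain the small allocation on all six devices. A minimal sketch of a workaround, assuming the standard CUDA_VISIBLE_DEVICES environment variable is respected when set before TensorFlow initializes:

Sys.setenv(CUDA_VISIBLE_DEVICES = "0")   # assumption: expose only the first GPU to TensorFlow
library(tensorflow)
tf$config$list_physical_devices("GPU")   # should now report a single device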

I don't think this has anything to do with my GPU hardware, since I regularly use this workstation with PyTorch without any problems. I guessed that the possible causes of this behavior would be (a quick diagnostic sketch follows the list):

  1. a problem with my TensorFlow installation (CPU vs. GPU version),
  2. a problem with the r-tensorflow GPU configuration, or
  3. inefficiency of r-TensorFlow itself.
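To separate causes 1 and 2, a minimal check could look like the following sketch (tf$test$is_built_with_cuda and tf$config$list_physical_devices are standard TensorFlow 2 APIs; whether they apply depends on the installed version):

library(tensorflow)
tf$test$is_built_with_cuda()             # FALSE => a CPU-only build (cause 1)
tf$config$list_physical_devices("GPU")   # empty list => no GPU visible to TensorFlow (cause 2)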

To address possible cause 1, I reinstalled TensorFlow with the following command:

library(tensorflow)
install_tensorflow(version = "2.1-gpu")

However, this made the run even slower (2-3x) than my earlier trials. I cannot remember how I installed the r-tensorflow that I used for those earlier trials.
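To see where the ops actually land after a reinstall, I could enable device-placement logging; a sketch, assuming TensorFlow 2 (tf$debugging$set_log_device_placement is a standard TF2 API and must be called before any ops are created):

library(tensorflow)
tf$debugging$set_log_device_placement(TRUE)  # logs the device (/CPU:0 or /GPU:0) each op runs on
# If every op reports /CPU:0, the GPU build is not actually being used.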

I feel that it is quite tricky to install r-tensorflow and configure it correctly.

Is this run time (20 minutes) usual for an input of the HumanLiver dataset's size?


Could you give me some advice? Thank you in advance for your help.

Best regards, Chanwoo Kim

Irrationone commented 4 years ago

Hi Chanwoo,

Thanks for your interest:

  1. This is a mistake in the methods section -- b has always been fixed for the paper. In practice, this formulation of the dispersion term in the negative binomial shouldn't make a substantial difference to cell type assignments. One of the very old development versions of CellAssign specified free dispersions and gave similar results on the solid tumour data. One of the major advantages of this formulation is speed. Note that a cannot be allowed to be zero in this formulation, as the entire dispersion term would then equal 0 (see the sketch after this list).

  2. This seems like a tensorflow issue -- does your re-install with tensorflow-gpu result in tensorflow actually using the GPU (not just GPU memory)? Did you uninstall the CPU version of tensorflow before installing the GPU version?
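For concreteness on point 1, a minimal sketch of the spline dispersion (my paraphrase of the model, not the package's exact code): phi is a radial-basis-function spline in the mean, so if every coefficient in a were 0, phi would vanish identically.

# Hedged sketch: phi(mu) = sum_b a_b * exp(-b_b * (mu - m_b)^2).
# a is kept positive via a = exp(unconstrained); initializing the
# unconstrained variable at zeros gives a_b = 1, while a_b = 0 for all b
# would make phi identically 0 (a degenerate negative binomial).
rbf_dispersion <- function(mu, a, b, basis_means) {
  sapply(mu, function(m) sum(a * exp(-b * (m - basis_means)^2)))
}

# Toy usage with hypothetical values:
B <- 10
rbf_dispersion(mu = c(0.5, 2, 8),
               a = rep(1, B),                      # exp(0) initialization
               b = rep(1, B),                      # fixed scales
               basis_means = seq(0, 10, length.out = B))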

Allen