Irrationone / cellassign

Automated, probabilistic assignment of cell types in scRNA-seq data
Other
191 stars 79 forks source link

runtime issue #41

Open chenlingantelope opened 4 years ago

chenlingantelope commented 4 years ago

I am running cellassign on a dataset I tried to use your method on a dataset with 19684 cells, 316 marker genes, and 58 cell types (in theory I have more marker genes too). It has been running for a whole day without signs of finishing, so I stopped the process in case I am doing something wrong.

I got the following warning:

2019-09-12 17:30:30.736025: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.

and my session info is the following:

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /data/yosef2/users/chenling/miniconda3/envs/r_env/lib/libmkl_rt.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] SingleCellExperiment_1.4.1  SummarizedExperiment_1.12.0
 [3] DelayedArray_0.8.0          BiocParallel_1.16.6        
 [5] matrixStats_0.55.0          Biobase_2.42.0             
 [7] GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
 [9] IRanges_2.16.0              S4Vectors_0.20.1           
[11] BiocGenerics_0.28.0         cellassign_0.99.2          
[13] Matrix_1.2-17              

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2             whisker_0.4            XVector_0.22.0        
 [4] magrittr_1.5           zlibbioc_1.28.0        lattice_0.20-38       
 [7] tools_3.5.1            grid_3.5.1             tfruns_1.4            
[10] tensorflow_1.14.0      GenomeInfoDbData_1.2.0 base64enc_0.1-3       
[13] bitops_1.0-6           RCurl_1.95-4.12        glue_1.3.1            
[16] compiler_3.5.1         reticulate_1.13        jsonlite_1.6          

And my system specs:

Linux s128 4.13.0-45-generic #50~16.04.1-Ubuntu SMP Wed May 30 11:18:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

kieranrcampbell commented 4 years ago

Hi @chenlingantelope thanks for bringing this onto github.

We have runtime comparisons in the supplement that go up to 80k cells, but never for this many cell types.

Could you try running again, but we'll be a little less stringent with the number of iterations, so try with options max_iter_adam = 500 and max_iter_em = 10

and we can see how far it gets? If you could plot ca$lls after this (where ca has been returned by cellassign, lls slot is the log marginal likelihood), and we can see how far it is from convergence

Thanks

chenlingantelope commented 4 years ago

Hi Kieran, sorry it took me so long to get back to you. I set the job running and forgot about it :/ and forgot to track the runtime. But it did finish and gave me these results. I'll update more when I look at the results and also actually tracks the time.

OMP: Info #250: KMP_AFFINITY: pid 50717 tid 50717 thread 0 bound to OS proc set 0
2019-09-17 18:17:02.934901: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
2019-09-17 18:17:03.207919: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
500     L old: -75231210.8058247; L new: -6488883.9807309; Difference (%): 0.913747447220024
500     L old: -6488883.9807309; L new: -3568404.58913792; Difference (%): 0.450074219274917
500     L old: -3568404.58913792; L new: -2886661.93381202; Difference (%): 0.191049708152796
500     L old: -2886661.93381202; L new: -2691796.21185211; Difference (%): 0.0675055570856471
500     L old: -2691796.21185211; L new: -2625265.01524139; Difference (%): 0.0247162828737863
500     L old: -2625265.01524139; L new: -2592510.94299985; Difference (%): 0.0124764822032732
500     L old: -2592510.94299985; L new: -2572123.04324677; Difference (%): 0.0078641518594664
500     L old: -2572123.04324677; L new: -2559589.19324348; Difference (%): 0.00487295894968909
260     L old: -2559589.19324348; L new: -2555384.44023848; Difference (%): 0.00164274525619225
60      L old: -2555384.44023848; L new: -2554551.41099752; Difference (%): 0.000325989791533641
Warning message:
In cellassign(exprs_obj = raw_counts[nonzero_cells, ismarker], marker_gene_info = sig[ismarker,  :
  You have specified 316 input genes. Are you sure these are just your markers? Only the marker genes should be used as input

Would you consider this as having converged? and how many markers do you typically run this analysis with? Thanks!

kieranrcampbell commented 4 years ago

It looks approximately converged, but it would be best to plot the log likelihood as a function of iterations (I think you can do with just plot(ca$lls) ) and paste here to assess?

You can also plot the specificity of marker expression for each cell type, which in scater would be something like plotExpression(sce, x = 'cellassign_cell_type', features = your_marker_genes)

We typically run with 50- ish markers, but there's no reason it shouldn't run with more. You even see the warning You have specified 316 input genes. Are you sure these are just your markers? Only the marker genes should be used as input where we assume if people input > 100 marker genes that they've accidentally entered a transcriptome wide dataset. But no reason to expect it won't work with this many markers. Might just be careful not to include too many cell types you don't expect to be in the dataset!

olechnwin commented 4 years ago

Hi @kieranrcampbell Can you please explain why you mentioned "Might just be careful not to include too many cell types you don't expect to be in the dataset!". I was actually thinking about using all the cell markers in http://biocc.hrbmu.edu.cn/CellMarker/help.jsp since the paper mentioned it maintained high accuracy when too many cell types are specified. Thank you in advance.

chenlingantelope commented 4 years ago

Hi Kieran,

I finally reran the process (on interactive) and it took in total 42.8955 hours to finish with the 58 cell types and 316 genes. This was done on CPU though, because I had some installing issues on our GPU server. What is the expected speedup with GPU? Do you have specific instructions on making sure that the R tensorflow package finds the corresponding python tensorflow that uses GPU?

Both the log likelihood and the marker gene expression file looks good (I had to binarize the celltypes because with 58 celltypes the plots looked like a mess) There are still a lot of cells expressing the markers highly but are not classified as the particular celltype, but the distribution looks different enough.

What do you suggest as the next step?

Club.expression.pdf loglikelihood.pdf

kieranrcampbell commented 4 years ago

@olechnwin it will work best (in both time and accuracy) when the number of cell types specified is roughly equivalent to what's actually in the data. It's okay to have a few cell types in the input to cellassign that aren't in the data, but it would make most sense to try and only include cell types you reasonably expect (e.g. if you have a breast primary tumour sample, don't include retinal progenitor cell types, etc...). Does that make sense?

kieranrcampbell commented 4 years ago

@chenlingantelope that all looks good and your likelihood looks converged. 42 hours still seems ridiculously long, we'll have a look into speeding it up. GPU speedup should be 10-100 fold