Open ssobt opened 10 months ago
Hi @ssobt
Looks like the main issue is indeed that the sample has many areas with very low counts. I would address that issue rather than figuring out how to input nuclei counts.
Please don't use older versions (cell2location.run_cell2location) because they are not supported and don't correspond to the published paper.
We don't support nuclei count use because we did not find providing that information useful in benchmarks and it is not available for many datasets. Does the analysis work well and provide the expected results with v.02-alpha?
That said, I don't see why providing a 2D shape=(obs, 1) array to N_cells_per_location should be a problem. What error do you see in the latest version?
Actually, I see a problem with the latest version. This line https://github.com/BayraktarLab/cell2location/blob/b2f38944dd13f3b3024e10f97abaf3240f6cccfe/cell2location/models/_cell2location_model.py#L91 needs to be changed to support both scalar values and arrays. Feel free to contribute a PR.
Hi, thanks for the quick response. I don't see why that line in the last comment would cause problems. It would be a scalar divided by an array divided by another scalar, so detection_mean_ should come out as an array, which shouldn't be a problem, right?
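To make the shape question concrete, here is a minimal NumPy sketch of "scalar / array / scalar" with a 1D and a 2D per-location array. It does not reproduce the linked line of cell2location; the names (numerator, denominator, per_location) and the sizes are made up for illustration. It only shows that the division itself works for both shapes, while combining the result with an (obs, factors) matrix behaves differently for a 1D array than for a shape=(obs, 1) array.

```python
import numpy as np

n_obs, n_factors = 6, 3  # hypothetical sizes, not taken from the model
rng = np.random.default_rng(0)

# Dummy per-location cell counts in the range 1..3
n_cells_1d = rng.integers(1, 4, size=n_obs)   # shape (n_obs,)
n_cells_2d = n_cells_1d.reshape(n_obs, 1)     # shape (n_obs, 1)

numerator, denominator = 0.5, 30.0  # hypothetical scalars

# scalar / array / scalar gives an array with the shape of the input array
ratio_1d = numerator / n_cells_1d / denominator
ratio_2d = numerator / n_cells_2d / denominator

# A hypothetical (obs, factors) quantity that the ratio is later combined with
per_location = np.ones((n_obs, n_factors))

# The (obs, 1) array broadcasts across factors
combined = per_location * ratio_2d

# The 1D array does not: NumPy aligns trailing dimensions (3 vs 6)
try:
    per_location * ratio_1d
    one_d_broadcasts = True
except ValueError:
    one_d_broadcasts = False

print(ratio_1d.shape, ratio_2d.shape, combined.shape, one_d_broadcasts)
```

So the division alone is not where a shape problem would appear; it would show up in later steps that expect a scalar or a particular array shape.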
For some of your questions:
With v.02-alpha and a scalar value for 'cells_per_spot', the model wasn't able to call the low RNA content areas either, similar to the latest version. I tried running the latest version (v.0.1.3) with a dummy 1D array np.random.randint(1,4,2163) and a 2D array np.random.randint(1,4,2163).reshape(2163,1) without any luck. The errors were different each time, and I've put them both below. I don't think the line you mentioned is what's causing the problem, so is there anything I can do to remedy them? Thank you for taking such a thorough look at the data.
[Images: results with alpha 0.5, alpha 5 and alpha 20]
Has this problem been solved? I am interested in using the N_cells_per_location parameter by inputting nuclei counts.
Hi @huiyijiangling @ssobt
We are working on incorporating this information at the moment. It is not as simple as changing the above line but requires substantial changes to the model to effectively use segmentation-derived N_cells_per_location. While this will become possible in a month or so - you need to keep in mind that segmentation is not possible for all datasets and it is mostly reliable for FFPE protocols.
Also, when you provide segmentation information, detection_alpha has to be large, e.g. 200.
Segmentation information and a large detection_alpha=200 would likely become the new recommended setting.
Hi @vitkl
First of all, thank you for your work !
I'm also interested, so if you have any news about your previous comments (the possibility to input the number of cells per spot instead of a sample-wise value), let us know 😄
Regards, Benoit
Hi @vitkl !
Do you have any news ? 😄
Hi @benoitsam
You can try using this experimental branch https://github.com/BayraktarLab/cell2location/pull/337#issuecomment-2422459954. I am planning to finalise this branch and its dependencies (scvi-tools) by December-February.
Hi @vitkl
Thanks for your reply ! I'll try to follow your instructions to install and use this new feature asap and I'll get back here with the results 😄
Hi @vitkl
Quick comments for the #337 :
There was an error with the scvi-tools that you forked. They removed the optional scvi.criticism package so a dirty fix was to do :
git clone https://github.com/vitkl/scvi-tools.git --single-branch --branch pyro_fixes
pyproject.toml : Removed "criticism" line 90
pip install ../scvi-tools
And then for cell2location with your ongoing branch :
pip install "cell2location[tutorials] @ git+https://github.com/BayraktarLab/cell2location.git@hires_sliding_window"
It works if I do import cell2location in the Python interpreter.
Could you elaborate on your comment about N_cells_per_location?
# ideally this is not count of cells
# but % of spot occupied by cells * 0.9999 quantile of N cells across the data
I'm not sure I understand what I'm supposed to use as input, because I only have a count of cells per spot 🤔
Regards, Benoit
Thanks for suggesting the fix. Good to know.
The idea is that cell abundance is proportional to the number of cells and % of the spot occupied by cells - so combining the two measures gives a better result.
You can use a count of cells by spot too. You need to delete spots with 0 cells.
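As a rough illustration of the two options above, the sketch below builds a per-spot prior from segmentation output with NumPy. The function name build_n_cells_prior, its arguments, and the returned shape=(obs, 1) are assumptions made for this example, not part of the cell2location API. It drops spots with 0 cells, and if a fraction of the spot occupied by cells is available, it uses fraction * 0.9999 quantile of the counts; otherwise it falls back to the raw counts.

```python
import numpy as np

def build_n_cells_prior(cell_counts, occupied_fraction=None, q=0.9999):
    """Hypothetical helper: return (keep_mask, prior) for spots with >0 cells.

    cell_counts: per-spot number of segmented cells/nuclei, shape (obs,)
    occupied_fraction: optional per-spot fraction (0..1) of the spot area
        covered by cells, shape (obs,)
    """
    counts = np.asarray(cell_counts, dtype=float)
    keep = counts > 0  # spots with 0 cells have to be removed

    if occupied_fraction is None:
        # Fallback: use the count of cells per spot directly
        prior = counts[keep]
    else:
        # Occupied fraction scaled by the 0.9999 quantile of counts across the data
        upper = np.quantile(counts, q)
        prior = np.asarray(occupied_fraction, dtype=float)[keep] * upper

    # Column vector, one value per remaining spot
    return keep, prior.reshape(-1, 1)

counts = np.array([0, 2, 5, 3])
fraction = np.array([0.0, 0.4, 1.0, 0.6])
keep, prior = build_n_cells_prior(counts, fraction)
keep_counts_only, prior_counts_only = build_n_cells_prior(counts)
```

The same keep mask would then be used to subset the spatial data (for example adata_vis[keep].copy() on an AnnData object) so that the prior and the spots stay aligned.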
Hi @vitkl
Just to let you know, I managed to use this version on my laptop on a toy dataset. However, I'm currently trying to use it on a real sample (nextflow pipeline on GPU cluster with SLURM) and I encounter some memory issues.
Run info:
sample=3000 spots
use_gpu=True
Nb intersecting genes : 14358
RAM: between 60 and 120 GB -> OOM
For cell2location parameters :
max_epochs=30000
posterior_sampling=1000
The "out of memory" issue appears every time after the training has completed (even with 30000 iterations, I get the "Trainer.fit stopped: max_epochs=30000 reached." message) but at the start of the export_posterior method.
The job ends correctly if I use max_epochs and posterior_sampling with low values like 10.
I wondered if you suspect that your modifications may have impacted the resources required to run cell2location. (Because I managed to use the "classical" version on same cluster with same parameters)
Regards, Benoit
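A back-of-the-envelope estimate may help explain why the job fits in RAM with 10 posterior samples but not with 1000. The sketch below assumes that, at some point during export, a dense float32 array of shape (samples, spots, genes) is held in memory. That assumption has not been checked against the code, and dense_array_bytes is a helper made up for this example.

```python
def dense_array_bytes(n_samples, n_obs, n_vars, bytes_per_value=4):
    """Size in bytes of a dense (n_samples, n_obs, n_vars) array (float32 by default)."""
    return n_samples * n_obs * n_vars * bytes_per_value

# Numbers from the run info above: 3000 spots, 14358 genes
total_bytes = dense_array_bytes(n_samples=1000, n_obs=3000, n_vars=14358)
low_bytes = dense_array_bytes(n_samples=10, n_obs=3000, n_vars=14358)

print(f"1000 samples: {total_bytes / 1e9:.1f} GB")
print(f"10 samples:   {low_bytes / 1e9:.1f} GB")
```

Under that assumption, a single spots-by-genes variable sampled 1000 times would already exceed the 60 to 120 GB that were available, which would be consistent with the crash happening only at the export step.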
Hi @benoitsam
It looks like the issue is with posterior sampling rather than training, and you run out of RAM, not GPU memory, right?
The resource change may be due to the new version rather than to using these settings. Do you mean that you are using old parameters with new code?
In general, I would recommend computing quantiles directly like this:
import numpy as np

# In this section, we export the estimated cell abundance (summary of the posterior distribution).
adata_vis = mod.export_posterior(
    adata_vis,
    sample_kwargs={
        # this has to be done in batches due to a bug in the new version of the code
        'batch_size': int(np.ceil(adata_vis.n_obs / 8)),
        'accelerator': 'gpu',
        'return_observed': False,
    },
    add_to_obsm=['q05', 'q95', 'q50'],
    use_quantiles=True,
)
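For reference, the batch_size expression above splits the spots into at most 8 batches. A small worked example with the spot counts mentioned in this thread (3000 for the real sample, 1700 for the toy dataset); the helper name batch_plan is made up for illustration:

```python
import math

def batch_plan(n_obs, n_splits=8):
    """Return (batch_size, n_batches) for batch_size = ceil(n_obs / n_splits)."""
    batch_size = int(math.ceil(n_obs / n_splits))
    n_batches = int(math.ceil(n_obs / batch_size))
    return batch_size, n_batches

real_sample = batch_plan(3000)  # spots in the real sample
toy_sample = batch_plan(1700)   # spots in the toy dataset
print(real_sample, toy_sample)
```

A larger n_splits gives smaller batches, which may lower peak memory at the cost of more passes.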
It looks like the issue is with posterior sampling rather than training, and you run out of RAM, not GPU memory, right?
Yes it seems to be an Out of Memory from RAM. I had no warning or error log about GPU memory or CUDA issues.
The resource change may be due to the new version rather than to using these settings. Do you mean that you are using old parameters with new code?
I meant that I used the same "configuration" as when I used cell2location (v0.1.4) with N_cells_per_location for the whole sample. "Same configuration" meaning same cluster, nextflow, SLURM, epochs for training, etc.
I tried to use v0.1.5 by adapting the code and following your comments about the various changes.
In general, I would recommend computing quantiles directly like this:
I tried your suggestion. It works locally with the toy dataset (1700 spots, 10 epochs for the training). But with the real sample I get:
Traceback (most recent call last):
  File "/sps/lbmc/bsamson/vap/subworkflows/deconvolution/cell2location/fit_model_prior_by_spot.py", line 210, in <module>
    main()
  File "/sps/lbmc/bsamson/vap/subworkflows/deconvolution/cell2location/fit_model_prior_by_spot.py", line 185, in main
    adata_vis = mod.export_posterior(
  File "/pbs/throng/lbmc/bsamson/software/miniconda3/envs/cell2loc_prior_by_spot_env/lib/python3.10/site-packages/cell2location/models/_cell2location_model.py", line 520, in export_posterior
    self.samples[f"post_sample_{i}"] = self.posterior_quantile(q=q, **sample_kwargs)
  File "/pbs/throng/lbmc/bsamson/software/miniconda3/envs/cell2loc_prior_by_spot_env/lib/python3.10/site-packages/cell2location/models/base/_pyro_mixin.py", line 570, in posterior_quantile
    return self._posterior_quantile_minibatch(exclude_vars=exclude_vars, batch_size=batch_size, **kwargs)
  File "/pbs/throng/lbmc/bsamson/software/miniconda3/envs/cell2loc_prior_by_spot_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/pbs/throng/lbmc/bsamson/software/miniconda3/envs/cell2loc_prior_by_spot_env/lib/python3.10/site-packages/cell2location/models/base/_pyro_mixin.py", line 444, in _posterior_quantile_minibatch
    valid_sites = self._get_valid_sites(args, kwargs, return_observed=return_observed)
AttributeError: 'Cell2location' object has no attribute '_get_valid_sites'
EDIT:
It seems the error comes from me 😄
This method exists in your pyro_fixes branch of scvi-tools and is called locally on my laptop.
It must be an installation mixup of scvi-tools on my cluster 🤔
At last, it worked on my cluster for real samples 👍
Hi @vitkl !
I have some questions for you 😄
hires_sliding_window branch: does it only impact the spatial mapping part?
If I have already estimated cell-type signatures from a single reference with cell2location (main branch), can I use the same "model" for the mapping part with this hires_sliding_window branch (which uses the number of cells per spot as a prior instead of a sample-wise one)?
Regards, Benoit
Hi, thank you for this tool! I have a question about entering cell counts. I'm using an older version of cell2location (v.02-alpha) to input nuclei counts for the 'expected number of cells per location' hyperparameter. We're having some trouble getting the latest version (v.0.1.3) to assign cell probabilities to most of the tissue due to high RNA variability, after trying both 20 and 200 for alpha (see image below for alpha 200). Areas with low RNA content have very low probabilities assigned for any of the reference cell types. To try to alleviate the problem, we switched to the older version to input custom cell/nuclei counts. In v.02-alpha, when I input a 1-dimensional numpy array with the nuclei counts of each spot on the Visium slide (made by concatenating the rows of a 2D x,y array), the following error occurs, asking for one value instead of location-specific values:
Gamma has no finite default value to use, checked: ('median', 'mean', 'mode'). Pass testval argument or adjust so value is finite.
I tried entering the 2D array directly and got the same error. The model only started to run when I entered a single integer, so I was wondering how to input nuclei counts for each spot/location. Any advice on this would be great, thanks!
Here is the model setup: