Ideal hyperparams scArches

epaaso commented 1 year ago

[x] Epochs, ELBO and batch size can be tuned with https://docs.scvi-tools.org/en/stable/tutorials/notebooks/tuning/autotune_scvi.html
- Something that is not covered there is the idea of building an ensemble...
[x] Ideal number of epochs
- Log the variance of the loss as a function of the number of genes, the number of cells and the average variance
[x] Elbow training?
- This refers to the ELBO (evidence lower bound), the objective used to train VAEs. It is a lower bound on the log marginal likelihood of the data, so we can use this metric to see if our model is faring well. More on the ELBO below
[x] Batch size
- The batch size can be as large as the GPU RAM allows, but larger batches may lose some generalization. We saw this happen: a batch size of 256 gave higher accuracy.
[x] Num workers
- It is fine to use as many workers as you have CPUs available, but this increases memory usage considerably.
- We also don't really know how to change the number of workers. When using scvi directly it should be possible through `scvi.settings.dl_num_workers`, but we didn't see this work. Maybe it has to do with the order of loading; we still have to figure it out.
- This is an example of capping the number of BLAS threads with threadpoolctl:
```
import numpy as np
from threadpoolctl import threadpool_limits

# Cap the BLAS thread pool at 30 threads inside this block
with threadpool_limits(limits=30, user_api='blas'):
    # Creating very large arrays
    arr1 = np.random.rand(10000, 10000)
    arr2 = np.random.rand(10000, 10000)

    # Performing lots of math that should utilize multiple cores
    result = np.dot(np.linalg.inv(arr1), arr2)
```
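On the ELBO item in the checklist above: a minimal sketch of why it is a lower bound, using a toy model that is an assumption for illustration only (z ~ N(0, 1), x | z ~ N(z, 1)), not the scVI or scArches likelihood. In this toy case the exact log evidence is known in closed form, so the ELBO of any Gaussian q(z) can be compared against it directly. The function names are hypothetical.

```python
import math


def log_evidence(x: float) -> float:
    """Exact log p(x) for the toy model z ~ N(0, 1), x | z ~ N(z, 1), so x ~ N(0, 2)."""
    return -0.5 * math.log(2 * math.pi * 2.0) - x * x / 4.0


def elbo(x: float, m: float, s2: float) -> float:
    """ELBO for q(z) = N(m, s2): E_q[log p(x | z)] - KL(q(z) || p(z))."""
    # Expected reconstruction term under q, in closed form for Gaussians
    expected_log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    # KL divergence between N(m, s2) and the standard normal prior
    kl = 0.5 * (s2 + m * m - 1.0 - math.log(s2))
    return expected_log_lik - kl


x = 1.0
print("log p(x):            ", log_evidence(x))
print("ELBO, poor q:        ", elbo(x, m=0.0, s2=1.0))
print("ELBO, exact posterior:", elbo(x, m=x / 2, s2=0.5))
```

With a poor q the ELBO sits below log p(x); when q equals the true posterior N(x/2, 1/2) the two coincide. During training, a rising ELBO (equivalently a falling negative ELBO loss) therefore suggests the model is fitting the data better, although the gap to the true evidence is not observable for a real VAE.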

epaaso commented 11 months ago

The model will be trained for a given number of epochs, a training iteration where every cell is passed through the network. By default scVI uses the following heuristic to set the number of epochs. For datasets with fewer than 20,000 cells, 400 epochs will be used and as the number of cells grows above 20,000 the number of epochs is continuously reduced. The reasoning behind this is that as the network sees more cells during each epoch it can learn the same amount of information as it would from more epochs with fewer cells.

Implement this in your notebooks with: `max_epochs_scvi = np.min([round((20000 / adata.n_obs) * 400), 400])`

With our 400,000-cell datasets the number of epochs comes out to only 20, which does not achieve optimal accuracy.
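To make the arithmetic of the heuristic above concrete, here is a small standalone sketch that uses plain Python instead of `adata.n_obs`. The function name `heuristic_max_epochs` and its parameter names are hypothetical; the formula itself is the one quoted above.

```python
def heuristic_max_epochs(n_cells: int, cap: int = 400, ref_cells: int = 20_000) -> int:
    """Epoch heuristic quoted above: min(round(20000 / n_cells * 400), 400)."""
    if n_cells <= 0:
        raise ValueError("n_cells must be positive")
    return min(round((ref_cells / n_cells) * cap), cap)


# Show how the epoch count shrinks as the dataset grows
for n_cells in (5_000, 20_000, 100_000, 400_000):
    print(n_cells, heuristic_max_epochs(n_cells))
```

For 400,000 cells this gives 20 epochs, matching the number reported here. In a notebook, `heuristic_max_epochs(adata.n_obs)` would give the same value as the one-liner.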

epaaso commented 3 months ago

From https://docs.scarches.org/en/latest/training_tips.html. This is where the latent dimensions are recommended:

Regarding architecture, always try the default one first ([128,128], z_dimension=10) and check the results. If you have more complicated data with many datasets, conditions, etc., then you can increase the depth ([128,128,128] or [128,128,128,128]). According to our experiments, small values of z_dimension between 10 (default) and 20 are good.

epaaso commented 2 months ago

Using 3 layers and training for 900 epochs instead of 300, I managed to get 90% accuracy instead of 71%. I think part of this was also due to stopping and starting again every 300 epochs, as the learning rate may follow a scheduler with a gamma decay factor that starts over on each restart.
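A rough sketch of the restart effect speculated about above, assuming a step-decay schedule that multiplies the learning rate by a factor `gamma` every `step_size` epochs. Whether the training setup used here actually applies such a scheduler is not confirmed; the function names and all numeric values (`base_lr`, `gamma`, `step_size`) are hypothetical.

```python
def step_decay_lr(epoch: int, base_lr: float = 1e-3, gamma: float = 0.5,
                  step_size: int = 100) -> float:
    """Learning rate under step decay: base_lr * gamma ** (epoch // step_size)."""
    return base_lr * gamma ** (epoch // step_size)


def lr_with_restarts(epoch: int, restart_every: int = 300, **kwargs) -> float:
    """Same schedule, but the scheduler state is reset every `restart_every` epochs."""
    return step_decay_lr(epoch % restart_every, **kwargs)


# Compare one continuous 900-epoch run with three 300-epoch runs
for epoch in (0, 299, 300, 600, 899):
    print(epoch, step_decay_lr(epoch), lr_with_restarts(epoch))
```

Under these assumptions, a single 900-epoch run keeps shrinking the learning rate, while restarting every 300 epochs brings it back to `base_lr` each time, which could explain part of the accuracy difference.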

Nevertheless, this model still predicted the cells in the Zuani dataset very poorly. Next I will check whether it also mispredicts them in the Deng dataset.

It also predicted wrongly in Deng, but that was because we were training only on tumor cells.

epaaso commented 1 month ago

Also consider that the HLCA atlas did not correct for sample, as they wanted to maintain variability. We are not correcting for sample, but for dataset.

epaaso / sc-luca-explore

Ideal hyperparams scArches #1
