ENVI_Model.Train returns nan

ccruizm commented 1 year ago

Good day!

I want to test your tool on my own dataset, but it is not clear what the input to the pipeline should be (e.g., raw vs. normalized counts for the sc and st data). Judging from the datasets on which you tested the pipeline, it seems we need to start with raw counts stored in a dense matrix. I think I have formatted my data to meet ENVI's requirements, but when I run ENVI_Model.Train I get the following output:

Training ENVI for 16384 steps
Trn: spatial Loss: nan, SC Loss: nan, Cov Loss: nan, KL Loss: nan: 100%|███████████████████████| 16384/16384 [42:57<00:00,  6.36it/s]
Finished Training ENVI! - calculating latent embedding, see 'envi_latent' obsm of ENVI.sc_data and ENVI.spatial_data
Finished imputing missing gene for spatial data! See 'imputation' in obsm of ENVI.spatial_data

The 'envi_latent' embedding contains only NaN values. What am I doing wrong? I could share a subset of the dataset to help figure out where the issue lies.

Another issue I have is setting up the tool to use GPU (A100). I see that my device is available:

import torch

# Check if CUDA is available
if torch.cuda.is_available():
    print("CUDA is available")

    # Get the number of available GPUs
    num_gpus = torch.cuda.device_count()
    print(f"Number of available GPUs: {num_gpus}")

    # Get the name of each available GPU
    for i in range(num_gpus):
        gpu_name = torch.cuda.get_device_name(i)
        print(f"GPU {i}: {gpu_name}")
else:
    print("CUDA is not available")

CUDA is available
Number of available GPUs: 1
GPU 0: NVIDIA A100-SXM4-40GB

I set

import os

# set before importing the deep-learning framework
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

But still, ENVI does not recognize the GPU. Do you have any advice on how to fix this?

Thanks in advance!
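
On the GPU question above, two things may be worth ruling out. First, `CUDA_VISIBLE_DEVICES` is read when the CUDA runtime initializes, so setting it after the framework has already been imported and initialized may have no effect. Second, the PyTorch check only shows that PyTorch can see the A100; according to the maintainer's reply further down, this version of ENVI was built on TensorFlow, which needs its own GPU-enabled install. The sketch below uses a hypothetical helper, `select_gpus` (not part of ENVI), that sets the variables and reports which frameworks were already loaded at that point.

```python
import os
import sys

# Frameworks whose import may already have initialized CUDA.
FRAMEWORKS = ("tensorflow", "torch", "jax")

def select_gpus(device_ids="0"):
    """Hypothetical helper: set the GPU env vars and return the
    frameworks that had already been imported when this was called."""
    already_imported = [name for name in FRAMEWORKS if name in sys.modules]
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = device_ids
    return already_imported

loaded = select_gpus("0")
if loaded:
    # The variables may be ignored in this case; restarting the kernel
    # and calling select_gpus() first is the safer order.
    print(f"Warning: {loaded} imported before CUDA_VISIBLE_DEVICES was set")
```

With TensorFlow installed, `tf.config.list_physical_devices("GPU")` lists the devices TensorFlow itself can use; an empty list usually points to a CPU-only build or a CUDA/cuDNN version mismatch rather than a problem in ENVI.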

sunericd commented 1 year ago

I am running into a similar issue where the imputation contains only NaN values. During training, the loss starts out as a real number and then becomes NaN after some number of steps (see the training log below).

I am running:

import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# run ENVI
ENVI_Model = ENVI.ENVI(spatial_data = spatial_data, sc_data = sc_data)
ENVI_Model.Train()
ENVI_Model.impute()

# get imputation
imputed = ENVI_Model.spatial_data.obsm['imputation']

And this is a snippet of the training log, which completes successfully but returns imputation of only NaNs:

Trn: spatial Loss: -8.73282, SC Loss: -0.54686, Cov Loss: -0.01238, KL Loss: 0.91851:   0%|          | 63/16384 [00:20<1:15:04,  3.62it/s]
Trn: spatial Loss: -8.73282, SC Loss: -0.54686, Cov Loss: -0.01238, KL Loss: 0.91851:   0%|          | 64/16384 [00:21<1:13:31,  3.70it/s]
Trn: spatial Loss: nan, SC Loss: nan, Cov Loss: nan, KL Loss: nan:   0%|          | 64/16384 [00:21<1:13:31,  3.70it/s]                   
Trn: spatial Loss: nan, SC Loss: nan, Cov Loss: nan, KL Loss: nan:   0%|          | 65/16384 [00:21<1:22:38,  3.29it/s]
Trn: spatial Loss: nan, SC Loss: nan, Cov Loss: nan, KL Loss: nan:   0%|          | 66/16384 [00:21<1:20:49,  3.37it/s]
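
To pin down when the loss diverges in a long captured log, the progress-bar lines can be scanned for the first NaN. This is a minimal sketch that assumes the line format shown above; `first_nan_step` and `LINE_RE` are hypothetical names, not part of ENVI.

```python
import math
import re

# Captures the four loss values and the step counter of one progress-bar line.
LINE_RE = re.compile(
    r"spatial Loss: (?P<spatial>\S+), SC Loss: (?P<sc>\S+), "
    r"Cov Loss: (?P<cov>\S+), KL Loss: (?P<kl>[^:\s]+):"
    r".*?(?P<step>\d+)/(?P<total>\d+) \["
)

def first_nan_step(log_lines):
    """Return the step counter on the first line with a NaN loss, or None."""
    for line in log_lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a training progress line
        losses = [float(match[name]) for name in ("spatial", "sc", "cov", "kl")]
        if any(math.isnan(value) for value in losses):
            return int(match["step"])
    return None

sample = [
    "Trn: spatial Loss: -8.73282, SC Loss: -0.54686, Cov Loss: -0.01238, KL Loss: 0.91851:   0%|          | 64/16384 [00:21<1:13:31,  3.70it/s]",
    "Trn: spatial Loss: nan, SC Loss: nan, Cov Loss: nan, KL Loss: nan:   0%|          | 64/16384 [00:21<1:13:31,  3.70it/s]",
    "Trn: spatial Loss: nan, SC Loss: nan, Cov Loss: nan, KL Loss: nan:   0%|          | 65/16384 [00:21<1:22:38,  3.29it/s]",
]
print(first_nan_step(sample))  # 64
```

Knowing the step makes it easier to compare runs, for example to see whether a different subset of the data diverges at the same point.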

The inputs are the same type as in the tutorial, where spatial_data.X and sc_data.X are both dense float32 numpy arrays. It would be useful to get some insight into why this failure is occurring and if additional preprocessing of the data might be necessary to get ENVI to run on the inputs.
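
Matching dtype and layout does not rule out problems in the values themselves. The sketch below summarizes properties of a dense cells x genes matrix that commonly matter for count-based models: non-finite entries, negative values, non-integer values (which suggest normalized rather than raw counts), and all-zero cells or genes. `summarize_counts` is a hypothetical helper, and whether any of these conditions is what triggers the NaN loss in ENVI is an assumption, not something confirmed in this thread.

```python
import numpy as np

def summarize_counts(X, name="X"):
    """Hypothetical helper: summarize a dense cells x genes matrix."""
    X = np.asarray(X)
    finite = np.isfinite(X)
    finite_values = X[finite]
    return {
        "name": name,
        "shape": X.shape,
        "dtype": str(X.dtype),
        "n_nonfinite": int((~finite).sum()),   # NaN or +/-inf entries
        "n_negative": int((finite_values < 0).sum()),
        "all_integer_valued": bool(np.all(finite_values == np.round(finite_values))),
        "n_zero_cells": int((np.count_nonzero(X, axis=1) == 0).sum()),
        "n_zero_genes": int((np.count_nonzero(X, axis=0) == 0).sum()),
    }

# Toy matrix; in practice one would pass spatial_data.X and sc_data.X.
toy = np.array([[1, 0, 2], [0, 0, 0], [3, 0, np.nan]], dtype=np.float32)
print(summarize_counts(toy, name="toy"))
```

Running the same summary on the tutorial data and on the failing data gives a quick side-by-side view of how the two inputs differ.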

I am also running into similar issues with getting ENVI to run on GPU using only the setup specified in the tutorial examples.

joan-yanqiong commented 11 months ago

I had the same problem when I tried to run the tutorial on my MBP with an M1 Pro chip. When I ran the tutorial on a cluster, though, it did work. What architecture are you running on?

myylee commented 9 months ago

I had the same problem described here. All losses are NaN right from the beginning. I was able to run the tutorial successfully, but when I switched to my own data, it did not work. I passed raw counts to the model for both spatial_data and sc_data. I would greatly appreciate any insight into what might be causing this. Thank you!
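
One difference between tutorial data and raw user data that can be checked quickly is the presence of cells or genes with zero total counts, since any per-cell or per-gene normalization of such rows or columns evaluates 0/0. The sketch below drops them from a dense count matrix. `drop_empty` is a hypothetical helper, and it is an assumption (not confirmed in this thread) that empty cells or genes are what causes the NaN losses here.

```python
import numpy as np

def drop_empty(X):
    """Hypothetical helper: remove cells (rows) and genes (columns)
    whose total count is zero; also return the boolean masks used."""
    X = np.asarray(X)
    keep_cells = X.sum(axis=1) > 0
    keep_genes = X.sum(axis=0) > 0
    return X[np.ix_(keep_cells, keep_genes)], keep_cells, keep_genes

counts = np.array([[1, 0, 2], [0, 0, 0], [3, 0, 4]], dtype=np.float32)
filtered, keep_cells, keep_genes = drop_empty(counts)
print(filtered.shape)  # (2, 2)
```

The masks are returned so that the same filter can be applied to the AnnData object itself, for example `adata[keep_cells][:, keep_genes]`, which keeps `obs` and `var` aligned with the filtered matrix.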

DoronHav commented 8 months ago

Hi everyone!

We just released an updated version of ENVI based on JAX/Flax instead of TensorFlow, and we took care of the stability issues!

Please try again with the new version!
