ML4GLand / EUGENe

Elucidating the Utility of Genomic Elements with Neural Nets

deAlmeida22 re-analysis error #41

Closed BartTheeuwes closed 11 months ago

BartTheeuwes commented 11 months ago

First off, thank you for creating this package; I'm hoping it will speed up some of the analysis I am planning to do. As a trial run, I wanted to re-analyse the deAlmeida22 data in the same way as in the original paper. However, I'm running into multiple issues along the way.

Here is the code I tried to run to analyse the data, along with the resulting error output. (Downloading the data with seqdatasets.deAlmeida22() failed to load the sequence data correctly, hence the strange workaround I implemented.)

# Change this to where you would like to save all your results
import os
os.chdir("/mnt/scratchc/bggroup/bt392/eugene_practice/Tutorial/DeepSTARR")  # TODO: change this to your own directory
cwd = os.getcwd()
cwd

# Configure EUGENe directories, if you do not set these, EUGENe will use the default directories
from eugene import settings
settings.logging_dir = "./tutorial_logs" # Directory where EUGENe will save Tensorboard training logs and model checkpoints to
settings.output_dir = "./tutorial_output" # Directory where EUGENe will save output files to

#### Data

import seqdatasets
sdata_train = seqdatasets.deAlmeida22(dataset='train')
sdata_val = seqdatasets.deAlmeida22(dataset='val')
sdata_test = seqdatasets.deAlmeida22(dataset='test')

from eugene import preprocess as pp
# Something goes wrong with the zarr creation, so I have to manually rewrite the seq folder in the zarr... no idea why this goes wrong
import seqdata as sd
sdata = sd.read_flat_fasta(
                name="seq",
                out='/mnt/scratchc/bggroup/bt392/eugene_practice/Tutorial/DeepSTARR/eugene_data/deAlmeida22/deAlmeida22_train.zarr',
                fasta='/mnt/scratchc/bggroup/bt392/eugene_practice/Tutorial/DeepSTARR/eugene_data/deAlmeida22/Sequences_Train.fa',
                batch_size=128,
                fixed_length=True,
                overwrite=True
            )
sdata = sd.read_flat_fasta(
                name="seq",
                out='/mnt/scratchc/bggroup/bt392/eugene_practice/Tutorial/DeepSTARR/eugene_data/deAlmeida22/deAlmeida22_val.zarr',
                fasta='/mnt/scratchc/bggroup/bt392/eugene_practice/Tutorial/DeepSTARR/eugene_data/deAlmeida22/Sequences_Val.fa',
                batch_size=128,
                fixed_length=True,
                overwrite=True
            )
sdata = sd.read_flat_fasta(
                name="seq",
                out='/mnt/scratchc/bggroup/bt392/eugene_practice/Tutorial/DeepSTARR/eugene_data/deAlmeida22/deAlmeida22_test.zarr',
                fasta='/mnt/scratchc/bggroup/bt392/eugene_practice/Tutorial/DeepSTARR/eugene_data/deAlmeida22/Sequences_Test.fa',
                batch_size=128,
                fixed_length=True,
                overwrite=True
            )

sdata_train = seqdatasets.deAlmeida22(dataset='train')
sdata_val = seqdatasets.deAlmeida22(dataset='val')
sdata_test = seqdatasets.deAlmeida22(dataset='test')

# One hot encode all the sequences in the sdata using the wrapper function
pp.ohe_seqs_sdata(sdata_train, alphabet="DNA")
pp.ohe_seqs_sdata(sdata_val, alphabet="DNA")
pp.ohe_seqs_sdata(sdata_test, alphabet="DNA")

# Make unique ids for each sequence in the sdata
pp.make_unique_ids_sdata(sdata_train)
pp.make_unique_ids_sdata(sdata_val)
pp.make_unique_ids_sdata(sdata_test)

#########
## CANT FIGURE OUT HOW TO COMBINE TRAIN AND VAL & ADD COLUMN SO JUST SPLITTING TRAIN UP AGAIN
#########
# Split the training set into training and validation
pp.train_test_random_split(sdata_train, dim="_sequence", train_var="train_val", test_size=0.1)

#### Model
from eugene.models import zoo
from eugene import models

# Instantiate the DeepSTARR architecture (249 bp input, 2 regression outputs)
model = zoo.DeepSTARR(
    input_len=249,
    output_dim=2
)

model = models.SequenceModule(
    arch=model,
    task="regression", 
    loss_fxn="mse",
    scheduler='reduce_lr_on_plateau',
    optimizer="adam",
    metric="spearman",
    optimizer_lr=0.002
)

models.init_weights(model)

from eugene import train
import torch 

train.fit_sequence_module(
    model=model,
    sdata=sdata_train,
    seq_var="ohe_seq",
    target_vars=["Hk_log2_enrichment", "Dev_log2_enrichment"],
    in_memory=True,
    train_var="train_val",
    epochs=10,
    batch_size=128,
    num_workers=2,
    prefetch_factor=2,
    drop_last=False,
    name="DeepSTARR",
    version="Trial5",
    log_dir=settings.logging_dir,
    transforms={"ohe_seq": lambda x: x.swapaxes(1, 2)},  #"target": lambda x: torch.tensor(x, dtype=torch.float32)}
   # early_stopping_metric = "val_loss_epoch",
   # early_stopping_patience = 10
)

This resulted in the following error:

Dropping 0 sequences with NaN targets.
Loading ohe_seq and ['Hk_log2_enrichment', 'Dev_log2_enrichment'] into memory
No seed set
/Users/theeuw01/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/lightning_fabric/plugins/environments/slurm.py:191: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /Users/theeuw01/miniconda3/envs/eugene_cuda/lib/pyth ...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Users/theeuw01/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/lightning_fabric/plugins/environments/slurm.py:191: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /Users/theeuw01/miniconda3/envs/eugene_cuda/lib/pyth ...
2023-12-10 17:29:58.588960: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-10 17:29:58.589003: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-10 17:29:58.590824: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-10 17:29:58.600109: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-10 17:29:59.952313: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name         | Type             | Params
--------------------------------------------------
0 | arch         | DeepSTARR        | 7.7 M 
1 | train_metric | SpearmanCorrCoef | 0     
2 | val_metric   | SpearmanCorrCoef | 0     
3 | test_metric  | SpearmanCorrCoef | 0     
--------------------------------------------------
7.7 M     Trainable params
0         Non-trainable params
7.7 M     Total params
30.745    Total estimated model params size (MB)
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:00<00:00,  4.94it/s]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[33], line 4
      1 from eugene import train
      2 import torch 
----> 4 train.fit_sequence_module(
      5     model=model,
      6     sdata=sdata_train,
      7     seq_var="ohe_seq",
      8     target_vars=["Hk_log2_enrichment", "Dev_log2_enrichment"],
      9     in_memory=True,
     10     train_var="train_val",
     11     epochs=10,
     12     batch_size=128,
     13     num_workers=2,
     14     prefetch_factor=2,
     15     drop_last=False,
     16     name="DeepSTARR",
     17     version="Trial5",
     18     log_dir=settings.logging_dir,
     19     transforms={"ohe_seq": lambda x: x.swapaxes(1, 2)},  #"target": lambda x: torch.tensor(x, dtype=torch.float32)}
     20    # early_stopping_metric = "val_loss_epoch",
     21    # early_stopping_patience = 10
     22 )

File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/eugene/train/_fit.py:273, in fit_sequence_module(model, sdata, seq_var, target_vars, in_memory, train_var, epochs, gpus, batch_size, num_workers, prefetch_factor, transforms, drop_last, logger, log_dir, name, version, early_stopping_metric, early_stopping_patience, early_stopping_verbose, model_checkpoint_k, model_checkpoint_monitor, seed, return_trainer, **kwargs)
    270 name = name if name is not None else model_name
    272 # Use the above fit function
--> 273 trainer = fit(
    274     model=model,
    275     train_dataloader=train_dataloader,
    276     val_dataloader=val_dataloader,
    277     epochs=epochs,
    278     gpus=gpus,
    279     logger=logger,
    280     log_dir=log_dir,
    281     name=name,
    282     version=version,
    283     early_stopping_metric=early_stopping_metric,
    284     early_stopping_patience=early_stopping_patience,
    285     early_stopping_verbose=early_stopping_verbose,
    286     model_checkpoint_k=model_checkpoint_k,
    287     model_checkpoint_monitor=model_checkpoint_monitor,
    288     seed=seed,
    289     return_trainer=return_trainer,
    290     **kwargs,
    291 )
    293 if return_trainer:
    294     return trainer

File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/eugene/train/_fit.py:123, in fit(model, train_dataloader, val_dataloader, epochs, gpus, logger, log_dir, name, version, early_stopping_metric, early_stopping_patience, early_stopping_verbose, model_checkpoint_k, model_checkpoint_monitor, seed, return_trainer, **kwargs)
    113 trainer = Trainer(
    114     max_epochs=epochs,
    115     logger=logger,
   (...)
    119     **kwargs,
    120 )
    122 # Fit
--> 123 trainer.fit(
    124     model, train_dataloaders=train_dataloader, val_dataloaders=val_dataloader
    125 )
    126 if return_trainer:
    127     return trainer

File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:544, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    542 self.state.status = TrainerStatus.RUNNING
    543 self.training = True
--> 544 call._call_and_handle_interrupt(
    545     self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    546 )

File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py:44, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     42     if trainer.strategy.launcher is not None:
     43         return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
---> 44     return trainer_fn(*args, **kwargs)
     46 except _TunerExitException:
     47     _call_teardown_hook(trainer)

File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:580, in Trainer._fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    573 assert self.state.fn is not None
    574 ckpt_path = self._checkpoint_connector._select_ckpt_path(
    575     self.state.fn,
    576     ckpt_path,
    577     model_provided=True,
    578     model_connected=self.lightning_module is not None,
    579 )
--> 580 self._run(model, ckpt_path=ckpt_path)
    582 assert self.state.stopped
    583 self.training = False

File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:989, in Trainer._run(self, model, ckpt_path)
    984 self._signal_connector.register_signal_handlers()
    986 # ----------------------------
    987 # RUN THE TRAINER
    988 # ----------------------------
--> 989 results = self._run_stage()
    991 # ----------------------------
    992 # POST-Training CLEAN UP
    993 # ----------------------------
    994 log.debug(f"{self.__class__.__name__}: trainer tearing down")

File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1033, in Trainer._run_stage(self)
   1031 if self.training:
   1032     with isolate_rng():
-> 1033         self._run_sanity_check()
   1034     with torch.autograd.set_detect_anomaly(self._detect_anomaly):
   1035         self.fit_loop.run()

File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1062, in Trainer._run_sanity_check(self)
   1059 call._call_callback_hooks(self, "on_sanity_check_start")
   1061 # run eval step
-> 1062 val_loop.run()
   1064 call._call_callback_hooks(self, "on_sanity_check_end")
   1066 # reset logger connector

File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/loops/utilities.py:182, in _no_grad_context.<locals>._decorator(self, *args, **kwargs)
    180     context_manager = torch.no_grad
    181 with context_manager():
--> 182     return loop_run(self, *args, **kwargs)

File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/loops/evaluation_loop.py:141, in _EvaluationLoop.run(self)
    139         self._restarting = False
    140 self._store_dataloader_outputs()
--> 141 return self.on_run_end()

File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/loops/evaluation_loop.py:253, in _EvaluationLoop.on_run_end(self)
    250 self.trainer._logger_connector._evaluation_epoch_end()
    252 # hook
--> 253 self._on_evaluation_epoch_end()
    255 logged_outputs, self._logged_outputs = self._logged_outputs, []  # free memory
    256 # include any logged outputs on epoch_end

File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/loops/evaluation_loop.py:331, in _EvaluationLoop._on_evaluation_epoch_end(self)
    328 call._call_callback_hooks(trainer, hook_name)
    329 call._call_lightning_module_hook(trainer, hook_name)
--> 331 trainer._logger_connector.on_epoch_end()

File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:187, in _LoggerConnector.on_epoch_end(self)
    185 def on_epoch_end(self) -> None:
    186     assert self._first_loop_iter is None
--> 187     metrics = self.metrics
    188     self._progress_bar_metrics.update(metrics["pbar"])
    189     self._callback_metrics.update(metrics["callback"])

File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:226, in _LoggerConnector.metrics(self)
    224 on_step = self._first_loop_iter is not None
    225 assert self.trainer._results is not None
--> 226 return self.trainer._results.metrics(on_step)

File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:492, in _ResultCollection.metrics(self, on_step)
    490     # populate progress_bar metrics. convert tensors to numbers
    491     if result_metric.meta.prog_bar:
--> 492         metrics["pbar"][forked_name] = convert_tensors_to_scalars(value)
    494 return metrics

File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/lightning_fabric/utilities/apply_func.py:128, in convert_tensors_to_scalars(data)
    123         raise ValueError(
    124             f"The metric `{value}` does not contain a single element, thus it cannot be converted to a scalar."
    125         )
    126     return value.item()
--> 128 return apply_to_collection(data, Tensor, to_item)

File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/lightning_utilities/core/apply_func.py:64, in apply_to_collection(data, dtype, function, wrong_dtype, include_none, allow_frozen, *args, **kwargs)
     62 # fast path for the most common cases:
     63 if isinstance(data, dtype):  # single element
---> 64     return function(data, *args, **kwargs)
     65 if isinstance(data, list) and all(isinstance(x, dtype) for x in data):  # 1d homogeneous list
     66     return [function(x, *args, **kwargs) for x in data]

File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/lightning_fabric/utilities/apply_func.py:123, in convert_tensors_to_scalars.<locals>.to_item(value)
    121 def to_item(value: Tensor) -> Union[int, float, bool]:
    122     if value.numel() != 1:
--> 123         raise ValueError(
    124             f"The metric `{value}` does not contain a single element, thus it cannot be converted to a scalar."
    125         )
    126     return value.item()

ValueError: The metric `tensor([-0.0243, -0.0024], device='cuda:0')` does not contain a single element, thus it cannot be converted to a scalar.
adamklie commented 11 months ago

Thanks for trying out the package!

This stems from an annoying nuance of using "spearman" as the metric in the SequenceModule (SpearmanCorrCoef). It won't automatically average the metric across tasks the way R2Score does, so PyTorch Lightning complains when it is handed a two-element tensor of correlations (one per task) to log instead of a scalar.

After playing with it for a bit, I wasn't able to find a workaround for using "spearman" with the current release on PyPI. For now, I'd recommend just training with "r2score" as the passed-in metric.

model = models.SequenceModule(
    arch=model,
    task="regression", 
    loss_fxn="mse",
    scheduler='reduce_lr_on_plateau',
    optimizer="adam",
    metric="r2score",
    optimizer_lr=0.002
)

The metric doesn't affect the fitting of the model, but you won't get to see how the Spearman correlation changes across training. You can always calculate it on the test set post hoc, and I'm going to clean up how metrics are handled in a future release so this isn't an issue.
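For reference, a post-hoc calculation could look something like the sketch below, using torchmetrics' SpearmanCorrCoef directly. The preds and targets tensors here are just placeholders for whatever test-set prediction and label arrays you end up with; they aren't produced by the code above.

import torch
from torchmetrics import SpearmanCorrCoef

# Placeholder tensors standing in for test-set predictions and labels,
# shaped (n_sequences, n_tasks) with columns ordered like target_vars
preds = torch.randn(1000, 2)
targets = torch.randn(1000, 2)

# num_outputs=2 returns one correlation per task as a 2-element tensor
spearman = SpearmanCorrCoef(num_outputs=2)
per_task = spearman(preds, targets)
print(dict(zip(["Hk_log2_enrichment", "Dev_log2_enrichment"], per_task.tolist())))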

I was able to reproduce the issue with the DeepSTARR zarr files. There's a bug that I will patch, and you can then install SeqDatasets from source if you want to go that route (this line shouldn't exist: https://github.com/ML4GLand/SeqDatasets/blob/main/seqdatasets/_datasets.py#L318C13-L318C49). However, your workaround should work just fine.

Sorry about the difficulty with concatenating objects. I love xarray, but it's not the most intuitive library to use. This is how I would concatenate two SeqData objects:

import xarray as xr
sdata_train["train_val"] = True
sdata_val["train_val"] = False
sdata_training = xr.concat([sdata_train, sdata_val], dim="_sequence")
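With the combined object, the same fit call from your snippet should then work, pointing sdata at sdata_training and keeping train_var="train_val". Just a sketch reusing your own arguments, not something I've re-run:

# Train on the concatenated object, splitting on the boolean train_val variable
train.fit_sequence_module(
    model=model,
    sdata=sdata_training,
    seq_var="ohe_seq",
    target_vars=["Hk_log2_enrichment", "Dev_log2_enrichment"],
    in_memory=True,
    train_var="train_val",
    epochs=10,
    batch_size=128,
    name="DeepSTARR",
    log_dir=settings.logging_dir,
    transforms={"ohe_seq": lambda x: x.swapaxes(1, 2)},
)
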
BartTheeuwes commented 11 months ago

Thank you, Adam.

After looking around the ML4GLand organization, I came across the use_cases repository. These seem like really useful tutorials; I'm running the DeepSTARR analysis with EUGENe now, using the following notebooks: https://github.com/ML4GLand/use_cases/tree/main/DeepSTARR/eugene

There are some minor errors in the code that I'm tweaking one by one. The first two notebooks now run fine, and I'm testing the attribution analysis now. I noticed that the EUGENe documentation does not link to the use_cases repository; I think it would be very helpful for new users to have these tutorials more easily accessible.

Once again, thank you for creating this package.

adamklie commented 11 months ago

Great, glad you found those useful. Some were created with an earlier version of the package and do need to be updated. If you end up with a set of working notebooks, feel free to submit a pull request! :)

That's a great suggestion; I will add the link to the landing page of the documentation.

Going to close this for now, but feel free to reopen if you run into any other sticking points!