Error with train_on_task.py

gonzalobenegas commented 11 months ago

Thank you so much for this resource! I really appreciate the thoughtful task choice and broad coverage of models.

I'm trying out an example model and task:

Downloaded the data from Zenodo, as it was easier. Is it up to date? Do you have a script to download the data?
Successfully ran python scripts/precompute_embeddings.py model=resnetlm task=gene_finding
I'm having issues with python scripts/train_on_task.py --config-name gene_finding embedder=resnetlm

(BEND) gbenegas@luthien:/scratch/users/gbenegas/projects/BEND$ python scripts/train_on_task.py --config-name gene_finding embedder=resnetlm                                                         
Run experiment                                                                                                                                                                                      
[2023-10-19 18:30:35,520][HYDRA] Joblib.Parallel(n_jobs=1,backend=loky,prefer=processes,require=None,verbose=0,timeout=None,pre_dispatch=2*n_jobs,batch_size=auto,temp_folder=None,max_nbytes=None,mmap_mode=r) is launching 1 jobs       
[2023-10-19 18:30:35,520][HYDRA] Launching jobs, sweep output dir : multirun/2023-10-19/18-30-35
[2023-10-19 18:30:35,520][HYDRA]        #0 : embedder=resnetlm                                                                                                                                      
output_dir ./downstream_tasks/gene_finding/resnetlm/                                                                                                                                                
device cuda                                                                                                                                                                                         
{'_target_': 'bend.models.downstream.CNN', 'input_size': '${datadims.${embedder}}', 'output_size': '${datadims.${task}}', 'hidden_size': 64, 'kernel_size': 3}                                      
CNN(                                                                                                                                                                                                
  (onehot_embedding): OneHotEmbedding(hidden_size=256)                                                                                                                                              
  (conv1): Sequential(                                                                                                                                                                              
    (0): TransposeLayer()                                                                                                                                                                           
    (1): Conv1d(256, 64, kernel_size=(3,), stride=(1,), padding=(1,))                                                                                                                               
    (2): TransposeLayer()                                                                                                                                                                           
    (3): GELU(approximate='none')                                                                                                                                                                   
  )                                                                                                                                                                                                 
  (conv2): Sequential(                                                                                                                                                                              
    (0): TransposeLayer()                                                                                                                                                                           
    (1): Conv1d(64, 64, kernel_size=(3,), stride=(1,), padding=(1,))                                                                                                                                
    (2): TransposeLayer()                                                                                                                                                                           
    (3): GELU(approximate='none')                                                                                                                                                                   
  )                                                                                                                                                                                                 
  (linear): Sequential(                                                                                                                                                                             
    (0): Linear(in_features=64, out_features=9, bias=True)                                                                                                                                          
  )                                                                                                                                                                                                 
  (softmax): Softmax(dim=-1)                                                                                                                                                                        
  (softplus): Softplus(beta=1, threshold=20)
  (sigmoid): Sigmoid()
)
Use cross_entropy loss function
Training
Not looking for existing checkpoints, starting from scratch.
3it [00:10,  3.66s/it]
Error executing job with overrides: ['embedder=resnetlm']
Traceback (most recent call last):
  File "/scratch/users/gbenegas/projects/BEND/scripts/train_on_task.py", line 83, in run_experiment
    trainer.train(train_loader, val_loader, cfg.params.epochs, cfg.params.load_checkpoint)
  File "/scratch/users/gbenegas/projects/BEND/bend/utils/task_trainer.py", line 393, in train
    train_loss = self.train_epoch(train_loader)
  File "/scratch/users/gbenegas/projects/BEND/bend/utils/task_trainer.py", line 353, in train_epoch
    train_loss += self.train_step(batch, idx = idx)
  File "/scratch/users/gbenegas/projects/BEND/bend/utils/task_trainer.py", line 429, in train_step 
    loss = self.criterion(output, target.to(self.device).long())
  File "/scratch/users/gbenegas/software/mambaforge/envs/BEND/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/scratch/users/gbenegas/software/mambaforge/envs/BEND/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/users/gbenegas/projects/BEND/bend/utils/task_trainer.py", line 60, in forward
    return self.criterion(pred.permute(0, 2, 1), target)
  File "/scratch/users/gbenegas/software/mambaforge/envs/BEND/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/scratch/users/gbenegas/software/mambaforge/envs/BEND/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/users/gbenegas/software/mambaforge/envs/BEND/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1179, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/scratch/users/gbenegas/software/mambaforge/envs/BEND/lib/python3.10/site-packages/torch/nn/functional.py", line 3053, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: Expected target size [64, 12728], got [64, 12732]

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

frederikkemarin commented 11 months ago

We are glad that you're finding the repo useful.

Thank you for pointing out the bug. It has been fixed, so you should now be able to run the gene_finding task after recomputing the embeddings.
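
For readers who hit the same message: in the traceback above, the loss is called as `self.criterion(pred.permute(0, 2, 1), target)`, and `F.cross_entropy` derives the "expected target size" from the prediction tensor. The predictions therefore covered 12728 positions per sequence while the labels covered 12732, a gap of 4. The thread does not describe the actual fix. As a minimal sketch only, a check like the one below could report such a mismatch in readable form before the loss is computed. The helper name `check_token_alignment` is hypothetical and not part of BEND, and it works on plain shape tuples so that it runs without PyTorch.

```python
def check_token_alignment(pred_shape, target_shape):
    """Raise a readable error if predictions and labels differ in length.

    pred_shape:   (batch, length, classes), the model output before permute(0, 2, 1)
    target_shape: (batch, length), one integer label per position

    Hypothetical helper for illustration; not part of BEND.
    """
    if len(pred_shape) != 3 or len(target_shape) != 2:
        raise ValueError(
            f"expected 3-D predictions and 2-D labels, got {pred_shape} and {target_shape}"
        )
    batch, length, _ = pred_shape
    if tuple(target_shape) != (batch, length):
        raise ValueError(
            f"predictions cover {batch} x {length} positions, "
            f"labels cover {target_shape[0]} x {target_shape[1]} "
            f"(length difference: {target_shape[1] - length})"
        )


# Shapes taken from the traceback: 64 sequences, 9 output classes.
message = None
try:
    check_token_alignment((64, 12728, 9), (64, 12732))
except ValueError as err:
    message = str(err)
    print(message)
```

With tensors, the same check could be fed `tuple(pred.shape)` and `tuple(target.shape)` just before the loss call.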

We are working on a script to download the data more easily.

gonzalobenegas commented 11 months ago

Thanks for your quick reply.

I'm now getting another error, which seems to be in the validation loop.

(BEND) gbenegas@luthien:/scratch/users/gbenegas/projects/BEND$ python scripts/train_on_task.py --config-name gene_finding embedder=resnetlm                                                        
Run experiment
[2023-10-20 10:38:07,806][HYDRA] Joblib.Parallel(n_jobs=1,backend=loky,prefer=processes,require=None,verbose=0,timeout=None,pre_dispatch=2*n_jobs,batch_size=auto,temp_folder=None,max_nbytes=None,mmap_mode=r) is launching 1 jobs
[2023-10-20 10:38:07,806][HYDRA] Launching jobs, sweep output dir : multirun/2023-10-20/10-38-07
[2023-10-20 10:38:07,806][HYDRA]        #0 : embedder=resnetlm
output_dir ./downstream_tasks/gene_finding/resnetlm/
device cuda
{'_target_': 'bend.models.downstream.CNN', 'input_size': '${datadims.${embedder}}', 'output_size': '${datadims.${task}}', 'hidden_size': 64, 'kernel_size': 3}
CNN(
  (onehot_embedding): OneHotEmbedding(hidden_size=256)
  (conv1): Sequential(
    (0): TransposeLayer()
    (1): Conv1d(256, 64, kernel_size=(3,), stride=(1,), padding=(1,))
    (2): TransposeLayer()
    (3): GELU(approximate='none')
  )
  (conv2): Sequential(
    (0): TransposeLayer()
    (1): Conv1d(64, 64, kernel_size=(3,), stride=(1,), padding=(1,))
    (2): TransposeLayer()
    (3): GELU(approximate='none')
  )
  (linear): Sequential(
    (0): Linear(in_features=64, out_features=9, bias=True)
  )
  (softmax): Softmax(dim=-1)
  (softplus): Softplus(beta=1, threshold=20)
  (sigmoid): Sigmoid()
)
Use cross_entropy loss function
Training
Not looking for existing checkpoints, starting from scratch.
75it [08:21,  6.69s/it]
Error executing job with overrides: ['embedder=resnetlm']
Traceback (most recent call last):
  File "/scratch/users/gbenegas/projects/BEND/scripts/train_on_task.py", line 83, in run_experiment
    trainer.train(train_loader, val_loader, cfg.params.epochs, cfg.params.load_checkpoint)
  File "/scratch/users/gbenegas/projects/BEND/bend/utils/task_trainer.py", line 394, in train
    val_loss, val_metric = self.validate(val_loader)
  File "/scratch/users/gbenegas/projects/BEND/bend/utils/task_trainer.py", line 472, in validate
    metric = self._calculate_metric(torch.cat(targets_all),
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 12949 but got size 12997 for tensor number 1 in the list.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

frederikkemarin commented 11 months ago

Thank you for your patience. This bug has now been fixed as well.
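
For context on the second error: `torch.cat` concatenates along dimension 0 by default and requires all other dimensions to match, so it fails when two validation batches have different sequence lengths (12949 and 12997 in the log above). The thread does not say how this was fixed. One common way to make per-batch results concatenable is to flatten each batch down to its valid positions first. The sketch below shows that idea with plain Python lists so it runs without PyTorch. The helper name `collect_valid_positions` is hypothetical, and the padding label `-100` is an assumption: it is PyTorch's default `ignore_index` for `cross_entropy`, but whether BEND pads labels with it is not stated here.

```python
PAD_LABEL = -100  # assumption: label value used for padded positions


def collect_valid_positions(target_batches, pred_batches, pad_label=PAD_LABEL):
    """Flatten batches of different widths into two aligned flat lists.

    Each batch is a list of rows; rows within a batch share one length,
    but different batches may be padded to different lengths.
    Hypothetical helper for illustration; not part of BEND.
    """
    flat_targets, flat_preds = [], []
    for targets, preds in zip(target_batches, pred_batches):
        for target_row, pred_row in zip(targets, preds):
            if len(target_row) != len(pred_row):
                raise ValueError("target and prediction rows differ in length")
            for label, pred in zip(target_row, pred_row):
                if label != pad_label:  # skip padded positions
                    flat_targets.append(label)
                    flat_preds.append(pred)
    return flat_targets, flat_preds


# Two toy batches padded to different widths (3 and 5).
target_batches = [[[0, 1, -100], [2, 2, 2]], [[1, 1, 0, -100, -100]]]
pred_batches = [[[0, 1, 0], [2, 1, 2]], [[1, 0, 0, 3, 3]]]

flat_targets, flat_preds = collect_valid_positions(target_batches, pred_batches)
accuracy = sum(t == p for t, p in zip(flat_targets, flat_preds)) / len(flat_targets)
print(len(flat_targets), accuracy)
```

Because padded positions are dropped before the lists are joined, the batch width no longer matters when the metric is computed.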

gonzalobenegas commented 11 months ago

Thank you so much! I was able to run it properly.

frederikkemarin / BEND

Error with train_on_task.py #44