deepset-ai / FARM

🏡 Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.
https://farm.deepset.ai
Apache License 2.0

Error while using Farm 0.7.1, index is out of bounds #806

Closed aloizel closed 3 years ago

aloizel commented 3 years ago

Hi guys, I'm using FARM 0.7.1 on a Databricks cluster, on a CPU instance (I know a GPU would be better for this, but we don't have one at the moment), and I'm running into an error. I don't know whether the problem comes from FARM or from torch.

My problem looks like this torch issue: https://github.com/pytorch/pytorch/issues/15508, but that one should already be fixed.

I'm on torch 1.7.1 because FARM doesn't allow me to use a more recent version.

When I launch a training run, I get this error message:

    232         infer_model = Inferencer(
    233             processor=processor,

/local_disk0/.ephemeral_nfs/envs/pythonEnv-0dc05d70-b778-450c-862e-8696dd8c83c5/lib/python3.8/site-packages/farm/train.py in train(self)
    299                 # Forward & backward pass through model
    300                 logits = self.model.forward(**batch)
--> 301                 per_sample_loss = self.model.logits_to_loss(logits=logits, global_step=self.global_step, **batch)
    302                 loss = self.backward_propagate(per_sample_loss, step)
    303 

/local_disk0/.ephemeral_nfs/envs/pythonEnv-0dc05d70-b778-450c-862e-8696dd8c83c5/lib/python3.8/site-packages/farm/modeling/adaptive_model.py in logits_to_loss(self, logits, global_step, **kwargs)
    379         :return loss: torch.tensor that is the per sample loss (len: batch_size)
    380         """
--> 381         all_losses = self.logits_to_loss_per_head(logits, **kwargs)
    382         # This aggregates the loss per sample across multiple prediction heads
    383         # Default is sum(), but you can configure any fn that takes [Tensor, Tensor ...] and returns [Tensor]

/local_disk0/.ephemeral_nfs/envs/pythonEnv-0dc05d70-b778-450c-862e-8696dd8c83c5/lib/python3.8/site-packages/farm/modeling/adaptive_model.py in logits_to_loss_per_head(self, logits, **kwargs)
    363                 " with the processor through either 'model.connect_heads_with_processor(processor.tasks)'"
    364                 " or by passing the processor to the Adaptive Model?")
--> 365             all_losses.append(head.logits_to_loss(logits=logits_for_one_head, **kwargs))
    366         return all_losses
    367 

/local_disk0/.ephemeral_nfs/envs/pythonEnv-0dc05d70-b778-450c-862e-8696dd8c83c5/lib/python3.8/site-packages/farm/modeling/prediction_head.py in logits_to_loss(self, logits, **kwargs)
    356         label_ids = kwargs.get(self.label_tensor_name)
    357         label_ids = label_ids
--> 358         return self.loss_fct(logits, label_ids.view(-1))
    359 
    360     def logits_to_probs(self, logits, return_class_probs, **kwargs):

/local_disk0/.ephemeral_nfs/envs/pythonEnv-0dc05d70-b778-450c-862e-8696dd8c83c5/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/local_disk0/.ephemeral_nfs/envs/pythonEnv-0dc05d70-b778-450c-862e-8696dd8c83c5/lib/python3.8/site-packages/torch/nn/modules/loss.py in forward(self, input, target)
    959 
    960     def forward(self, input: Tensor, target: Tensor) -> Tensor:
--> 961         return F.cross_entropy(input, target, weight=self.weight,
    962                                ignore_index=self.ignore_index, reduction=self.reduction)
    963 

/local_disk0/.ephemeral_nfs/envs/pythonEnv-0dc05d70-b778-450c-862e-8696dd8c83c5/lib/python3.8/site-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction)
   2466     if size_average is not None or reduce is not None:
   2467         reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 2468     return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
   2469 
   2470 

/local_disk0/.ephemeral_nfs/envs/pythonEnv-0dc05d70-b778-450c-862e-8696dd8c83c5/lib/python3.8/site-packages/torch/nn/functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
   2262                          .format(input.size(0), target.size(0)))
   2263     if dim == 2:
-> 2264         ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
   2265     elif dim == 4:
   2266         ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)

IndexError: Target 170 is out of bounds.

The target is not always 170; it depends on the run and on the model I choose to train.
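
For reference, the same message can be reproduced with plain torch whenever a label id is greater than or equal to the number of classes in the logits (a minimal sketch, independent of my actual code):

    import torch
    import torch.nn as nn

    # Logits for a batch of 2 samples over 5 classes: valid label ids are 0..4.
    logits = torch.randn(2, 5)

    # One label id (170) is outside that range, so on CPU this raises
    # "IndexError: Target 170 is out of bounds."
    labels = torch.tensor([1, 170])

    nn.CrossEntropyLoss()(logits, labels)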

Have you run into this problem before?

If you need anything more to answer my question, just ask.

Databricks runtime version: 8.2 (includes Apache Spark 3.1.1, Scala 2.12), worker type: Standard_DS3_v2

Thanks

Timoeller commented 3 years ago

Hey @aloizel, what are you using FARM for inside Databricks (a notebook)? Sounds amazing.

We recently tested torch 1.8.1 in https://github.com/deepset-ai/FARM/pull/767 and also very recently released a new FARM version: https://github.com/deepset-ai/FARM/releases/tag/v0.8.0

At first glance it looks more like a torch issue, but let's figure this out together. Could you try updating and report back?
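
A quick way to confirm which versions the environment actually resolved after upgrading (assuming a Python 3.8+ environment, as the paths in your traceback suggest):

    from importlib.metadata import version  # stdlib in Python 3.8+

    # The PyPI distribution names are "farm" and "torch".
    print("farm:", version("farm"))
    print("torch:", version("torch"))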

aloizel commented 3 years ago

Hi! Thanks for the answer.

I found the problem; it wasn't due to FARM, so I'm closing the issue.

And yes, we are using Databricks, but we use a library to launch our code as a job rather than as a notebook; it lets us schedule our training runs and follow our data life cycle.
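
For anyone who lands on this issue with the same traceback: a common way to hit this exact IndexError in FARM is a prediction head built with fewer classes than the label ids occurring in the data. The sketch below follows FARM's doc_classification example and only illustrates that mismatch; it is an assumption, not necessarily the root cause here (model name, data_dir, and labels are hypothetical):

    from farm.modeling.tokenization import Tokenizer
    from farm.data_handler.processor import TextClassificationProcessor
    from farm.modeling.prediction_head import TextClassificationHead

    tokenizer = Tokenizer.load(pretrained_model_name_or_path="bert-base-uncased")

    # label_list must cover every label that occurs in the data; if a sample maps to
    # a label id >= num_labels, loss_fct(logits, label_ids) fails exactly as above.
    label_list = ["negative", "neutral", "positive"]  # illustrative labels

    processor = TextClassificationProcessor(
        tokenizer=tokenizer,
        max_seq_len=128,
        data_dir="data",            # hypothetical data directory
        label_list=label_list,
        metric="acc",
        label_column_name="label",
    )

    # Keep the head's output dimension in sync with the label list.
    prediction_head = TextClassificationHead(num_labels=len(label_list))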