georgian-io / Multimodal-Toolkit

Multimodal model for text and tabular data, using HuggingFace transformers as the building block for the text data
https://multimodal-toolkit.readthedocs.io
Apache License 2.0

feat: add support for longformers #10

Closed: sidharrth2002 closed this 1 year ago

sidharrth2002 commented 2 years ago

Added support for Longformer models through the LongformerWithTabular class.

akashsaravanan-georgian commented 1 year ago

Hi @sidharrth2002, thank you for the PR and apologies for the delay in responding. Unfortunately, it seems there have been some changes to the Transformers library in the interim, so the code doesn't work right now. Please feel free to update the code and I'll be happy to merge the changes :)

jtfields commented 1 year ago

I am interested in using the Multimodal Toolkit with Longformer. Has support been added for this?

akashsaravanan-georgian commented 1 year ago

Hi @jtfields, at the moment we do not support Longformers. This PR does have some base code that could be used but is currently outdated.

jtfields commented 1 year ago

I have forked the code from sidharrth2002 with support for longformers and am currently testing with the text_w_tabular_classification.ipynb notebook. I changed the model_args to:

    model_args = ModelArguments(model_name_or_path='allenai/longformer-base-4096')

This works but I receive an error in the next step, "Load dataset csvs to torch datasets". Here, I receive this error:

    KeyError                                  Traceback (most recent call last)
    /usr/local/lib/python3.9/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
       3802                 return self._engine.get_loc(casted_key)
       3803             except KeyError as err:
    -> 3804                 raise KeyError(key) from err
       3805             except TypeError:
       3806                 # If we have a listlike key, _check_indexing_error will raise

    KeyError: 'explanation_practice'

Any suggestions on how to fix this error?

akashsaravanan-georgian commented 1 year ago

Hm, the error you're receiving makes me think there's an issue with the columns you may have specified. Are you running this on one of the datasets mentioned in the repo or with a custom dataset?

In either case, the error seems to be that the column explanation_practice is missing. Could you check if it's specified anywhere in the definition of text_cols, cat_cols, numerical_cols or column_info_dict?
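
A quick way to check (just a sketch: column_info_dict and train.csv are the names the example notebook uses; adjust to your setup):

    import pandas as pd

    df = pd.read_csv('train.csv')  # the CSV the data loader is pointed at
    specified = (
        column_info_dict['text_cols']
        + column_info_dict['cat_cols']
        + column_info_dict['num_cols']
        + [column_info_dict['label_col']]
    )  # column_info_dict as defined earlier in the notebook
    missing = [col for col in specified if col not in df.columns]
    print('Columns missing from the CSV:', missing)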

jtfields commented 1 year ago

I'm testing with the Womens_Clothing_E-Commerce_Reviews dataset before using it on a proprietary dataset. The Women's Clothing dataset does not contain a column named 'explanation_practice'. Here are the args:

data_args:

    MultimodalDataTrainingArguments(
        data_path='.',
        column_info_path=None,
        column_info={
            'text_cols': ['Title', 'Review Text'],
            'num_cols': ['Rating', 'Age', 'Positive Feedback Count'],
            'cat_cols': ['Clothing ID', 'Division Name', 'Department Name', 'Class Name'],
            'label_col': 'Recommended IND',
            'label_list': ['Not Recommended', 'Recommended'],
        },
        categorical_encode_type='ohe',
        numerical_transformer_method='yeo_johnson',
        task='classification',
        mlp_division=4,
        combine_feat_method='individual_mlps_on_cat_and_numerical_feats_then_concat',
        mlp_dropout=0.1,
        numerical_bn=True,
        use_simple_classifier=True,
        mlp_act='relu',
        gating_beta=0.2,
    )

model_args:

    ModelArguments(
        model_name_or_path='allenai/longformer-base-4096',
        config_name=None,
        tokenizer_name=None,
        cache_dir=None,
    )

training_args:

    TrainingArguments(
        output_dir='./logs/model_name', overwrite_output_dir=True,
        do_train=True, do_eval=True, do_predict=False, model_parallel=False,
        evaluation_strategy=<EvaluationStrategy.EPOCH: 'epoch'>, prediction_loss_only=False,
        per_device_train_batch_size=32, per_device_eval_batch_size=8,
        per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None,
        gradient_accumulation_steps=1, eval_accumulation_steps=None,
        learning_rate=5e-05, weight_decay=0.0,
        adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0,
        num_train_epochs=1, max_steps=-1, warmup_steps=0,
        logging_dir='./logs/runs', logging_first_step=False, logging_steps=25,
        save_steps=500, save_total_limit=None, no_cuda=False, seed=42,
        fp16=False, fp16_opt_level='O1', local_rank=-1,
        tpu_num_cores=None, tpu_metrics_debug=False, debug=False,
        dataloader_drop_last=False, eval_steps=250, dataloader_num_workers=0,
        past_index=-1, run_name='./logs/model_name', disable_tqdm=False,
        remove_unused_columns=True, label_names=None,
        load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None,
        ignore_data_skip=False, fp16_backend='auto', sharded_ddp=False,
    )

akashsaravanan-georgian commented 1 year ago

That's very odd. Could you share the entire code you used for the women's clothing dataset?

jtfields commented 1 year ago

I loaded the modified package into Colab using pip install git+https://github.com/sidharrth2002/Multimodal-Toolkit. However, this version is 36 commits behind georgian-io:master. Should I fork the current version of Multimodal-Toolkit and update it with sidharrth's modifications, or is there a more efficient way to do this?

jtfields commented 1 year ago

I forked the current version of Multimodal-Toolkit and modified the code for Longformer support at https://github.com/jtfields/Multimodal-Toolkit-Longformer. I'm now receiving an error earlier in the code: "NameError: name 'add_start_docstrings' is not defined". This occurs when I execute this section:

    from dataclasses import dataclass, field
    import json
    import logging
    import os
    from typing import Optional

    import numpy as np
    import pandas as pd
    from transformers import (
        AutoTokenizer,
        AutoConfig,
        Trainer,
        EvalPrediction,
        set_seed
    )
    from transformers.training_args import TrainingArguments

    from multimodal_transformers.data import load_data_from_folder
    from multimodal_transformers.model import TabularConfig
    from multimodal_transformers.model import AutoModelWithTabular

    logging.basicConfig(level=logging.INFO)
    os.environ['COMET_MODE'] = 'DISABLED'

akashsaravanan-georgian commented 1 year ago

Hey @jtfields, you're on the right track with forking the current version of the repo and adding in the changes from this PR. All that's left is to update it so it matches the current version of transformers. You're receiving this error because add_start_docstrings is no longer in use; you likely need to use @add_start_docstrings_to_model_forward instead. Refer to how the other models are implemented in the same file to get the hang of things, and do the same for the other files changed in this PR. I don't think you'll need many other changes to get Longformers running.
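
Roughly, the swap looks like this (a sketch, not the exact file contents; the class body and docstring text below are illustrative):

    # On recent transformers versions the decorator lives in transformers.utils;
    # on older ones it was transformers.file_utils.
    from transformers.utils import add_start_docstrings_to_model_forward

    # Illustrative docstring template, standing in for the one in tabular_transformers.py.
    LONGFORMER_INPUTS_DOCSTRING = r"""
        Args:
            input_ids (torch.LongTensor): indices of input sequence tokens in the vocabulary.
    """

    class LongformerWithTabular:  # stand-in; the real class subclasses the Longformer base model
        # Decorate forward() itself, instead of decorating the class with add_start_docstrings.
        @add_start_docstrings_to_model_forward(LONGFORMER_INPUTS_DOCSTRING)
        def forward(self, input_ids=None, attention_mask=None, **kwargs):
            """Runs the text model and combines its output with tabular features."""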

jtfields commented 1 year ago

I have resolved the add_start_docstrings issue and now have a new error in tabular_combiner.py. Below are the error messages when using torch.cat and torch.stack in tabular_combiner.py for lines 426-428.

Colab code (Women's E-Commerce Clothing Reviews):

    %%time
    trainer.train()

Output with torch.cat:

    Initializing global attention on CLS token...

    RuntimeError                              Traceback (most recent call last)
    /usr/local/lib/python3.10/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
       1660                 self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
       1661             )
    -> 1662             return inner_training_loop(
       1663                 args=args,
       1664                 resume_from_checkpoint=resume_from_checkpoint,

    6 frames
    /usr/local/lib/python3.10/dist-packages/multimodal_transformers/model/tabular_combiner.py in forward(self, text_feats, cat_feats, numerical_feats)
        424         if numerical_feats.shape[1] != 0:
        425             numerical_feats = self.num_mlp(numerical_feats)
    --> 426         combined_feats = torch.cat((text_feats, cat_feats, numerical_feats), dim=1)
        427         #combined_feats = torch.cat((text_feats, cat_feats, numerical_feats), dim=-1)
        428         #combined_feats = torch.stack((text_feats, cat_feats, numerical_feats), dim=1)

    RuntimeError: Tensors must have same number of dimensions: got 3 and 2

Output with torch.stack:

    Initializing global attention on CLS token...

    RuntimeError                              Traceback (most recent call last)
    /usr/local/lib/python3.10/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
       1660                 self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
       1661             )
    -> 1662             return inner_training_loop(
       1663                 args=args,
       1664                 resume_from_checkpoint=resume_from_checkpoint,

    6 frames
    /usr/local/lib/python3.10/dist-packages/multimodal_transformers/model/tabular_combiner.py in forward(self, text_feats, cat_feats, numerical_feats)
        426         #combined_feats = torch.cat((text_feats, cat_feats, numerical_feats), dim=1)
        427         #combined_feats = torch.cat((text_feats, cat_feats, numerical_feats), dim=-1)
    --> 428         combined_feats = torch.stack((text_feats, cat_feats, numerical_feats), dim=1)
        429         elif (
        430             self.combine_feat_method

    RuntimeError: stack expects each tensor to be equal size, but got [32, 2, 768] at entry 0 and [32, 43] at entry 1
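For reference, the mismatch can be reproduced in isolation with dummy tensors of the shapes reported in the errors: the text features are still 3-D (word-level) while the tabular features are 2-D.

    import torch

    text_feats = torch.randn(32, 2, 768)  # (batch, seq_len, hidden): word-level text output
    cat_feats = torch.randn(32, 43)       # (batch, n_features): tabular features

    try:
        torch.cat((text_feats, cat_feats), dim=1)
    except RuntimeError as err:
        print(err)  # Tensors must have same number of dimensions: got 3 and 2
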
akashsaravanan-georgian commented 1 year ago

Hey @jtfields, I haven't had a chance to look at your code, but judging by the error, it sounds like Longformers might need an additional step. Specifically, your forward() method returns a different shape than expected.

Looking at this in particular:

    RuntimeError: stack expects each tensor to be equal size, but got [32, 2, 768] at entry 0 and [32, 43] at entry 1

It looks like your outputs have the shape (batch_size, sequence_length, embedding_dim). This corresponds to having an embedding for every word in the output, i.e., word embeddings. However, what we want is a sentence embedding, where we have one embedding for every sentence (or paragraph). So instead, the shape we want is (batch_size, embedding_dim).

Unfortunately, I don't have a ready answer on how to get that. Different models have different best practices: BERT-based models use the embedding of the [CLS] token to get sentence embeddings, while others such as XLM use an additional layer to do this task (see the sequence_summary bits in multimodal_transformers/model/tabular_transformers.py). I'm not familiar with Longformers so I can't tell you exactly what to do, but I'm sure there's a standard method people use for it.
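
To illustrate the shape you're after, here's a toy sketch of CLS-style pooling (the dummy tensor stands in for the model output; whether this pooling is right for Longformer is what you'd need to verify):

    import torch

    sequence_output = torch.randn(32, 2, 768)  # stand-in for word-level model output
    text_feats = sequence_output[:, 0, :]      # keep only the first (<s>/[CLS]) token
    print(text_feats.shape)                    # torch.Size([32, 768]): one embedding per example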

jtfields commented 1 year ago

Thank you for the feedback on sentence- vs. word-level embeddings. In the HuggingFace file transformers/modeling_longformer.py, there is a class LongformerClassificationHead. Do you recommend changing LongformerForSequenceClassification to LongformerClassificationHead? According to the notes, this is the head for sentence-level classification tasks. Or should I focus on a different tokenizing method?

    class LongformerClassificationHead(nn.Module):
        """Head for sentence-level classification tasks."""

        def __init__(self, config):
            super().__init__()
            self.dense = nn.Linear(config.hidden_size, config.hidden_size)
            self.dropout = nn.Dropout(config.hidden_dropout_prob)
            self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

        def forward(self, hidden_states, **kwargs):
            hidden_states = hidden_states[:, 0, :]  # take <s> token (equiv. to [CLS])
            hidden_states = self.dropout(hidden_states)
            hidden_states = self.dense(hidden_states)
            hidden_states = torch.tanh(hidden_states)
            hidden_states = self.dropout(hidden_states)
            output = self.out_proj(hidden_states)
            return output

akashsaravanan-georgian commented 1 year ago

I believe LongformerClassificationHead is part of the LongformerForSequenceClassification model; specifically, it is the head that performs the final classification. Looking at the HF documentation, it seems Longformer is based on Roberta, so I'd suggest imitating what Roberta does in this codebase and seeing if that works.
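
As a minimal sketch of that Roberta-style pooling with an actual Longformer (global attention on the first token, which the Longformer docs recommend for classification; treat this as an illustration rather than the toolkit's exact code):

    import torch
    from transformers import AutoTokenizer, LongformerModel

    tokenizer = AutoTokenizer.from_pretrained('allenai/longformer-base-4096')
    model = LongformerModel.from_pretrained('allenai/longformer-base-4096')

    enc = tokenizer(['a long product review ...'], return_tensors='pt')
    global_attention_mask = torch.zeros_like(enc['input_ids'])
    global_attention_mask[:, 0] = 1  # global attention on the <s> token

    with torch.no_grad():
        out = model(**enc, global_attention_mask=global_attention_mask)

    # Roberta-style sentence embedding: the first token's hidden state.
    text_feats = out.last_hidden_state[:, 0, :]
    print(text_feats.shape)  # torch.Size([1, 768]): (batch, hidden), ready for the tabular combiner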

jtfields commented 1 year ago

Thanks for the suggestion to mimic the Roberta code. I did this in tabular_transformers.py and it now completes one epoch but fails in NumPy's fromnumeric.py.

Thank you for all your help. I feel that we are very close to making longformers work!

Here is the new error...

    [588/588 12:51, Epoch 1/1]
    Epoch  Training Loss  Validation Loss
    [294/294 00:25]

    /usr/local/lib/python3.10/dist-packages/numpy/core/fromnumeric.py:43: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
      result = getattr(asarray(obj), method)(*args, **kwds)

    AxisError                                 Traceback (most recent call last)
    /usr/local/lib/python3.10/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
       1660                 self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
       1661             )
    -> 1662             return inner_training_loop(
       1663                 args=args,
       1664                 resume_from_checkpoint=resume_from_checkpoint,

    8 frames
    /usr/local/lib/python3.10/dist-packages/numpy/core/fromnumeric.py in _wrapit(obj, method, *args, **kwds)
         41     except AttributeError:
         42         wrap = None
    ---> 43     result = getattr(asarray(obj), method)(*args, **kwds)
         44     if wrap:
         45         if not isinstance(result, mu.ndarray):

    AxisError: axis 1 is out of bounds for array of dimension 1

jtfields commented 1 year ago

I see that someone has logged a new issue, "Please check the colab notebook #43", which reports the same error I am receiving.

jtfields commented 1 year ago

Some StackOverflow posts suggest changing axis=1 to axis=0 in this line of code to correct the error:

    pred_labels = np.argmax(predictions, axis=1)

I tried this with the E-commerce notebook (with my longformer changes) and received this new error which I believe is occurring in def calc_classification_metrics(p: EvalPrediction):

    ValueError                                Traceback (most recent call last)
    /usr/local/lib/python3.10/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
       1660                 self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
       1661             )
    -> 1662             return inner_training_loop(
       1663                 args=args,
       1664                 resume_from_checkpoint=resume_from_checkpoint,

    8 frames
    /usr/local/lib/python3.10/dist-packages/numpy/core/fromnumeric.py in _wrapit(obj, method, *args, **kwds)
         41     except AttributeError:
         42         wrap = None
    ---> 43     result = getattr(asarray(obj), method)(*args, **kwds)
         44     if wrap:
         45         if not isinstance(result, mu.ndarray):

    ValueError: could not broadcast input array from shape (2349,1388) into shape (2349,)
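
If it helps, here is a defensive version of that argmax step. This is a guess at the cause, in line with Issue #41: the Trainer may hand back predictions as a tuple of arrays, so the logits need to be pulled out before the argmax.

    import numpy as np

    def pred_label_ids(predictions):
        # Trainer can return a tuple (logits, extra_outputs); keep only the logits
        # before taking the argmax over the label axis.
        logits = predictions[0] if isinstance(predictions, (tuple, list)) else predictions
        return np.argmax(logits, axis=1)

    # Toy check: a (logits, extra) tuple with 4 examples and 2 labels.
    print(pred_label_ids((np.random.randn(4, 2), np.random.randn(4, 8))))
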
jtfields commented 1 year ago

I think this might be related to Issue #41. Testing now...

akashsaravanan-georgian commented 1 year ago

I think you're right in that it's related to the previous issue. Let me know how it goes!

jtfields commented 1 year ago

The Multimodal-Toolkit is now working with Longformer! Here are the results from the E-Commerce notebook:

[Screenshot: E-Commerce notebook results, 2023-05-18]

akashsaravanan-georgian commented 1 year ago

Hey @jtfields, that's great to hear!! Thank you for your hard work. I'd really appreciate it if you could make a new pull request with the longformer changes.

jtfields commented 1 year ago

Do you want me to open a new pull request or put the Longformer changes in the existing pull request #10?

akashsaravanan-georgian commented 1 year ago

I think a new PR would be better since you've made significant changes!