Closed: sidharrth2002 closed this pull request 1 year ago
Hi @sidharrth2002, Thank you for the PR and apologies for the delay in responding. Unfortunately it seems like there have been some changes to the Transformers library in the time in-between. Thus the code doesn't work right now. Please feel free to update the code and I'll be happy to merge the changes :)
I am interested in using the Multimodal Toolkit with Longformer. Has support been added for this?
Hi @jtfields, at the moment we do not support Longformers. This PR does have some base code that could be used but is currently outdated.
I have forked the code from sidharrth2002 with support for longformers and am currently testing with the text_w_tabular_classification.ipynb notebook. I changed the model_args to:
```python
model_args = ModelArguments(model_name_or_path='allenai/longformer-base-4096')
```
This works, but I receive an error in the next step, "Load dataset csvs to torch datasets":

```
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3802             return self._engine.get_loc(casted_key)
   3803         except KeyError as err:
-> 3804             raise KeyError(key) from err
   3805         except TypeError:
   3806             # If we have a listlike key, _check_indexing_error will raise

KeyError: 'explanation_practice'
```
Any suggestion for how to fix this error?
Hm, the error you're receiving makes me think there's an issue with the columns you may have specified. Are you running this on one of the datasets mentioned in the repo or with a custom dataset?
In either case, the error seems to be that the column `explanation_practice` is missing. Could you check whether it's specified anywhere in the definition of `text_cols`, `cat_cols`, `numerical_cols`, or `column_info_dict`?
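If it helps, a quick standalone way to narrow this down is to compare the columns you configured against the CSV itself. A minimal sketch (the file path and column list below are placeholders for your own setup, not anything from the toolkit):

```python
import pandas as pd

# Placeholder path and column names -- substitute your own.
df = pd.read_csv("train.csv")
configured_cols = ["Title", "Review Text", "Rating", "Age", "Clothing ID"]

# Any configured column that isn't in the CSV will trigger a KeyError later.
missing = [c for c in configured_cols if c not in df.columns]
print("Columns missing from the CSV:", missing)
```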
I'm testing with the Womens_Clothing_E-Commerce_Reviews dataset before using it on a proprietary dataset. The Women's Clothing dataset does not contain a column 'explanation_practice'. Here are the args.
```
data_args MultimodalDataTrainingArguments(data_path='.', column_info_path=None, column_info={'text_cols': ['Title', 'Review Text'], 'num_cols': ['Rating', 'Age', 'Positive Feedback Count'], 'cat_cols': ['Clothing ID', 'Division Name', 'Department Name', 'Class Name'], 'label_col': 'Recommended IND', 'label_list': ['Not Recommended', 'Recommended']}, categorical_encode_type='ohe', numerical_transformer_method='yeo_johnson', task='classification', mlp_division=4, combine_feat_method='individual_mlps_on_cat_and_numerical_feats_then_concat', mlp_dropout=0.1, numerical_bn=True, use_simple_classifier=True, mlp_act='relu', gating_beta=0.2)

model_args ModelArguments(model_name_or_path='allenai/longformer-base-4096', config_name=None, tokenizer_name=None, cache_dir=None)

training_args TrainingArguments(output_dir='./logs/model_name', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, model_parallel=False, evaluation_strategy=<EvaluationStrategy.EPOCH: 'epoch'>, prediction_loss_only=False, per_device_train_batch_size=32, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1, max_steps=-1, warmup_steps=0, logging_dir='./logs/runs', logging_first_step=False, logging_steps=25, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=250, dataloader_num_workers=0, past_index=-1, run_name='./logs/model_name', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fp16_backend='auto', sharded_ddp=False)
```
That's very odd. Could you share the entire code you used for the women's clothing dataset?
I loaded the modified package to Colab using pip install git+https://github.com/sidharrth2002/Multimodal-Toolkit. However, this version is 36 commits behind georgian-io:master. Should I fork the current version of Multimodal Toolkit and update with sidharrth's modifications or is there a more efficient way to do this?
I forked the current version of Multimodal-Toolkit and modified the code for longformer support at https://github.com/jtfields/Multimodal-Toolkit-Longformer. I'm now receiving an error earlier in the code - "NameError: name 'add_start_docstrings' is not defined". This occurs when I execute this section:
```python
from dataclasses import dataclass, field
import json
import logging
import os
from typing import Optional

import numpy as np
import pandas as pd
from transformers import (
    AutoTokenizer,
    AutoConfig,
    Trainer,
    EvalPrediction,
    set_seed
)
from transformers.training_args import TrainingArguments

from multimodal_transformers.data import load_data_from_folder
from multimodal_transformers.model import TabularConfig
from multimodal_transformers.model import AutoModelWithTabular

logging.basicConfig(level=logging.INFO)
os.environ['COMET_MODE'] = 'DISABLED'
```
Hey @jtfields, so you're on the right track with forking the current version of the repo and adding in the changes from this PR. All that's left is to update it so it matches the current version of transformers. You're receiving this error because `add_start_docstrings` is no longer in use. You likely need to use `@add_start_docstrings_to_model_forward` instead. Refer to how the other models are implemented in the same file to get the hang of things. You should be good if you do the same for the other files changed in this PR. I don't think you'll need many other changes to get Longformers running.
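For concreteness, the decorator goes on the model's `forward` method. A rough sketch of the pattern; the import paths and the `LONGFORMER_INPUTS_DOCSTRING` constant are assumptions that depend on your pinned transformers version, so verify them against `modeling_longformer.py` in your install:

```python
from transformers.file_utils import add_start_docstrings_to_model_forward
from transformers.models.longformer.modeling_longformer import (
    LONGFORMER_INPUTS_DOCSTRING,   # assumed constant name; check your version
    LongformerPreTrainedModel,
)


class LongformerWithTabular(LongformerPreTrainedModel):
    @add_start_docstrings_to_model_forward(
        LONGFORMER_INPUTS_DOCSTRING.format("batch_size, sequence_length")
    )
    def forward(self, input_ids=None, attention_mask=None, **kwargs):
        ...  # encoder forward pass + tabular combiner logic goes here
```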
I have resolved the add_start_docstrings issue and now have a new error in tabular_combiner.py. Below are the error messages when using torch.cat and torch.stack in tabular_combiner.py for lines 426-428.
Colab code - Women's Ecommerce Clothing Reviews:

```
%%time
trainer.train()
```
OUTPUT WITH torch.cat
RuntimeError Traceback (most recent call last)
Hey @jtfields, I haven't had a chance to look at your code, but judging by the error, it sounds like Longformers might require an additional step. Specifically, your `forward()` method returns a different shape than expected.
Looking at this in particular:
RuntimeError: stack expects each tensor to be equal size, but got [32, 2, 768] at entry 0 and [32, 43] at entry 1
It looks like your outputs have the shape (batch_size, sequence_length, embedding_dim). This corresponds to having an embedding for every word in the output, i.e., word embeddings. However, what we want is a sentence embedding, where we have one embedding for every sentence (or paragraph). So instead, the shape we want is (batch_size, embedding_dim).

Unfortunately there's no ready answer I have on how to get that. Different models have different best practices. BERT-based models use the embedding of the [CLS] token to get sentence embeddings, while others such as XLM use an additional layer to do this task (see the `sequence_summary` bits in `multimodal_transformers/model/tabular_transformers.py`). I'm not familiar with Longformers, so I can't tell you exactly what to do, but I'm sure there's a standard method people use for it.
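To make the shape issue concrete, here is a small standalone illustration of the first-token pooling that BERT-style models use. It assumes Longformer's `<s>` token (the RoBERTa equivalent of `[CLS]`) is a reasonable sentence representation, which is an assumption rather than something this repo already does:

```python
import torch

batch_size, seq_len, hidden_dim = 32, 512, 768
sequence_output = torch.randn(batch_size, seq_len, hidden_dim)  # word embeddings

# Take the first token's embedding as the sentence embedding, collapsing
# (batch_size, seq_len, hidden_dim) -> (batch_size, hidden_dim), which is the
# shape the tabular combiner expects for the text features.
pooled_output = sequence_output[:, 0, :]
print(pooled_output.shape)  # torch.Size([32, 768])
```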
Thank you for the feedback on sentence- vs word-level embeddings. In the HuggingFace file transformers/modeling_longformer.py, there is a class LongformerClassificationHead. Do you recommend changing LongformerForSequenceClassification to LongformerClassificationHead? According to the notes, this is the head for sentence-level classification tasks. Or should I focus on a different tokenizing method?
```python
class LongformerClassificationHead(nn.Module):
    """Head for sentence-level classification tasks."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, hidden_states, **kwargs):
        hidden_states = hidden_states[:, 0, :]  # take <s> token (equiv. to [CLS])
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.dense(hidden_states)
        hidden_states = torch.tanh(hidden_states)
        hidden_states = self.dropout(hidden_states)
        output = self.out_proj(hidden_states)
        return output
```
I believe `LongformerClassificationHead` is part of the `LongformerForSequenceClassification` model; specifically, it is the head that performs the final classification step. Looking at the HF documentation, it seems like Longformers are based on RoBERTa. I'd suggest imitating what RoBERTa does in this codebase and seeing if that works.
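In case it helps, here is a rough skeleton of what that could look like, loosely modelled on the RoBERTa-with-tabular pattern. The class name, the elided combiner wiring, and the import paths are assumptions, not the repo's actual code:

```python
import torch.nn as nn
from transformers.models.longformer.modeling_longformer import (
    LongformerModel,
    LongformerPreTrainedModel,
)


class LongformerWithTabular(LongformerPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.longformer = LongformerModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        # The tabular combiner/classifier would be built here from
        # config.tabular_config, as in the RoBERTa implementation.

    def forward(self, input_ids=None, attention_mask=None,
                cat_feats=None, numerical_feats=None, labels=None):
        outputs = self.longformer(input_ids, attention_mask=attention_mask)
        sequence_output = outputs[0]           # (batch, seq_len, hidden)
        text_feats = sequence_output[:, 0, :]  # <s> token, RoBERTa-style pooling
        text_feats = self.dropout(text_feats)
        # combined = self.tabular_combiner(text_feats, cat_feats, numerical_feats)
        # loss and logits would then be computed from `combined`.
        return text_feats
```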
Thanks for the suggestion to mimic the RoBERTa code. I did this in `tabular_transformers.py` and it now completes 1 epoch but fails in numpy's `fromnumeric.py`.
Thank you for all your help. I feel that we are very close to making longformers work!
Here is the new error...
AxisError Traceback (most recent call last)
I see that someone has logged a new issue, "Please check the colab notebook #43", which is the same error that I am receiving.
Some StackOverflow posts suggest changing to `axis=0` in this line of code to correct the error: `pred_labels = np.argmax(predictions, axis=1)`
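As a sanity check on the axis choice, here is a toy example (not the notebook's actual predictions) showing how `np.argmax` behaves on a (n_samples, n_classes) array; which axis is correct ultimately depends on the shape `predictions` actually has at that point:

```python
import numpy as np

# Toy logits for 3 samples and 2 classes.
predictions = np.array([[0.2, 0.8],
                        [0.9, 0.1],
                        [0.3, 0.7]])

print(np.argmax(predictions, axis=1))  # [1 0 1] -> one label per sample
print(np.argmax(predictions, axis=0))  # [1 0]   -> one index per class, not labels
```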
ValueError Traceback (most recent call last)
I think this might be related to Issue #41. Testing now...
I think you're right in that it's related to the previous issue. Let me know how it goes!
The Multimodal-Toolkit is now working with Longformer! Here are the results from the E-Commerce notebook:
Hey @jtfields, that's great to hear!! Thank you for your hard work. I'd really appreciate it if you could make a new pull request with the longformer changes.
Do you want me to add a new pull request or put the longformer changes in the existing pull request #10?
I think a new PR would be better since you've made significant changes!
Added support for longformers through the `LongformerWithTabular` class.
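For anyone landing here later, loading the model should mirror the toolkit's existing usage; a minimal sketch, assuming the standard `TabularConfig` / `AutoModelWithTabular` pattern from the README and treating the feature dimensions below as placeholders:

```python
from transformers import AutoConfig, AutoTokenizer
from multimodal_transformers.model import AutoModelWithTabular, TabularConfig

model_name = "allenai/longformer-base-4096"
tokenizer = AutoTokenizer.from_pretrained(model_name)

config = AutoConfig.from_pretrained(model_name)
config.tabular_config = TabularConfig(
    num_labels=2,            # e.g. Recommended / Not Recommended
    cat_feat_dim=43,         # placeholder: width of the one-hot categorical features
    numerical_feat_dim=3,    # placeholder: number of numerical columns
    combine_feat_method="individual_mlps_on_cat_and_numerical_feats_then_concat",
)

model = AutoModelWithTabular.from_pretrained(model_name, config=config)
```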