JohnGiorgi / DeCLUTR

The corresponding code from our paper "DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations". Do not hesitate to open an issue if you run into any trouble!
https://aclanthology.org/2021.acl-long.72/
Apache License 2.0

How to integrate a longer-sequence model like Longformer into the DeCLUTR architecture #244

Closed: kingafy closed this issue 2 years ago

kingafy commented 2 years ago

I am planning to bring a longer-sequence model into DeCLUTR so that the sequence-length limitation no longer applies for similar-context tasks. Are there any limitations to implementing this?

JohnGiorgi commented 2 years ago

The only changes you should have to make are:

  1. Update max_length if you want to train on longer contexts
  2. Update transformer_model to a HF model that supports longer input sequences (e.g. allenai/longformer-base-4096).

Of course I could be forgetting something. Feel free to follow up here if these changes cause an error.
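As a quick sanity check for step 2 (a sketch using the Hugging Face transformers API, not DeCLUTR-specific), you can confirm that the replacement model really does support longer inputs before wiring it into the config:

from transformers import AutoConfig, AutoTokenizer

model_name = "allenai/longformer-base-4096"

# The model config reports how many positions the model was trained with
# (for Longformer-base this is roughly 4096, plus a small offset).
config = AutoConfig.from_pretrained(model_name)
print(config.max_position_embeddings)

# The tokenizer reports the maximum sequence length it will produce by default.
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(tokenizer.model_max_length)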

kingafy commented 2 years ago

I tried, but I am getting an issue while integrating allenai/longformer-large-4096:

FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 28, in fetch
    data.append(next(self.dataset_iter))
  File "/usr/local/lib/python3.7/dist-packages/allennlp/data/dataset_readers/dataset_reader.py", line 80, in __iter__
    for instance in self._instance_generator(self._file_path):
  File "/usr/local/lib/python3.7/dist-packages/allennlp/data/dataset_readers/dataset_reader.py", line 446, in _instance_iterator
    yield from self._multi_worker_islice(self._read(file_path), ensure_lazy=True)
  File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1180, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.7/dist-packages/declutr/dataset_reader.py", line 124, in _read
    file_path = cached_path(file_path)
  File "/usr/local/lib/python3.7/dist-packages/allennlp/common/file_utils.py", line 175, in cached_path
    raise FileNotFoundError(f"file {url_or_filename} not found")
FileNotFoundError: file None not found

Also, when I downloaded the model locally in Colab and referenced its path, I was able to use the model, but then it stops with this error:

 File "/usr/local/lib/python3.7/dist-packages/allennlp/modules/seq2vec_encoders/boe_encoder.py", line 43, in forward
    tokens = tokens * mask.unsqueeze(-1)
RuntimeError: The size of tensor a (512) must match the size of tensor b (503) at non-singleton dimension 1

JohnGiorgi commented 2 years ago

The first error looks like one of your data paths (train_data_path or validation_data_path) is wrong or non-existent.
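The traceback also ends in cached_path(None) and "file None not found", which suggests the path was never set at all. A minimal pre-flight check along these lines can rule that out (the paths below are placeholders for whatever you pass as train_data_path / validation_data_path):

from pathlib import Path

# Placeholder paths; substitute whatever you set in your config or overrides.
paths = {
    "train_data_path": "path/to/train.txt",
    "validation_data_path": "path/to/valid.txt",
}
for name, path in paths.items():
    if path is None or not Path(path).is_file():
        print(f"{name} is missing or does not exist: {path!r}")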

As for the second, I am not sure. At some point AllenNLP tries to multiply a token tensor of size 512 by a mask of size 503. You may have to look into the specifics of Longformer to see how to use it properly, or try another model that accepts long input sequences.

kingafy commented 2 years ago

Unfortunately, it is not clear how to train longer-sequence models with DeCLUTR on AllenNLP. Is there a way I can use DeCLUTR-base on longer context sequences with some striding parameters, or any other way to handle longer sequences for semantic tasks?

JohnGiorgi commented 2 years ago

Could you provide more detail as to exactly what you are trying to do? Are you trying to re-train the model with a longer max_length? Or are you trying to use a trained model on longer sequences?

kingafy commented 2 years ago

I want to handle longer sequences for semantic tasks, so there are two approaches:

  1. Use the DeCLUTR objective on a long-sequence model to support longer sequences.
  2. Use DeCLUTR on RoBERTa-base or RoBERTa-large, but with some sort of striding logic so that longer input sequences do not create bottlenecks during inference. I hope this explains the task at hand.

JohnGiorgi commented 2 years ago

Training on longer sequences is going to be tricky as you would need to collect even longer training documents.

I would go with approach 2. AllenNLP has support for chunking text into blocks of max_length, embedding each block, and then concatenating the embeddings. The general approach would be:

  1. Set the max_length argument of the tokenizer to the maximum length of documents you want to embed.
  2. Set the max_length argument of the PretrainedTransformerIndexer to 512 (the maximum input size of our pretrained transformer)
  3. Set the max_length argument of the PretrainedTransformerEmbedder to 512 (the maximum input size of our pretrained transformer)

The setup would go something like:

from declutr import Encoder

# Tokenize documents up to 1024 tokens, but index and embed them in
# 512-token blocks (the maximum input size of the pretrained transformer).
overrides = (
    '{"dataset_reader.tokenizer.max_length": 1024, '
    '"dataset_reader.token_indexers.tokens.max_length": 512, '
    '"model.text_field_embedder.token_embedders.tokens.max_length": 512}'
)

encoder = Encoder("declutr-base", overrides=overrides)

# Embed a long piece of text (anything beyond the tokenizer's max_length is truncated).
text = " ".join(["this is a very long string"] * 2048)
encoder(text)

I have not extensively tested this code, but it should run and be enough to get you started. In particular, check out the max_length argument of PretrainedTransformerIndexer and PretrainedTransformerEmbedder for more details.
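For the semantic tasks mentioned earlier in the thread, usage would then follow the same pattern as the repository README (a sketch; the texts below are placeholders):

from scipy.spatial.distance import cosine

from declutr import Encoder

# Same overrides as above: tokenize up to 1024 tokens, embed in 512-token chunks.
overrides = (
    '{"dataset_reader.tokenizer.max_length": 1024, '
    '"dataset_reader.token_indexers.tokens.max_length": 512, '
    '"model.text_field_embedder.token_embedders.tokens.max_length": 512}'
)
encoder = Encoder("declutr-base", overrides=overrides)

# Two long placeholder documents; the Encoder also accepts a list of strings.
texts = [
    " ".join(["the first long document"] * 300),
    " ".join(["the second long document"] * 300),
]
embeddings = encoder(texts)

# Cosine similarity between the two document embeddings.
print(1 - cosine(embeddings[0], embeddings[1]))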

JohnGiorgi commented 2 years ago

Closing; please feel free to re-open or file another issue if your questions are not answered.