The only changes you should have to make are:

- max_length, if you want to train on longer contexts
- transformer_model, to a HF model that supports longer input sequences (e.g. allenai/longformer-base-4096)

Of course I could be forgetting something. Feel free to follow up here if these changes cause an error.
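For concreteness, here is a minimal sketch of what those two changes might look like as an AllenNLP overrides payload. The key paths below (dataset_reader.tokenizer.model_name, etc.) are assumptions modeled on the overrides shown later in this thread, not confirmed against DeCLUTR's training config, so verify them against your .jsonnet file before using them:

import json

# Hypothetical sketch: swap in a longer-input HF model and raise max_length.
# The config key paths are assumptions; check training_config/declutr.jsonnet
# (or your own config) for the real names.
overrides = json.dumps({
    "dataset_reader.tokenizer.model_name": "allenai/longformer-base-4096",
    "dataset_reader.tokenizer.max_length": 4096,
})
print(overrides)  # pass to: allennlp train ... --overrides "<this string>"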
I tried integrating allenai/longformer-large-4096 but am running into an issue:
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 28, in fetch
data.append(next(self.dataset_iter))
File "/usr/local/lib/python3.7/dist-packages/allennlp/data/dataset_readers/dataset_reader.py", line 80, in __iter__
for instance in self._instance_generator(self._file_path):
File "/usr/local/lib/python3.7/dist-packages/allennlp/data/dataset_readers/dataset_reader.py", line 446, in _instance_iterator
yield from self._multi_worker_islice(self._read(file_path), ensure_lazy=True)
File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1180, in __iter__
for obj in iterable:
File "/usr/local/lib/python3.7/dist-packages/declutr/dataset_reader.py", line 124, in _read
file_path = cached_path(file_path)
File "/usr/local/lib/python3.7/dist-packages/allennlp/common/file_utils.py", line 175, in cached_path
raise FileNotFoundError(f"file {url_or_filename} not found")
FileNotFoundError: file None not found
Also, when I downloaded the model locally in Colab and referenced the path, I was able to use the model, but then it stops with this error:
File "/usr/local/lib/python3.7/dist-packages/allennlp/modules/seq2vec_encoders/boe_encoder.py", line 43, in forward
tokens = tokens * mask.unsqueeze(-1)
RuntimeError: The size of tensor a (512) must match the size of tensor b (503) at non-singleton dimension 1
The first error looks like one of your data paths is wrong or non-existent (train_data_path or validation_data_path).

As for your second error, I am not sure. At some point AllenNLP tries to multiply tokens of size 512 with a mask of size 503. You may have to look into the specifics of Longformer to see how to use it properly, or try another model that accepts long input sequences.
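To make that diagnosis concrete, here is a minimal reproduction of the shape mismatch with made-up tensors (a toy sketch, not DeCLUTR's actual code path):

import torch

# Toy reproduction of the error above: tokens with sequence length 512
# cannot be broadcast against a mask with sequence length 503.
tokens = torch.randn(1, 512, 768)  # (batch, seq_len, hidden)
mask = torch.ones(1, 503)          # (batch, seq_len') with a mismatched length
try:
    tokens * mask.unsqueeze(-1)    # mask becomes (1, 503, 1)
except RuntimeError as e:
    print(e)  # The size of tensor a (512) must match the size of tensor b (503) ...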
Unfortunately, it is not clear how to train longer-sequence models using DeCLUTR on AllenNLP. Is there a way I can use declutr-base on longer context sequences with some striding parameters, or any other way to handle longer sequences for semantic tasks?
Could you provide more detail as to exactly what you are trying to do? Are you trying to re-train the model with a longer max_length? Or are you trying to use a trained model on longer sequences?
I want to handle longer sequences for semantic tasks. So there are two approaches:

1. Re-train the model with a longer max_length.
2. Use the trained model on longer sequences.
Training on longer sequences is going to be tricky as you would need to collect even longer training documents.
I would go with approach 2. AllenNLP has support for chunking up some text into blocks of max_length, embedding each, and then concatenating the embeddings. The general approach would be:

- Set the max_length argument of the tokenizer to the maximum length of documents you want to embed.
- Set the max_length argument of the PretrainedTransformerIndexer to 512 (the maximum input size of our pretrained transformer).
- Set the max_length argument of the PretrainedTransformerEmbedder to 512 (the maximum input size of our pretrained transformer).

The setup would go something like:
from declutr import Encoder

# Truncate documents to 1024 tokens, then fold them into 512-token blocks
# for the transformer; the block embeddings are combined downstream.
overrides = '{"dataset_reader.tokenizer.max_length": 1024, "dataset_reader.token_indexers.tokens.max_length": 512, "model.text_field_embedder.token_embedders.tokens.max_length": 512}'
encoder = Encoder("declutr-base", overrides=overrides)
text = " ".join(["this is a very long string"] * 2048)
embeddings = encoder(text)
I have not extensively tested this code, but it should run and be enough to get started. In particular, you should check out the max_length argument of PretrainedTransformerIndexer and PretrainedTransformerEmbedder for more details.
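For intuition, here is a toy sketch of the chunk-embed-concatenate idea behind those max_length arguments (my own illustration, not AllenNLP's actual implementation):

from typing import List

def chunk_ids(token_ids: List[int], max_length: int) -> List[List[int]]:
    # Split a long token-id sequence into blocks of at most max_length;
    # each block would be embedded separately and the results concatenated.
    return [token_ids[i : i + max_length] for i in range(0, len(token_ids), max_length)]

# A 1024-token document becomes two 512-token blocks.
print([len(block) for block in chunk_ids(list(range(1024)), 512)])  # [512, 512]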
Closing. Please feel free to re-open and file another issue if your questions are not answered.
I am planning to add a longer-sequence model to DeCLUTR so that similar-context tasks are not limited by sequence length. Is there any limitation to implementing this?