JohnGiorgi / DeCLUTR

The corresponding code for our paper "DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations". Do not hesitate to open an issue if you run into any trouble!
https://aclanthology.org/2021.acl-long.72/
Apache License 2.0

Impact of "shorter" documents (span, number of tokens) for extended pretraining #235

Closed · repodiac closed this issue 3 years ago

repodiac commented 3 years ago

I am currently trying to use DeCLUTR for extended pretraining in a multilingual setting for documents of a domain-specific purpose.

I chose to use sentence-transformers/paraphrase-multilingual-mpnet-base-v2 and have approximately 100k documents of varying length. Thus, I quickly ran into the "known" token/span length error (see https://github.com/JohnGiorgi/DeCLUTR/blob/73a19313cd1707ce6a7a678451f41a3091205d4e/declutr/common/contrastive_utils.py#L48)

Since I cannot change the data I have right now, I tried to adjust the span lengths, namely max_length and min_length in the config declutr.jsonnet (everything else remained as is), and filtered my dataset to the resulting minimum token length, which does work in this particular setting. As a result I have ~10k documents left. Is that enough for extended pretraining in a specific domain?

I ended up using min_length = 8 and max_length = 32 with a minimum token length of 128 in each document (which follows from these settings, since max_length is effectively multiplied by 4 in the DeCLUTR setup).
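For reference, a minimal sketch of that arithmetic (assuming the floor is num_anchors * max_length * 2, which with the default num_anchors = 2 gives the factor of 4; the helper name is only illustrative):

```python
# Hypothetical helper: reproduces the arithmetic behind the 128-token floor.
# Assumes the constraint is num_anchors * max_length * 2 tokens per document
# (one anchor plus one positive span per anchor).

def min_document_length(max_length: int, num_anchors: int = 2) -> int:
    """Smallest number of tokens a document needs for span sampling."""
    return num_anchors * max_length * 2

print(min_document_length(max_length=32))   # 128, matching the setting above
print(min_document_length(max_length=512))  # 2048, for the repo's default config
```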

My question is: does this make sense (reducing the min/max span lengths and keeping only the "longer" documents), or how should I approach this when my documents are much shorter than those in, for instance, the wikitext-103 example setup?

Are there maybe some hints or "rules of thumb" I can follow?

Thanks a lot for your help!

JohnGiorgi commented 3 years ago

Could you plot a histogram or something comparable of the token lengths of the documents in your dataset? This would help in making decisions for min_length and max_length. Also, if the majority of the data is short (e.g. less than a paragraph in length), I unfortunately don't think DeCLUTR is the most suitable approach.
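Something like the following would be enough (a rough sketch: the file name corpus.txt is a placeholder for a one-document-per-line corpus, and I am assuming you tokenize with the same Hugging Face tokenizer as the model you are extending):

```python
# Sketch of the suggested diagnostic: histogram of tokens per document.
from transformers import AutoTokenizer
import matplotlib.pyplot as plt

tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
)

# Assumes one document per line in corpus.txt (placeholder path).
with open("corpus.txt", encoding="utf-8") as f:
    lengths = [len(tokenizer.tokenize(line)) for line in f if line.strip()]

plt.hist(lengths, bins=50)
plt.xlabel("Tokens per document")
plt.ylabel("Number of documents")
plt.show()
```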

With that said, modifying min_length and max_length is reasonable but there are a couple of things to keep in mind.

  1. Ideally, min_length and max_length would be lower and upper bounds on the length of text you expect to do inference on. That way the model is trained and tested on text of similar length.
  2. We haven't really experimented with a range of min_length and max_length in the paper, so I can only say with confidence that min_length=32 and max_length=512 are good defaults. Anything else you would have to experiment with.

Thus, I quickly ran into the "known" token/span length error

It is worth noting that this is not an error per se but a limitation of the model. You need enough tokens for the span sampling procedure to make any sense.

repodiac commented 3 years ago

OK, thanks for your insights anyway. I see, obviously I may have to look for something else... I don't see much point in merging/concatenating documents just to meet your "defaults" :-/

What I don't understand: you mention that "Ideally, min_length and max_length would be lower and upper bounds on the length of text you expect to do inference on," but you require the training documents to be at least 4x (!) that upper bound!? This doesn't really make sense if you would like to "fine-tune" a language model for a dedicated domain. In this (my) case I would like to train on exactly the kind of documents I expect to receive for inference (i.e. embedding) later...

JohnGiorgi commented 3 years ago

We require a multiple of 2 because we always sample at least two spans from each document (an anchor and a positive). The multiple increases when num_anchors > 1 (I think you are saying 4 because num_anchors == 2 by default?). This doesn't mean the model actually sees text of this length. It only ever sees text from token length min_length up to token length max_length. I hope that is clear. I would encourage you to check out our paper for more details, but also feel free to ask follow-up questions.
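For intuition, here is a deliberately simplified sketch of the sampling procedure (this is not the actual sampler in declutr/common/contrastive_utils.py, which is more involved; uniform lengths and placement are used here only for illustration):

```python
# Simplified illustration: draw (anchor, positive) span pairs from one document.
import random

def sample_spans(tokens, min_length, max_length, num_anchors=2):
    # The document must be long enough to fit every anchor + positive pair.
    assert len(tokens) >= num_anchors * max_length * 2, "document too short"
    pairs = []
    for _ in range(num_anchors):
        # Each sampled span is at most max_length tokens: this is the longest
        # sequence the encoder ever sees during training.
        anchor_len = random.randint(min_length, max_length)
        positive_len = random.randint(min_length, max_length)
        anchor_start = random.randint(0, len(tokens) - anchor_len)
        anchor = tokens[anchor_start : anchor_start + anchor_len]
        # Place the positive near the anchor (adjacent or overlapping).
        lo = max(0, anchor_start - positive_len)
        hi = min(len(tokens) - positive_len, anchor_start + anchor_len)
        positive_start = random.randint(lo, hi)
        positive = tokens[positive_start : positive_start + positive_len]
        pairs.append((anchor, positive))
    return pairs
```

So with max_length = 32 and num_anchors = 2, a 128-token document is just long enough to draw the pairs, but every individual input to the encoder is still at most 32 tokens.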

Again, I would try plotting the token length of your documents. This would give you a better sense of whether or not DeCLUTR is suitable. I would also check out the training notebook and the preprocess_wikitext_103.py script if you have not. They demonstrate the process of calculating min_length and then filtering WikiText-103 by it to produce a subsetted corpus of 17,824 documents.
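A stripped-down sketch of that filtering step (this is not the actual script, just the core idea; file names are placeholders and the 128-token floor matches the settings you described):

```python
# Keep only documents long enough for the span sampler.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
)
num_anchors, max_length = 2, 32
min_doc_length = num_anchors * max_length * 2  # 128 tokens in this thread's setting

# corpus.txt / corpus_filtered.txt are placeholder paths, one document per line.
with open("corpus.txt", encoding="utf-8") as infile, \
     open("corpus_filtered.txt", "w", encoding="utf-8") as outfile:
    for line in infile:
        if len(tokenizer.tokenize(line)) >= min_doc_length:
            outfile.write(line)
```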

Finally, there is also a whole family of sentence embedding models here that might be worth checking out.

repodiac commented 3 years ago

OK, but you agree that you require documents for training to be longer (with 2 anchors, at least 4 x max_length) than you actually support for inference! This might be a serious issue for practical use, at least in my case.

(I think you are saying 4 because num_anchors == 2 by default?)

yes

This doesn't mean the model actually sees text of this length. It only ever sees text from token length min_length up to token length max_length. I hope that is clear. I would encourage you to check out our paper for more details, but also feel free to ask follow-up questions.

Have only skimmed the paper, to be honest :)

Again, I would try plotting the token length of your documents. This would give you a better sense of whether or not DeCLUTR is suitable.

Ok, will do.

I would also check out the training notebook and the preprocess_wikitext_103.py script if you have not. They demonstrate the process of calculating min_length and then filtering WikiText-103 by it to produce a subsetted corpus of 17,824 documents.

I have analyzed exactly this script, mainly to see how much preprocessing is required (not much, fortunately). The WikiText documents are "huge"... there is no way my data reaches similar lengths.

Finally, there is also a whole family of sentence embedding models here that might be worth checking out.

Thanks, I am already using a Sentence Transformers model as the base for extended pretraining, as I wrote: sentence-transformers/paraphrase-multilingual-mpnet-base-v2

repodiac commented 3 years ago

Just FYI: [histogram of the token lengths per document in my dataset]

I guess I am in "uncharted territory" with DeCLUTR then and should probably look for another method that fits my use case better.

Note: the x-axis shows the document length (i.e. number of tokens) and the y-axis the number of documents with that length.

JohnGiorgi commented 3 years ago

Yes, those are very short training examples. You could try lowering the max_length accordingly and see what kind of performance you can get. Otherwise, there are some great unsupervised sentence embedding models here that you may be able to train on your data.
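One rough heuristic for picking that lower max_length (not something from the paper, just a way to trade off how much of your corpus survives the num_anchors * max_length * 2 floor):

```python
# Heuristic only: choose max_length so that a given fraction of documents
# clears the num_anchors * max_length * 2 minimum-length floor.
import numpy as np

def choose_max_length(token_lengths, num_anchors=2, keep_fraction=0.5):
    """Largest max_length that keeps roughly `keep_fraction` of the corpus."""
    lengths = np.asarray(token_lengths)
    # The (1 - keep_fraction) quantile is the length that `keep_fraction`
    # of the documents meet or exceed.
    floor = np.quantile(lengths, 1.0 - keep_fraction)
    return int(floor // (num_anchors * 2))

# Example: if half the documents have >= 96 tokens and num_anchors == 2,
# this suggests max_length = 24.
```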

JohnGiorgi commented 3 years ago

Closing, feel free to re-open if you are still having issues.