MAGMA - a GPT-style multimodal model that can understand any combination of images and language. NOTE: The freely available model from this repo is only a demo. For the latest multimodal and multilingual models from Aleph Alpha check out our website https://app.aleph-alpha.com
I was confused why I was getting much less diverse samples with `top_p=0.95` than I got with top_p turned off. For example, `top_p=0.999` gives you nearly deterministic sampling, not nearly on-distribution sampling.
I found the cause in these lines:
https://github.com/Aleph-Alpha/magma/blob/bfd5c8def6a290f98b7eae34da120756f708cd38/magma/sampling.py#L11-L14
`threshold` is set to `top_p` here: https://github.com/Aleph-Alpha/magma/blob/bfd5c8def6a290f98b7eae34da120756f708cd38/magma/sampling.py#L101-L102
Suppose e.g. `threshold` is 0.95. Then `1-threshold` is 0.05.
So we remove all tokens where the cumulative probs are > 0.05, which is most of the tokens. We are really doing top-p sampling with top_p=0.05 (in the usual convention), not the intended top_p=0.95.
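
To make the effect concrete, here is a minimal sketch in plain Python. It is not the code from `sampling.py`: `top_p_keep` and the toy distribution are made up for illustration, and I am assuming the usual nucleus-sampling rule where tokens are kept in descending order of probability until the cumulative mass exceeds the cutoff (the token that crosses the cutoff is kept).

```python
def top_p_keep(probs, cutoff):
    """Return the indices kept by a nucleus filter with the given cutoff.

    Hypothetical helper for illustration only. Tokens are visited from most
    to least probable and kept until the running sum exceeds `cutoff`.
    The most probable token is always kept.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative > cutoff:
            break
    return kept


# Toy next-token distribution (assumed values, already sorted for readability).
probs = [0.6, 0.2, 0.1, 0.06, 0.04]
top_p = 0.95

# Behaviour described in this issue: the cutoff is 1 - top_p = 0.05.
kept_as_described = top_p_keep(probs, 1 - top_p)

# Usual convention: the cutoff is top_p itself.
kept_usual = top_p_keep(probs, top_p)

print(kept_as_described)  # [0]
print(kept_usual)         # [0, 1, 2, 3]
```

With the `1 - top_p` cutoff only the single most likely token survives, so sampling is effectively greedy. With the usual convention, four of the five tokens stay in the candidate set.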