top_p argument is used like 1-top_p

For example, top_p=0.999 gives you nearly deterministic sampling, not nearly on-distribution sampling.

I was confused about why I was getting much less diverse samples with top_p=0.95 than with top_p turned off.

I found the cause in these lines:

https://github.com/Aleph-Alpha/magma/blob/bfd5c8def6a290f98b7eae34da120756f708cd38/magma/sampling.py#L11-L14

threshold is set to top_p here:

https://github.com/Aleph-Alpha/magma/blob/bfd5c8def6a290f98b7eae34da120756f708cd38/magma/sampling.py#L101-L102

Suppose, e.g., threshold is 0.95. Then 1-threshold is 0.05.

So we remove all tokens whose cumulative probability exceeds 0.05, which is most of the tokens. In effect we are doing top-p sampling with top_p=0.05 (in the usual convention), not the intended top_p=0.95.
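To make the difference concrete, here is a minimal standalone sketch in plain Python. It is not the code from sampling.py: `kept_tokens` and the toy distribution are hypothetical, and the shift-by-one that always keeps the top token is an assumption modeled on common nucleus-sampling implementations. It only compares a cutoff of top_p against a cutoff of 1-top_p.

```python
from itertools import accumulate


def kept_tokens(probs, cutoff):
    """Return the indices that survive a nucleus-style filter (hypothetical helper).

    Tokens are visited in order of descending probability. A token is dropped
    once the cumulative probability of the tokens before it exceeds `cutoff`.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    cum = list(accumulate(probs[i] for i in order))
    kept = []
    for rank, idx in enumerate(order):
        # Mass of all higher-ranked tokens; 0 for the top token, so it is always kept.
        mass_before = cum[rank - 1] if rank > 0 else 0.0
        if mass_before > cutoff:
            break
        kept.append(idx)
    return kept


# Toy distribution (made up for illustration), already sorted for readability.
probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]
top_p = 0.95

intended = kept_tokens(probs, top_p)      # usual convention: cutoff = top_p
reported = kept_tokens(probs, 1 - top_p)  # behavior described in this issue

print("cutoff = top_p     ->", intended)
print("cutoff = 1 - top_p ->", reported)
```

On this toy distribution the usual convention keeps all five tokens, while the 1-top_p cutoff keeps only the most likely one, which would match the near-deterministic samples described above.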
