automl / TabPFN

Official implementation of the TabPFN paper (https://arxiv.org/abs/2207.01848) and the tabpfn package.
http://priorlabs.ai
Apache License 2.0

Potential issue in sampling choice variables #78

Closed · amueller closed this issue 10 months ago

amueller commented 10 months ago

So I'm not entirely sure about this since I'm still working through the code, but I think there might be an issue with the choice sampling in DifferentiableHyperParameter, in particular on older versions of Python.

You're doing

```python
def make_choice(**choices):
    # NOTE: this iterates the kwargs dict, so the logit at position i+1 is
    # implicitly paired with self.choice_values[i+1] by dict insertion order.
    choices = torch.tensor([1.0] + [choices[i] for i in choices], dtype=torch.float)
    weights = torch.softmax(choices, 0)  # turn the logits into sampling weights
    sample = torch.multinomial(weights, 1, replacement=True).numpy()[0]
    return self.choice_values[sample]
```

But `choices` is a dictionary, and dictionary key order was not guaranteed prior to Python 3.7, yet you rely on that ordering to index `self.choice_values`. Even on newer Python versions, relying on dictionary key order seems iffy: it is insertion order, which can sometimes be unexpected. This only makes a difference if you actually specified weights, so in the public code it only affects the dataset sampling. It seems to work out there, but that might just be by chance and may not be reproducible across platforms; the sketch below illustrates the hazard.
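A minimal illustration of what I mean (pure Python, no torch needed; `choice_values` mirrors the snippet above, the weight values are hypothetical): the pairing of weights to choices depends entirely on the order the kwargs dict happens to iterate in.

```python
choice_values = ["uniform", "normal", "mixed"]

# Same weights, different insertion order: the implied pairing flips.
weights_a = {"normal": 2.0, "mixed": 0.5}
weights_b = {"mixed": 0.5, "normal": 2.0}

def implied_pairing(w):
    logits = [1.0] + [w[k] for k in w]       # what the snippet builds
    return list(zip(choice_values, logits))  # how sampling would index them

print(implied_pairing(weights_a))  # [('uniform', 1.0), ('normal', 2.0), ('mixed', 0.5)]
print(implied_pairing(weights_b))  # [('uniform', 1.0), ('normal', 0.5), ('mixed', 2.0)]
```

On Python 3.7+ this is at least deterministic per call site, but before 3.7 (outside CPython 3.6) the iteration order itself was arbitrary.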

amueller commented 10 months ago

Oh, I just realized the switch to ordered keys was in 3.7, not 3.8 as I thought, so it's probably reproducible after all; still, it's a bit non-obvious that the dict's insertion order is the correct one for indexing `self.choice_values`.
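For what it's worth, one way to make the mapping explicit would be to order the logits against the choice values themselves rather than the kwargs dict. A sketch (not the repo's actual code, and assuming the choice values are strings usable as keyword names):

```python
import torch

def make_choice_explicit(choice_values, **choices):
    # Order the logits by choice_values itself, not by dict iteration order.
    # choice_values[0] keeps the implicit logit of 1.0; a missing or
    # misspelled key now raises a KeyError instead of silently reordering.
    logits = torch.tensor(
        [1.0] + [choices[v] for v in choice_values[1:]], dtype=torch.float
    )
    weights = torch.softmax(logits, 0)
    sample = torch.multinomial(weights, 1, replacement=True).item()
    return choice_values[sample]

# e.g. make_choice_explicit(["uniform", "normal", "mixed"], normal=2.0, mixed=0.5)
```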