xiyuanzh opened this issue 4 months ago
Maybe you haven't trained the model enough. As I remember, there is a strange behavior: the model takes a lot of time before it starts to converge. Here you are using the first save, which leads to random predictions.
Try to evaluate with the saved model in the resources folder, or train yours for longer. The strange behavior is reflected in the metric (see it on MLflow): after a few thousand backward steps, the metric begins to decrease and the model gets quite good results (not as good as the original paper, but not bad).
Thank you so much for the prompt response! I will try training the model longer to see whether the performance improves.
I tried loading the model under the resources folder, but I got the error "_pickle.UnpicklingError: invalid load key, 'v'." when the following line in infer.py was executed: "tab_pfn.load_state_dict(th.load(infer_options.state_dict))".
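As an aside, and only a guess: that error usually means the file handed to th.load is not actually a pickle/zip checkpoint. A quick way to check (path_to_state_dict below is a hypothetical stand-in for the .pt file passed to infer.py):

# Peek at the first bytes of the checkpoint file: a torch zip checkpoint starts
# with b'PK', a legacy pickled checkpoint with b'\x80', while a Git LFS pointer
# (or any other text file) starts with b'version ...', which would explain the
# load key 'v'.
path_to_state_dict = "resources/model.pt"  # hypothetical path; use the file you tried to load
with open(path_to_state_dict, "rb") as f:
    print(f.read(16))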
I also made two changes to your code and want to check whether my understanding is correct. The first is the target mask function:
def __get_tgt_mask(self, x_test: th.Tensor) -> th.Tensor:
    device = self.__get_device()
    sz = x_test.size(1)
    # causal mask: -inf strictly above the diagonal, 0 on and below it
    mask = th.triu(th.ones(sz, sz, device=device) * float('-inf'), diagonal=1)
    # one copy of the mask per attention head and per batch element
    mask = mask.repeat(x_test.size(0) * self.__nheads, 1, 1)
    return mask
The second is this indexing. The original lines are:
x = outs_stacked[:, *self.__zx_nodes_idx].squeeze(-1)
y = outs_stacked[:, *self.__zy_node_idx].squeeze(-1)
I changed them to
x = outs_stacked[:, self.__zx_nodes_idx[0], self.__zx_nodes_idx[1]].squeeze(-1)
y = outs_stacked[:, self.__zy_node_idx[0], self.__zy_node_idx[1]].squeeze(-1)
Could you help check if my understanding is correct? Thanks a lot!
For the target mask, it needs to be diagonal: there is no relation between target observations, so each target only sees itself inside the transformer (as I understand the paper).
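As a minimal sketch of that diagonal mask (not the repo's exact code, and assuming PyTorch's additive attention-mask convention where 0.0 means "attend" and -inf means "blocked"):

import torch as th

def diagonal_tgt_mask(sz: int, device: th.device) -> th.Tensor:
    # -inf everywhere: no attention between different target observations...
    mask = th.full((sz, sz), float("-inf"), device=device)
    # ...except on the diagonal, so each target only sees itself.
    mask.fill_diagonal_(0.0)
    return mask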
For the starred expression inside the subscript, which Python version do you use? It may be a feature of a recent Python version.
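For reference, a small self-contained check (with made-up shapes and index tensors, not the repo's real ones) that the two indexings agree; as far as I know the starred subscript only parses on Python 3.11+:

import torch as th

outs_stacked = th.randn(4, 8, 3, 1)                     # made-up shape, only for illustration
zx_nodes_idx = (th.tensor([1, 2]), th.tensor([0, 1]))   # made-up index pair

a = outs_stacked[:, *zx_nodes_idx].squeeze(-1)                      # starred subscript (Python 3.11+)
b = outs_stacked[:, zx_nodes_idx[0], zx_nodes_idx[1]].squeeze(-1)   # explicit indexing
assert th.equal(a, b)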
For the state dict loading that fails, let me check. Could you try it on the develop-sam branch? I have possibly made modifications and may need to re-push an up-to-date model state dict. Also, which PyTorch version do you use?
Thanks so much for the explanation! This is very helpful! I ran model_183295.pt on the develop-sam branch across all 30 test datasets in TabPFN and attached the results below. The mean accuracy is around 0.8. Are these numbers close to what you reproduced? Thanks!
 0  balance-scale                       0.891026
 1  mfeat-fourier                       0.795000
 2  breast-w                            0.974212
 3  mfeat-karhunen                      0.959000
 4  mfeat-morphological                 0.724000
 5  mfeat-zernike                       0.818000
 6  cmc                                 0.523098
 7  credit-approval                     0.843478
 8  credit-g                            0.758000
 9  diabetes                            0.736979
10  tic-tac-toe                         0.736952
11  vehicle                             0.747045
12  eucalyptus                          0.668478
13  analcatdata_authorship              0.976190
14  analcatdata_dmft                    0.218593
15  pc4                                 0.901235
16  pc3                                 0.891165
17  kc2                                 0.819923
18  pc1                                 0.927798
19  banknote-authentication             0.957746
20  blood-transfusion-service-center    0.840607
21  ilpd                                0.676976
22  qsar-biodeg                         0.995627
23  wdbc                                0.788770
24  cylinder-bands                      0.722222
25  dresses-sales                       0.596000
26  MiceProtein                         0.994444
27  car                                 0.731959
28  steel-plates-fault                  0.944444
29  climate-model-simulation-crashes    0.853009
mean                                    0.800399
Yes, that looks exactly like the metrics I can reach. If everything is okay on your side, you can close this issue. And if you like this implementation, don't hesitate to star and share it :)
Sure, thanks! I also ran into this error: "RuntimeError: normal expects std >= 0.0, but found std -inf" for the following line of code, "nn.init.normal_(module.weight, std=tnlu_float(1e-2, 10, 1e-8))", in scm.py after about 20k iterations. Is this expected? I changed this line to "nn.init.normal_(module.weight, std=max(0, tnlu_float(1e-2, 10, 1e-8)))".
No, it's not the expected behaviour of the TNLU; it may be a mistake on my side in its implementation. I will try to fix it by re-reading the paper (and also add unit tests for it!). I will tell you when I have successfully fixed it ;)
I think I have fixed it. What I saw during test execution was a numerical precision issue (getting -10.0001 when the lower bound is -10, for example); to avoid this, I now explicitly clamp the truncated normal results to their bounds.
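Something like this minimal sketch (names are hypothetical, not necessarily the exact code in the repo):

import torch as th

def clamp_to_bounds(samples: th.Tensor, lower: float, upper: float) -> th.Tensor:
    # Force truncated-normal samples back inside [lower, upper], absorbing tiny
    # numerical overshoots such as -10.0001 for a lower bound of -10.
    return th.clamp(samples, min=lower, max=upper)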
Can you test it on your side?
Thanks so much for the update! I tested it and found that the loss quickly went to NaN after ~60k iterations. Did you observe a similar phenomenon on your side?
Sorry for the delay in answering. I haven't seen this numerical issue.
Maybe my SCM implementation is not identical to what they did in the original paper: there are many subtleties that I resolved arbitrarily. Or maybe the default hyper-parameters in the main script are not good and cause this numerical issue. I will try to re-train it next week and see if I also get NaN during training.
Can you share all the hyper-parameters you chose?
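In the meantime, if it helps narrow things down, here is a small hedged sketch (not part of the repo) for catching the first non-finite loss during training:

import torch as th

# Make backward raise on the operation that first produces NaN/inf.
th.autograd.set_detect_anomaly(True)

def check_finite(loss: th.Tensor, step: int) -> None:
    # Fail fast on the first non-finite loss so the offending iteration
    # (and the sampled SCM / hyper-parameters) can be inspected.
    if not th.isfinite(loss).all():
        raise RuntimeError(f"non-finite loss at step {step}: {loss}")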
Hi, thanks so much for your response! I tried multiple runs and all of them got NaNs for the loss. I used the default hyper-parameters, i.e., "python -m tab_pfn.main --cuda train run_debug model_debug --batch-size 10". Please let me know if I need to provide any additional information, thanks!
Hi,
Thanks a lot for sharing this repo! I trained the model and evaluated it on the balance-scale and mfeat-fourier datasets (the first two datasets evaluated in TabPFN). For both datasets, the model predicts all rows as one class. Moreover, the accuracy on the training set stays around 0.5. May I know if there are any configurations I need to take care of? This is how I train and test the model:
python -m tab_pfn.main train debug debug
python -m tab_pfn.main infer balance-scale debug/model_4095.pt debug_output --class-col target
Thanks!