One of TabPFN's limitations is that it cannot handle more than 100 features in a dataset. Our goal is to find a feature sub-sampling strategy that still preserves TabPFN's performance guarantees. We are incorporating the following approaches:
[ ] Sub-sampling: SelectKBest (supervised); KMeans (unsupervised: cluster the features into 100 clusters, then either take the feature closest to each cluster centroid OR sample features randomly from each cluster)
[ ] Random feature selection
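The three approaches above could be sketched as follows. This is a minimal illustration, not our final implementation: function names, the choice of `f_classif` as the SelectKBest score function, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import pairwise_distances_argmin

MAX_FEATURES = 100  # TabPFN's feature limit

def select_k_best(X, y, k=MAX_FEATURES):
    """Supervised: keep the k features with the highest ANOVA F-score."""
    selector = SelectKBest(f_classif, k=k).fit(X, y)
    return selector.get_support(indices=True)

def kmeans_centroid_features(X, k=MAX_FEATURES, random_state=0):
    """Unsupervised: cluster the *features* (columns of X, i.e. rows of X.T)
    into k clusters and keep the feature closest to each centroid."""
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X.T)
    # for each centroid, the index of the nearest feature vector
    return np.unique(pairwise_distances_argmin(km.cluster_centers_, X.T))

def random_features(X, k=MAX_FEATURES, random_state=0):
    """Baseline: keep k features chosen uniformly at random."""
    rng = np.random.default_rng(random_state)
    return np.sort(rng.choice(X.shape[1], size=k, replace=False))

# toy dataset with more features than TabPFN accepts
X, y = make_classification(n_samples=200, n_features=150, random_state=0)
idx = select_k_best(X, y)
X_sub = X[:, idx]  # X_sub now fits within TabPFN's 100-feature limit
```

The "random sampling from each cluster" variant would replace the `pairwise_distances_argmin` call with a draw from `km.labels_` per cluster; it is omitted here for brevity.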
We can also parameterize the selection strategy, potentially with a "mix" option that combines multiple sampling approaches in a single run. Parameterization would also make the strategy tunable via grid search or other hyper-parameter tuning approaches.
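One way the parameterized interface and the "mix" option could look is sketched below. The function name `subsample_features`, the strategy registry, and the two lightweight stand-in strategies (variance-based and random) are hypothetical; the point is that `strategy` becomes a single string hyper-parameter that a grid search can sweep over.

```python
import numpy as np

def by_variance(X, y, k, rng):
    # unsupervised stand-in: keep the k highest-variance features
    return np.argsort(X.var(axis=0))[-k:]

def by_random(X, y, k, rng):
    # baseline: k features chosen uniformly at random
    return rng.choice(X.shape[1], size=k, replace=False)

STRATEGIES = {"variance": by_variance, "random": by_random}

def subsample_features(X, y=None, strategy="random", k=100, seed=0):
    """Select at most k feature indices; `strategy` is a tunable
    hyper-parameter, and "mix" splits the budget across all strategies."""
    rng = np.random.default_rng(seed)
    if strategy == "mix":
        per = k // len(STRATEGIES)
        idx = np.concatenate([f(X, y, per, rng) for f in STRATEGIES.values()])
        idx = np.unique(idx)
        # top up with random features if de-duplication lost some budget
        if len(idx) < k:
            pool = np.setdiff1d(np.arange(X.shape[1]), idx)
            idx = np.concatenate([idx, rng.choice(pool, k - len(idx), replace=False)])
        return np.sort(idx)
    return np.sort(STRATEGIES[strategy](X, y, k, rng))
```

With this shape, `strategy` (and `k`, `seed`) can sit directly in a grid-search parameter grid, e.g. `{"strategy": ["variance", "random", "mix"]}`.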