One of TabPFN's limitations is that it cannot handle more than 100 features in a dataset. Our goal is to find a feature sub-sampling strategy that still preserves TabPFN's performance guarantees. We are incorporating the following approaches:
[ ] Sub-sampling: SelectKBest (supervised); KMeans (unsupervised: cluster the features into 100 clusters, then either take the feature closest to each cluster centroid OR sample features randomly from each cluster)
[ ] Random feature selection
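The three approaches above could be sketched as follows. This is a minimal illustration, not our final implementation: function names, the choice of `f_classif` as the SelectKBest score function, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import pairwise_distances_argmin

MAX_FEATURES = 100  # TabPFN's feature limit

def select_k_best(X, y, k=MAX_FEATURES):
    """Supervised: keep the k features with the highest ANOVA F-score."""
    selector = SelectKBest(f_classif, k=k).fit(X, y)
    return selector.get_support(indices=True)

def kmeans_centroid_features(X, k=MAX_FEATURES, random_state=0):
    """Unsupervised: cluster the *features* (columns of X, i.e. rows of X.T)
    into k clusters and keep the feature closest to each centroid."""
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X.T)
    # for each centroid, the index of the nearest feature vector
    return np.unique(pairwise_distances_argmin(km.cluster_centers_, X.T))

def random_features(X, k=MAX_FEATURES, random_state=0):
    """Baseline: keep k features chosen uniformly at random."""
    rng = np.random.default_rng(random_state)
    return np.sort(rng.choice(X.shape[1], size=k, replace=False))

# toy dataset with more features than TabPFN accepts
X, y = make_classification(n_samples=200, n_features=150, random_state=0)
idx = select_k_best(X, y)
X_sub = X[:, idx]  # X_sub now fits within TabPFN's 100-feature limit
```

The "random sampling from each cluster" variant would replace the `pairwise_distances_argmin` call with a draw from `km.labels_` per cluster; it is omitted here for brevity.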
We can also parameterize the selection strategy, potentially with a "mix" option that combines multiple sampling approaches in a single run. Parameterization would also make the strategy tunable via grid search or other hyper-parameter tuning approaches.
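One way the parameterized interface and the "mix" option could look is sketched below. The function name `subsample_features`, the strategy registry, and the two lightweight stand-in strategies (variance-based and random) are hypothetical; the point is that `strategy` becomes a single string hyper-parameter that a grid search can sweep over.

```python
import numpy as np

def by_variance(X, y, k, rng):
    # unsupervised stand-in: keep the k highest-variance features
    return np.argsort(X.var(axis=0))[-k:]

def by_random(X, y, k, rng):
    # baseline: k features chosen uniformly at random
    return rng.choice(X.shape[1], size=k, replace=False)

STRATEGIES = {"variance": by_variance, "random": by_random}

def subsample_features(X, y=None, strategy="random", k=100, seed=0):
    """Select at most k feature indices; `strategy` is a tunable
    hyper-parameter, and "mix" splits the budget across all strategies."""
    rng = np.random.default_rng(seed)
    if strategy == "mix":
        per = k // len(STRATEGIES)
        idx = np.concatenate([f(X, y, per, rng) for f in STRATEGIES.values()])
        idx = np.unique(idx)
        # top up with random features if de-duplication lost some budget
        if len(idx) < k:
            pool = np.setdiff1d(np.arange(X.shape[1]), idx)
            idx = np.concatenate([idx, rng.choice(pool, k - len(idx), replace=False)])
        return np.sort(idx)
    return np.sort(STRATEGIES[strategy](X, y, k, rng))
```

With this shape, `strategy` (and `k`, `seed`) can sit directly in a grid-search parameter grid, e.g. `{"strategy": ["variance", "random", "mix"]}`.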