Open · rupnic opened this issue 3 months ago

New to the field and might be completely off the mark here, but was any consideration given to the fact that, because the datasets used are fairly widely referenced and repeated, they might have formed part of the original training data for the foundation models, and that this might have boosted model performance relative to using a novel dataset?

This is a good question. I think we can discuss this from two perspectives: training and evaluation.
Thanks for the prompt response. The train-test split approach seems standard and reasonable. It still concerns me a little that the foundation model's training sources might inflate performance on these widely available sets. As you suggest, it is difficult to audit the training data used in the foundation models. Nevertheless, even if one poses an elementary question to GPT-4 such as "which features are typically considered redundant in the Porto Seguro dataset", it reasonably suggests redundant features by category, and it also has knowledge of the best-performing models for the dataset. Clearly that is some way off from recognizing the data being received and evaluated in your analysis, but it suggests there might be some advantage derived from the foundation training. It would be reassuring to see some results if someone were to contribute a tokenized but previously undisclosed dataset for which they have SOTA scores.
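For reference, the kind of probe I mean takes only a few lines; this is a minimal sketch, assuming the `openai` Python client and an `OPENAI_API_KEY` in the environment (my setup, not anything from this repo):

```python
# Sketch of the probe described above: ask a foundation model dataset-specific
# questions it could only answer if the benchmark is well represented in its
# pretraining data. Assumes the openai Python client (>=1.0) and an
# OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

questions = [
    "Which features are typically considered redundant in the Porto Seguro "
    "Safe Driver Prediction dataset?",
    "Which model families have historically scored best on the Porto Seguro "
    "Safe Driver Prediction dataset?",
]

for q in questions:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": q}],
    )
    # Specific, accurate answers here are at least suggestive that the public
    # benchmark appears in the model's pretraining corpus.
    print(q, "\n->", response.choices[0].message.content, "\n")
```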
Thank you for your valuable suggestion. As we mentioned previously, we are currently working with some collaborators to gather new, undisclosed datasets for evaluation (we may release an update this year, depending on the license of the private datasets), and if you have any relevant data that you'd be willing to share, we would be happy to collaborate with you to further investigate this matter.
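To make the intended comparison concrete, here is a rough sketch of the kind of check we have in mind once such data is available; the `FoundationTabClassifier` wrapper, the data loading, and the scikit-learn baseline are placeholders for illustration, not our actual evaluation harness:

```python
# Hypothetical contamination check: run the same protocol on a widely
# published benchmark and on a previously undisclosed table, then compare how
# much of the pretrained model's edge over a from-scratch baseline survives on
# the unseen data. All model/loader names are placeholders.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def edge_over_baseline(model, X, y, seed=0):
    """AUC gap between a (pretrained) model and a from-scratch baseline."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    baseline = GradientBoostingClassifier(random_state=seed).fit(X_tr, y_tr)
    auc_base = roc_auc_score(y_te, baseline.predict_proba(X_te)[:, 1])

    model.fit(X_tr, y_tr)  # fine-tune / in-context fit, depending on the model
    auc_model = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    return auc_model - auc_base


# public_gap = edge_over_baseline(FoundationTabClassifier(), X_public, y_public)
# private_gap = edge_over_baseline(FoundationTabClassifier(), X_private, y_private)
# A public_gap substantially larger than private_gap would be consistent with
# the foundation model having seen the public benchmark during pretraining.
```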