The-FinAI / CALM

An LLM training and evaluation benchmark for credit scoring
MIT License

Impact of using widely referenced open source data sets #5

Open rupnic opened 3 months ago

rupnic commented 3 months ago

New to the field and I might be completely off the mark here, but was any consideration given to the possibility that, because the datasets used are widely referenced and repeated, they may have formed part of the models' original foundational training data, and that this might have boosted model performance relative to a novel dataset?

colfeng commented 3 months ago

This is a good question. I think we can discuss this from two perspectives: training and evaluation.

  1. From the training perspective, there are two questions: whether the train-test split in our project is reasonable, and whether the data (including the test set) has been used to train other models. First, our train and test sets are strictly separated, so there is no leakage between them. Of course, because our training and test data share a similar distribution, training on our data will likely score better on our test set than training on new data would; this is the generalization problem often discussed in deep learning, and a capability worth further exploration. Second, since these LLMs rarely disclose their complete training data, we cannot be 100% sure that other models have not been trained on related data. However, the original data is purely tabular and not in an easily readable format like JSON, and to our knowledge we are the first to convert it into text form for LLM training and evaluation (a sketch of that kind of conversion appears after this list). Moreover, given that most LLMs perform poorly in our evaluation, we believe existing models have most likely not been trained on this dataset.
  2. From the evaluation perspective, we would be happy to collaborate on introducing new datasets for evaluation. As noted in the first point, training on new datasets and testing on the old ones, or testing on new datasets after training on the old ones, would both help estimate the generalization ability of these models. We therefore welcome anyone who wants to collaborate with us to contribute further in the field of credit and risk.
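
For illustration only, converting one tabular record into a text prompt might look like the following minimal sketch; the column names and prompt template here are hypothetical examples, not our actual schema:

```python
# Minimal sketch: serialize one tabular credit record into a text prompt
# for LLM evaluation. Feature names and wording are hypothetical.
import pandas as pd

def row_to_prompt(row: pd.Series) -> str:
    # Render each feature as a "name: value" clause, then state the task.
    features = "; ".join(f"{col}: {row[col]}" for col in row.index)
    return (
        "Assess the credit risk of the following applicant.\n"
        f"{features}\n"
        "Answer with 'good' or 'bad'."
    )

df = pd.DataFrame([{"age": 35, "income": 52000, "num_late_payments": 1}])
print(row_to_prompt(df.iloc[0]))
```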
rupnic commented 3 months ago

Thanks for the prompt response. The train-test split approach seems standard and reasonable. It still concerns me a little that the foundation models' training sources might inflate performance on these widely available datasets. As you suggest, it is difficult to audit the training data used in the foundation models. Nevertheless, even posing an elementary question to GPT-4 such as "which features are typically considered redundant in the Porto Seguro dataset" yields reasonable suggestions of redundant features by category, and the model also knows the best-performing models for the dataset (see the probe sketch below). That is clearly some way from recognizing the data being received and evaluated in your analysis, but it suggests there may be some advantage derived from the foundation training. It would be reassuring to see results from a tokenized but previously undisclosed dataset for which a contributor holds SOTA scores.
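
For concreteness, a probe of this kind could be scripted as below. This is a minimal sketch assuming the openai Python client; the model name and questions are illustrative, and confident, correct answers are only suggestive (not proof) that the dataset appeared in training:

```python
# Minimal sketch: probe a foundation model for prior knowledge of a
# public dataset. Assumes the openai client and OPENAI_API_KEY env var.
from openai import OpenAI

client = OpenAI()

probe_questions = [
    "Which features are typically considered redundant in the "
    "Porto Seguro Safe Driver Prediction dataset?",
    "Which models have historically performed best on that dataset?",
]

for question in probe_questions:
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative model name
        messages=[{"role": "user", "content": question}],
    )
    print(question)
    print(response.choices[0].message.content, "\n")
```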

colfeng commented 3 months ago

Thank you for your valuable suggestion. As mentioned previously, we are currently working with collaborators to gather new, undisclosed datasets for evaluation (possibly updating this year, depending on the licenses of the private datasets). If you have any relevant data you'd be willing to share, we would be happy to collaborate with you to investigate this matter further.