ProjectCED / CED-LLM

Classify and Enhance Data with LLM: a project for the Tampere University course Software Engineering Project, carried out with Solita.
MIT License

Test Data Gathering #13

Closed: VeriHeppi closed this issue 18 hours ago

VeriHeppi commented 1 week ago

Find out where to get test data.

VeriHeppi commented 1 week ago

Where can we get training/testing data for the LLM of choice?

VeriHeppi commented 1 week ago

avoindata.fi

Lauri-Niemela commented 1 day ago

ChatGPT prompt:

I am creating an LLM-based text classifier. The purpose of this project is to take text-based data as input and have the program analyse and classify themes and other features of the data. The program will run in the cloud, will be written in Python, and will also use Docker.

What kind of data would be best for testing such a program? What format should the data be in? How much data is enough for tests?

Answer:

For testing your LLM-based text classifier, selecting the right data, format, and quantity is crucial for evaluating its performance. Here are some guidelines:

1. Types of Data for Testing

The best testing data depends on the nature of your text classification tasks, but here are some commonly used types:

2. Format of the Data

The most common format for text classification tasks is plain text files with corresponding labels. Here are a few formats:

3. How Much Data Is Enough for Tests?

4. Balancing Data for Testing

Ensure you have an even distribution of samples across all categories so that test results are not skewed toward any single class. For instance, if you're classifying text into five categories, aim for a roughly equal number of examples in each category. If your dataset is imbalanced, consider techniques like oversampling the under-represented classes, or evaluate with metrics that account for class frequency, such as weighted accuracy.
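A rough sketch of what the balancing advice could look like in Python, using only the standard library. The function names (`class_distribution`, `oversample`, `balanced_accuracy`) and the toy samples are made up for illustration and are not part of the project. The metric shown is balanced accuracy (the mean of per-class recall), which is one common way to read "weighted accuracy"; the answer above does not say which definition it means.

```python
import random
from collections import Counter


def class_distribution(labels):
    """Return the share of each label, to spot imbalance before testing."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def oversample(samples, seed=0):
    """Randomly duplicate samples of smaller classes until every class
    has as many samples as the largest one.

    `samples` is a list of (text, label) pairs (an assumed layout).
    """
    rng = random.Random(seed)
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing samples with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    total = Counter(y_true)
    correct = Counter()
    for truth, predicted in zip(y_true, y_pred):
        if truth == predicted:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy data, invented for this example only.
samples = [
    ("The bus was late again", "transport"),
    ("More bike lanes please", "transport"),
    ("Trams are too crowded", "transport"),
    ("The park feels unsafe at night", "safety"),
]
print(class_distribution([label for _, label in samples]))
print(Counter(label for _, label in oversample(samples)))
```

Oversampling only repeats existing texts, so it evens out class counts without adding new information. For test data it is probably more honest to keep the original distribution and report a class-aware metric next to plain accuracy.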

Conclusion

Lauri-Niemela commented 1 day ago

Data that could be good for software testing:

https://www.avoindata.fi/data/fi/dataset/eduskuntavaalien-2015-ylen-vaalikoneen-vastaukset-ja-ehdokkaiden-taustatiedot (13 MB)

https://www.avoindata.fi/data/fi/dataset/helsingin-vapaa-aikakyselyn-vastaukset (2.4 MB)

https://www.avoindata.fi/data/fi/dataset/turun-kaupungin-turvallisuuskysely (1.3 MB)

VeriHeppi commented 1 day ago

For larger datasets: https://www.kaggle.com/datasets