Closed VeriHeppi closed 18 hours ago
Where to get possible training/testing data for the LLM of choice?
avoindata.fi
ChatGPT prompt:
I am creating an LLM-based text classifier. The purpose of this project is to input text-based data and have the program analyse and classify themes and other things from the data. The program will run in the cloud, will be created using Python, and will also use Docker.
What kind of data would be best for testing such a program? What format should the data be in? How much data is enough for tests?
Answer:
For testing your LLM-based text classifier, selecting the right data, format, and quantity is crucial for evaluating its performance. Here are some guidelines:
The best testing data depends on the nature of your text classification tasks, but here are some commonly used types:
Real-World Data: Try using real-world, unstructured text data to see how well the model handles noise such as typos, slang, and abbreviations, which are common in user-generated content.
Sources:
The most common format for text classification tasks is plain text files with corresponding labels. Here are a few formats:
CSV: Each row contains a text sample and a label. Example:
text,label
"The new smartphone model has excellent features.", "tech"
"The stock market saw significant growth today.", "business"
JSON: Useful for more complex structures where you may need to include additional metadata. Example:
[
{"text": "The new smartphone model has excellent features.", "label": "tech"},
{"text": "The stock market saw significant growth today.", "label": "business"}
]
TXT: Plain text files where each line contains the text followed by its label, often used for simple applications.
Hugging Face Dataset Format: If you're using libraries like Transformers by Hugging Face, you can directly load their datasets in a structured format.
Recommendation: Use CSV or JSON for initial testing, as they are human-readable and easy to manipulate in Python.
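The CSV layout above is straightforward to load with Python's standard library. A minimal sketch, using `io.StringIO` in place of a real file (the file name `test_data.csv` would be your own choice):

```python
import csv
import io

# In-memory sample in the CSV layout shown above; io.StringIO stands in
# for opening a real file such as test_data.csv.
sample = io.StringIO(
    'text,label\n'
    '"The new smartphone model has excellent features.",tech\n'
    '"The stock market saw significant growth today.",business\n'
)

# DictReader maps each row to {"text": ..., "label": ...}.
rows = list(csv.DictReader(sample))
texts = [row["text"] for row in rows]
labels = [row["label"] for row in rows]

print(labels)  # -> ['tech', 'business']
```

The same data in the JSON layout loads with a single `json.load` call, which is why both formats work well for early, hand-inspected tests.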
Stress Testing: For scaling tests or production-level deployment, larger datasets (50,000+ samples) are beneficial to assess how well the classifier handles large volumes of data.
General Rule:
Ensure you have an even distribution of samples across all categories to avoid skewing model performance toward any single class. For instance, if you're classifying text into five categories, aim for a roughly equal number of examples in each category. If your dataset is imbalanced, you might consider techniques like oversampling or using metrics like weighted accuracy during evaluation.
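The balance check and naive random oversampling mentioned above can be sketched in a few lines. This is an illustrative toy (the sample data is made up, and a real project might use a library such as imbalanced-learn instead):

```python
import random
from collections import Counter

# Hypothetical imbalanced test set: 5 "tech" samples, 2 "business" samples.
samples = [("text a", "tech")] * 5 + [("text b", "business")] * 2

counts = Counter(label for _, label in samples)
majority = max(counts.values())

# Naive random oversampling: duplicate minority-class samples at random
# until every class matches the majority class size.
random.seed(0)
balanced = list(samples)
for label, n in counts.items():
    pool = [s for s in samples if s[1] == label]
    balanced += random.choices(pool, k=majority - n)

print(Counter(label for _, label in balanced))  # each class now has 5 samples
```

Checking the label distribution like this before a test run is a cheap way to spot skew that would otherwise mislead accuracy numbers.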
Data that could be good for software testing:
https://www.avoindata.fi/data/fi/dataset/helsingin-vapaa-aikakyselyn-vastaukset (2.4 MB)
https://www.avoindata.fi/data/fi/dataset/turun-kaupungin-turvallisuuskysely (1.3 MB)
This for larger datasets: https://www.kaggle.com/datasets
Find out where to get test data.