TimKoornstra / FinTwitBERT

FinTwitBERT: Specialized BERT Model for Financial Twitter Analysis. Trained on a large corpus of financial tweets, it is well suited to sentiment analysis, trend prediction, and other financial NLP tasks.
MIT License

Increase (bearish) labeled data #18

Open StephanAkkerman opened 9 months ago

StephanAkkerman commented 9 months ago

Bullish Sentiments: 17,368
Bearish Sentiments: 8,542
Neutral Sentiments: 12,181

It would be nice if we could balance the dataset and increase every class to, for instance, 25k samples each.

StephanAkkerman commented 9 months ago

Options are:

ADASYN (Adaptive Synthetic Sampling): This method generates synthetic data for the minority class, similar to SMOTE. However, ADASYN adapts to the dataset's characteristics and generates more synthetic data for minority-class samples that are harder to learn, rather than treating all minority-class samples equally.

SMOTE Variants: There are many variants of SMOTE, each designed to address specific issues or dataset characteristics. These include Borderline-SMOTE, K-Means SMOTE, and SVMSMOTE, among others. Each variant modifies the synthetic sample generation process in a way that is better suited for certain types of data.
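Several of these variants ship with imbalanced-learn. A rough sketch, assuming the TF-IDF features X and labels built in the ADASYN snippet further down:

from imblearn.over_sampling import BorderlineSMOTE

# Assumes X (TF-IDF matrix) and labels as in the ADASYN snippet below.
# Borderline-SMOTE only interpolates minority samples that lie near the
# class boundary, where new examples are most informative.
smote = BorderlineSMOTE(random_state=42, kind="borderline-1")
X_resampled, y_resampled = smote.fit_resample(X, labels)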

Generative Adversarial Networks (GANs): GANs can generate new, synthetic samples of data that are remarkably similar to the original dataset. They can be particularly effective for complex data types, like images or text, where simpler oversampling techniques might not capture the nuances of the data.

Ensemble Methods: Some approaches combine oversampling with ensemble learning techniques, like bagging and boosting, to create more robust models. These methods can reduce the likelihood of overfitting, which is a common risk with oversampled data.
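For the ensemble route, imbalanced-learn's BalancedBaggingClassifier is one option: it balances each bootstrap sample before fitting, so no synthetic text has to be generated at all. A sketch under the same assumptions as above:

from imblearn.ensemble import BalancedBaggingClassifier

# Assumes X (TF-IDF matrix) and labels as in the ADASYN snippet below.
# Each of the 10 bagged estimators sees a class-balanced bootstrap sample.
clf = BalancedBaggingClassifier(n_estimators=10, random_state=42)
clf.fit(X, labels)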

Advanced Algorithmic Approaches: Techniques like cluster-based oversampling, where the minority class is oversampled within clusters, can help in maintaining the intrinsic distribution of the minority class.
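imbalanced-learn implements this idea as KMeansSMOTE; a sketch, assuming the same features as above (the clustering parameters would need tuning for this dataset):

from imblearn.over_sampling import KMeansSMOTE

# Assumes X (TF-IDF matrix) and labels as in the ADASYN snippet below.
# The data is clustered first, and synthetic samples are only generated
# inside clusters with enough minority samples, which preserves the
# minority class's intrinsic distribution. Note: this can raise an error
# if no cluster qualifies; cluster_balance_threshold then needs tuning.
kmeans_smote = KMeansSMOTE(random_state=42, cluster_balance_threshold=0.1)
X_resampled, y_resampled = kmeans_smote.fit_resample(X, labels)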

Deep Learning-Based Techniques: For certain types of data, especially in fields like NLP and image processing, deep learning models can be trained to augment the dataset with new, synthetic instances that are diverse and realistic.
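One concrete deep-learning variant is contextual word substitution with a masked language model, e.g. via nlpaug (the choice of BERT model here is just an assumption):

import nlpaug.augmenter.word as naw

# Replace words with in-context predictions from BERT; this tends to keep
# the sentence meaning intact better than dictionary-based synonym swaps.
aug = naw.ContextualWordEmbsAug(model_path="bert-base-uncased", action="substitute")
print(aug.augment("$TSLA looks weak going into earnings"))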

TimKoornstra commented 9 months ago

GPT-3.5 or GPT-4 might be able to create quality synthetic data.
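Something like this could work with the OpenAI Python client (model name and prompt are placeholders, not tested):

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Ask the model for bearish tweets in the style of the existing dataset.
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": "Write 10 short, realistic bearish tweets about stocks, "
                   "one per line, using cashtags like $TSLA.",
    }],
)
print(response.choices[0].message.content)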

StephanAkkerman commented 9 months ago

ADASYN:

from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import ADASYN
from datasets import load_dataset

# Load dataset
dataset = load_dataset(
    "TimKoornstra/financial-tweets-sentiment",
    split="train",
    cache_dir="data/finetune/",
)

# Extract texts and labels
texts = [example["tweet"] for example in dataset]
labels = [example["sentiment"] for example in dataset]

# Vectorize texts
tfidf = TfidfVectorizer(max_features=1000)  # Adjust the number of features as needed
X = tfidf.fit_transform(texts)

# Apply ADASYN to oversample the minority (bearish) class
adasyn = ADASYN(random_state=42, n_neighbors=5, sampling_strategy="minority")
X_resampled, y_resampled = adasyn.fit_resample(X, labels)

# Map resampled vectors back to their non-zero vocabulary terms
# (word order is lost, so this does not reconstruct readable tweets)
texts_resampled = tfidf.inverse_transform(X_resampled)
print(texts_resampled)

StephanAkkerman commented 9 months ago

Synthetic oversampling method comparisons: [comparison image from the original comment not captured here]

StephanAkkerman commented 9 months ago

We could also look into SentiGAN: https://github.com/Nrgeup/SentiGAN

StephanAkkerman commented 9 months ago

Both ADASYN and SMOTE return text like this: 'amd', 'big', 'for', 'it', 'not', 'while', 'would'.

Given the limitations of converting SMOTE's synthetic samples back to text, it's often more practical in NLP tasks to use other data augmentation techniques specifically designed for text, such as paraphrasing or generating new text samples using language models.

Generating new text samples for a minority class in an NLP task, especially for balancing datasets, can be more effective with methods specifically designed for text data. Here are some approaches that are often considered better suited for textual data augmentation:

  1. Paraphrasing: Generate new sentences that convey the same meaning as existing sentences in the minority class. This can be done using:

     - Rule-based systems that change certain words or phrases while keeping the overall meaning intact.
     - Machine translation models, where you translate the text to another language and then back to the original language (back-translation; see the sketch after this list).
     - Pre-trained models like T5 or BART that can be fine-tuned for paraphrasing tasks.

  2. Using Pre-trained Language Models: Leverage large pre-trained models like GPT-3, GPT-2, or BERT for text generation. You can prompt these models with existing sentences and let them generate continuations or variations.

  3. Synonym Replacement: Replace words in sentences with their synonyms. This method keeps the sentence structure intact while slightly altering the content. Tools like NLTK or spaCy can be used to find synonyms.

  4. Random Insertion, Deletion, and Swapping: Randomly insert, delete, or swap words in a sentence. This method creates variations of the existing sentences, although it can sometimes alter the meaning or grammatical correctness.

  5. Data Augmentation Libraries: Use NLP data augmentation libraries like nlpaug or textattack, which offer a variety of text transformation techniques, including the ones mentioned above.

  6. Conditional Text Generation Models: Train a text generation model on your dataset, conditioned on the class label. This way, the model learns to generate text that is representative of a specific class.

  7. Crowdsourcing or Expert Generation: If resources permit, manually creating new text samples through crowdsourcing or expert input can be highly effective, especially for domain-specific tasks.
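As referenced in the paraphrasing item above, back-translation can be scripted with MarianMT models via transformers; a minimal sketch (the model choices are assumptions, not tested on this dataset):

from transformers import pipeline

# Round-trip English -> German -> English; the paraphrase usually keeps
# the sentiment while varying the wording.
en_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
de_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

tweet = "$AAPL looks weak, I would not be surprised by another leg down"
german = en_de(tweet)[0]["translation_text"]
paraphrase = de_en(german)[0]["translation_text"]
print(paraphrase)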

StephanAkkerman commented 9 months ago

Generate 10k rows of neutral-labeled text to balance the classes. I will start with the following methods to test performance (the first two are sketched below):

  1. Simple Random Oversampling
  2. Synonym replacement using nlpaug
  3. GAN model
  4. GPT API
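A rough sketch of the first two methods, assuming the texts and labels lists from the ADASYN snippet above:

import numpy as np
import nlpaug.augmenter.word as naw
from imblearn.over_sampling import RandomOverSampler

# 1. Simple random oversampling: duplicate minority-class rows. This works
#    on raw strings because RandomOverSampler only resamples row indices.
texts_2d = np.array(texts, dtype=object).reshape(-1, 1)
ros = RandomOverSampler(random_state=42)
texts_resampled, labels_resampled = ros.fit_resample(texts_2d, labels)

# 2. Synonym replacement with nlpaug, backed by WordNet. Cashtags like
#    $AMD are left untouched since WordNet has no entries for them.
aug = naw.SynonymAug(aug_src="wordnet")
print(aug.augment("Big drop for $AMD today, not a good sign"))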
StephanAkkerman commented 9 months ago

Research into GAN performance for balancing datasets: https://www-sciencedirect-com.proxy.library.uu.nl/science/article/pii/S1110866522000342

CatGAN (and more): https://github.com/williamSYSU/TextGAN-PyTorch

StephanAkkerman commented 9 months ago

Some good neutral examples from our dataset (quoted in the prompt in the next comment):

StephanAkkerman commented 9 months ago

Use the examples above in combination with Mixtral (available on HuggingChat) and the following prompt:

Create synthetic neutral tweets about the financial market. Use the following tweets as an example:
$TSLA is this going up, down, left, or right tomorrow 💫
Who is playing $fb ? Lol
make stock trading in the metaverse a #reality @SpeakerPelosi $FB
Should You Follow Berkshire Hathaway Into Apple Stock?
$NVDA sideways
Apache Co. $APA Given Consensus Rating of “Hold” by Analysts https://t.co/ahyp2cxFKb #stocks
$AMZN 2266 is today target
Most searched small-cap stocks, Tue Mar 30th - $WKEY $VET $UEC $SEAC $NNOX $MAXN $HMBL $GNUS $FLGT $DLPN $EYES… https://t.co/kXJS7n9ly3
$WNRS https://t.co/i14tn4QZ8P
Thank you @jack for making @twitter. It is time to move on. $TWTR
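The same prompt could also be scripted instead of run through HuggingChat, e.g. with huggingface_hub's InferenceClient (untested sketch; requires a Hugging Face API token):

from huggingface_hub import InferenceClient

# Mixtral instruct model served by the Hugging Face Inference API.
client = InferenceClient(model="mistralai/Mixtral-8x7B-Instruct-v0.1")

prompt = (
    "Create synthetic neutral tweets about the financial market. "
    "Use the following tweets as an example:\n"
    "$NVDA sideways\n"
    "$AMZN 2266 is today target\n"
    "Who is playing $fb ? Lol\n"
)
print(client.text_generation(prompt, max_new_tokens=250))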