This project aims to classify tweets related to different types of disasters using Natural Language Processing (NLP) techniques. The model is built using the Hugging Face Transformers library, specifically leveraging a pre-trained BERT model. This project aligns with Sustainable Development Goal (SDG) 13: Climate Action by enabling the classification and analysis of disaster-related content on social media, which can assist in disaster response and management.
This project addresses SDG 13 by creating a tool to predict and classify disaster-related tweets. The model helps identify the type of disaster, which can aid in disaster response and resource allocation, ultimately supporting climate action and disaster management efforts.
The primary goal is to fine-tune a pre-trained BERT model to accurately classify tweets into six disaster categories: Drought, Earthquake, Wildfire, Floods, Hurricanes, and Tornadoes. This classification can be used to improve the response to and management of disasters as they occur.
Source: The dataset consists of tweets related to various types of disasters, sourced from Kaggle.
Features:
- Tweets: The text of the tweet.
- Disaster: The type of disaster mentioned in the tweet, categorized into six labels: Drought, Earthquake, Wildfire, Floods, Hurricanes, and Tornadoes.

Dataset Mapping:
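The exact integer encoding used by the project is not spelled out here; a plausible mapping, shown purely for illustration, is:

# Illustrative label encoding; the project's actual mapping may differ.
label_map = {
    "Drought": 0,
    "Earthquake": 1,
    "Wildfire": 2,
    "Floods": 3,
    "Hurricanes": 4,
    "Tornadoes": 5,
}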
Ensure that you have the required libraries installed in your Python environment:
pip install transformers
pip install torch
pip install pandas
pip install scikit-learn
Clone the project repository to your local machine:
git clone https://github.com/yourusername/disaster-prediction.git
cd disaster-prediction
Make sure the Disaster.csv file is present in the project directory. This file contains the tweets and their corresponding disaster labels.
Before training the model, it’s important to understand the dataset through Exploratory Data Analysis (EDA):
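A few typical first steps are sketched below; the code assumes the column names Tweets and Disaster described in the dataset section.

import pandas as pd

# Load the dataset and inspect its structure.
df = pd.read_csv("Disaster.csv")
print(df.head())
print(df.info())

# Check how the tweets are distributed across the six disaster labels.
print(df["Disaster"].value_counts())

# Check for missing values before training.
print(df.isnull().sum())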
The bert-base-uncased model from the Hugging Face model hub is used due to its robustness in handling a wide range of Natural Language Processing (NLP) tasks, including text classification. The AutoTokenizer associated with BERT is used to tokenize the input tweets, ensuring that they are in a format suitable for the model. Load the pre-trained BERT model and its tokenizer using the following code:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=6)
This model will be fine-tuned to adapt it specifically to the disaster tweet classification task.
For demonstration purposes, we sampled 1000 tweets from the dataset. This sample size is manageable for running experiments while still being large enough to train a reliable model.
Split the dataset into training and testing sets:
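A sketch of the sampling and split follows, assuming the df and label_map from the previous steps; the 80/20 split ratio is an assumption.

from sklearn.model_selection import train_test_split

# Sample 1000 tweets for the demonstration run and encode the labels as integers.
sample_df = df.sample(n=1000, random_state=42)
sample_df["label"] = sample_df["Disaster"].map(label_map)

# Stratify so every disaster type appears in both the training and test sets.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    sample_df["Tweets"],
    sample_df["label"],
    test_size=0.2,
    random_state=42,
    stratify=sample_df["label"],
)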
The tweets are tokenized using the BERT tokenizer, which converts the raw text into tokens that the model can process:
train_encodings = tokenizer(train_texts.tolist(), truncation=True, padding=True)
test_encodings = tokenizer(test_texts.tolist(), truncation=True, padding=True)
Tokenization includes truncating the text to a specific length and padding shorter texts to ensure uniform input size.
Fine-tuning is done by setting specific training parameters:
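The exact values are not listed in this section; an illustrative setup, in which the epoch count and batch size are assumptions and the tokenized encodings from the previous step are wrapped in a DataLoader, might look like this:

import torch
from torch.utils.data import TensorDataset, DataLoader

# Illustrative training parameters; only the 5e-5 learning rate is confirmed below.
num_epochs = 3
batch_size = 16

# Build a DataLoader from the tokenized tweets and their integer labels.
train_dataset = TensorDataset(
    torch.tensor(train_encodings["input_ids"]),
    torch.tensor(train_encodings["attention_mask"]),
    torch.tensor(train_labels.tolist()),
)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)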
The model is trained using a custom loop, which optimizes the model's parameters to minimize the loss function:
from torch.optim import AdamW  # PyTorch's AdamW; the AdamW shipped with transformers is deprecated

optimizer = AdamW(model.parameters(), lr=5e-5)
# The training loop itself follows below.
The AdamW optimizer is used as it’s well-suited for fine-tuning transformer models like BERT.
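A minimal version of such a loop, assuming the train_loader, num_epochs, and optimizer defined above, could look like this; it is a sketch, not the project's exact implementation.

model.train()

for epoch in range(num_epochs):
    total_loss = 0.0
    for input_ids, attention_mask, labels in train_loader:
        optimizer.zero_grad()
        # Passing labels makes the model return the cross-entropy loss directly.
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        outputs.loss.backward()
        optimizer.step()
        total_loss += outputs.loss.item()
    print(f"Epoch {epoch + 1}: average loss {total_loss / len(train_loader):.4f}")

# For GPU training, move the model and each batch to the same device with .to(device).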
The model’s performance is evaluated on the test set using various metrics:
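The exact metric values are not reproduced here; a sketch of how accuracy and a weighted F1-score could be computed with scikit-learn, reusing the test encodings and labels from the previous steps, is:

from sklearn.metrics import accuracy_score, classification_report, f1_score

model.eval()
with torch.no_grad():
    outputs = model(
        input_ids=torch.tensor(test_encodings["input_ids"]),
        attention_mask=torch.tensor(test_encodings["attention_mask"]),
    )
preds = torch.argmax(outputs.logits, dim=-1).numpy()

print("Accuracy:", accuracy_score(test_labels, preds))
print("Weighted F1:", f1_score(test_labels, preds, average="weighted"))
print(classification_report(test_labels, preds))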
The model’s performance before and after fine-tuning is compared to assess the effectiveness of the fine-tuning process. This comparison helps to understand how much the model has improved after being adapted specifically to the disaster tweet classification task.
After fine-tuning, the model can be used to predict the type of disaster mentioned in new, unseen tweets:
new_texts = ["The smoke from the wildfire is affecting air quality in nearby cities."]
new_encodings = tokenizer(new_texts, truncation=True, padding=True, return_tensors="pt")
with torch.no_grad():
outputs = model(**new_encodings)
predictions = torch.argmax(outputs.logits, dim=-1)
print(predictions.item()) # Outputs the predicted disaster type as an integer
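To turn that integer back into a disaster name, it can be looked up against the inverse of the illustrative label mapping shown earlier:

# Invert the illustrative label_map from the dataset section.
id_to_label = {v: k for k, v in label_map.items()}
print(id_to_label[predictions.item()])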
This example demonstrates how the model can be applied to real-world scenarios by analyzing new tweets and classifying them into disaster categories.
The high accuracy and F1-score indicate that the model performs exceptionally well in classifying disaster-related tweets.
The model was tested on various disaster-related texts and correctly identified the disaster types, showcasing its potential utility in disaster response and management.
This project successfully demonstrates the application of Hugging Face Transformers to fine-tune a BERT model for disaster tweet classification. The model achieved high accuracy, making it a valuable tool in the context of disaster response. The project also highlights the potential of AI and NLP in contributing to SDG 13: Climate Action, providing a practical example of how technology can aid in managing and responding to disasters.
The application is deployed using Streamlit, providing an interactive interface where users can input tweets and receive predictions on the type of disaster mentioned in the tweet.
To run the Streamlit application, use the following command in your terminal:
streamlit run app.py
This command will start a local server where you can interact with the disaster tweet classification model through a user-friendly interface.
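The repository's app.py is not reproduced here; a minimal sketch of what such an interface could look like follows, assuming the fine-tuned model and tokenizer were saved with save_pretrained to a hypothetical fine_tuned_model/ directory and that the label order matches the illustrative mapping above.

# app.py -- illustrative sketch, not the project's actual implementation.
import streamlit as st
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed label order; must match the encoding used during fine-tuning.
LABELS = ["Drought", "Earthquake", "Wildfire", "Floods", "Hurricanes", "Tornadoes"]
MODEL_DIR = "fine_tuned_model"  # hypothetical directory created with save_pretrained()

@st.cache_resource
def load_model():
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
    model.eval()
    return tokenizer, model

tokenizer, model = load_model()

st.title("Disaster Tweet Classifier")
tweet = st.text_area("Paste a tweet to classify:")

if st.button("Classify") and tweet.strip():
    encodings = tokenizer([tweet], truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**encodings).logits
    st.success(f"Predicted disaster type: {LABELS[int(torch.argmax(logits, dim=-1))]}")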