LLM-based Hazard Regularization using GNNs

Objective: In the previous dataset, severity levels were directly estimated using a Large Language Model (LLM). While the LLM performs well in distinguishing cases with extreme severity values, it produces less accurate results in situations where semantic overlap occurs between events.

To improve accuracy, we propose using only the events with extreme severity values (where the LLM is more reliable) and leveraging the relationships between events to create a more regularized dataset. After this process, we will filter for high-severity events only and use them as input for CLIMADA-BR. This approach aims to enhance data quality and ensure that only the most critical events are considered in climate impact analysis.

Use the dengue dataset to obtain a regularized event dataset by constructing a graph and applying a Graph Convolutional Network (GCN). This process will use SBERT for feature extraction, NetworkX and Scikit-Learn for graph construction, and a GCN to classify events with intermediate severity levels.

Steps

Feature Extraction with SBERT:

Load the dengue CSV dataset from the previous step.
For each event, generate a feature vector using Sentence-BERT (SBERT).

Example code:

from sentence_transformers import SentenceTransformer
import pandas as pd

# Load dengue data
df = pd.read_csv('dengue_data.csv')

# Initialize SBERT model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate feature vectors for all events in one batched call
# (faster than calling model.encode once per row)
embeddings = model.encode(df['event_description'].tolist())
df['features'] = list(embeddings)
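The graph construction step compares these vectors with `metric='cosine'`. As a minimal sketch of what that distance measures, the snippet below uses made-up 3-dimensional vectors in place of real SBERT embeddings (`all-MiniLM-L6-v2` produces 384-dimensional vectors); the variable names are illustrative only.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical toy vectors standing in for SBERT embeddings
outbreak_a = [0.9, 0.1, 0.0]
outbreak_b = [0.8, 0.2, 0.1]
unrelated = [0.0, 0.1, 0.9]

d_similar = cosine_distance(outbreak_a, outbreak_b)
d_different = cosine_distance(outbreak_a, unrelated)
print(f"similar: {d_similar:.3f}, different: {d_different:.3f}")
```

Events with similar descriptions should end up with a small distance and therefore be more likely to be linked in the graph.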

Graph Construction:

Create a graph using NetworkX, where each node is an event with an SBERT-generated feature vector.
Connect nodes based on feature similarity using Scikit-Learn’s NearestNeighbors.

Example code:

import networkx as nx
from sklearn.neighbors import NearestNeighbors
import numpy as np

# Convert feature vectors to numpy array
X = np.vstack(df['features'].values)

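# Use Nearest Neighbors to find connections
# (each event is returned as its own nearest neighbor, so n_neighbors=5 yields 4 others)
nn_model = NearestNeighbors(n_neighbors=5, metric='cosine').fit(X)
distances, indices = nn_model.kneighbors(X)

# Build the graph; add every event as a node first so node order matches the DataFrame rows
G = nx.Graph()
G.add_nodes_from(range(len(X)))
for idx, neighbor_ids in enumerate(indices):
    for neighbor in neighbor_ids:
        if idx != neighbor:  # avoid self-loops
            G.add_edge(idx, int(neighbor))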
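Because `nx.Graph` is undirected, an edge is kept whenever either event lists the other among its nearest neighbors. A GCN can only pass information along edges, so it may be worth checking how many connected components the graph has. The sketch below does this in plain Python on a hypothetical `toy_indices` array (not real `kneighbors` output).

```python
from collections import defaultdict, deque

# Hypothetical kneighbors output for 6 events with n_neighbors=2 (self listed first)
toy_indices = [[0, 1], [1, 0], [2, 1], [3, 4], [4, 3], [5, 4]]

# Undirected adjacency sets, mirroring what nx.Graph.add_edge would store
adj = defaultdict(set)
for idx, row in enumerate(toy_indices):
    adj[idx]  # make sure every event appears as a node
    for nb in row:
        if nb != idx:  # avoid self-loops
            adj[idx].add(nb)
            adj[nb].add(idx)

def connected_components(adj):
    """Return a list of node sets, one per connected component (BFS)."""
    seen, components = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        queue, comp = deque([start]), set()
        seen.add(start)
        while queue:
            node = queue.popleft()
            comp.add(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        components.append(comp)
    return components

components = connected_components(adj)
print(len(components), components)
```

A component that contains no high- or low-severity seed event receives no label information through the graph, so its predictions rest on the features alone.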

Label Assignment:

Label nodes based on the severity of each event:
- Positive Label: High severity values.
- Negative Label: Low severity values.
- No Label: Intermediate severity values.

Example code:

# Define thresholds for positive and negative labels
# (the 0.9 and 0.1 quantiles are example values; adjust as needed)
high_severity_threshold = df['severity'].quantile(0.9)
low_severity_threshold = df['severity'].quantile(0.1)

labels = {}
for idx, severity in enumerate(df['severity']):
    if severity >= high_severity_threshold:
        labels[idx] = 1  # Positive
    elif severity <= low_severity_threshold:
        labels[idx] = -1  # Negative
    else:
        labels[idx] = 0  # No label
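Before training, it can help to count how many events fall into each group, since tied severity scores can push more (or fewer) events past a quantile threshold than the nominal 10%. The sketch below applies the same rule to hypothetical severity scores using only the standard library; `severities` and `assign_label` are illustrative names, not part of the pipeline above.

```python
from collections import Counter
import statistics

# Hypothetical LLM severity scores for 20 events
severities = [1, 2, 2, 3, 3, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10]

# First and last decile cut points, using linear interpolation between data points
deciles = statistics.quantiles(severities, n=10, method="inclusive")
low_thr, high_thr = deciles[0], deciles[-1]

def assign_label(severity, low_thr, high_thr):
    """Return 1 (high), -1 (low) or 0 (intermediate, left unlabeled)."""
    if severity >= high_thr:
        return 1
    if severity <= low_thr:
        return -1
    return 0

toy_labels = [assign_label(s, low_thr, high_thr) for s in severities]
counts = Counter(toy_labels)
print(counts)

# Both seed classes must be present for the classifier to learn anything
if counts[1] == 0 or counts[-1] == 0:
    raise ValueError("need at least one positive and one negative seed event")
```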

GCN Classification:

Apply a GCN classifier to predict labels for intermediate nodes.

Example code using Scikit-Network's GCN classifier:

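from scipy import sparse
from sknetwork.gnn import GNNClassifier

# Adjacency matrix with rows ordered like the DataFrame rows
adjacency = sparse.csr_matrix(nx.adjacency_matrix(G, nodelist=range(len(df))))

# scikit-network treats negative labels as unknown, so remap:
# low severity (-1) -> class 0, high severity (1) -> class 1, no label (0) -> -1
labels_train = np.array([{1: 1, -1: 0, 0: -1}[labels[i]] for i in range(len(df))])

# Initialize and train the GCN (hidden size and epochs are example values); see
# https://scikit-network.readthedocs.io/en/master/tutorials/gnn/gnn_classifier.html
gcn = GNNClassifier(dims=[16, 2], layer_types='Conv', activations='ReLu', verbose=False)
predicted = gcn.fit_predict(adjacency, X, labels_train, n_epochs=100, random_state=42)

# Map predictions back to the -1 / 1 convention used above
df['gcn_label'] = np.where(predicted == 1, 1, -1)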
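For reference, the layer-wise propagation rule of a GCN as introduced by Kipf and Welling is given below. Here $A$ is the adjacency matrix of the event graph, $H^{(0)} = X$ is the SBERT feature matrix, $W^{(l)}$ are learned weights, and $\sigma$ is a nonlinearity. Whether scikit-network's convolution layer applies exactly this normalization is an assumption that should be checked against its documentation.

```latex
\tilde{A} = A + I, \qquad
\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}, \qquad
H^{(l+1)} = \sigma\!\left( \tilde{D}^{-1/2} \, \tilde{A} \, \tilde{D}^{-1/2} \, H^{(l)} \, W^{(l)} \right)
```

Each layer mixes a node's features with those of its neighbors, which is how the high- and low-severity seed events influence the prediction for intermediate-severity events.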

Event Filtering:
- Remove events that the GCN classified as negative (low severity), so that only high-severity events are passed to CLIMADA-BR.
- Example code:
```
# Filter data based on GCN classification
filtered_df = df[df['gcn_label'] != -1]
```
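One rough consistency check before filtering is to see how often the GCN output agrees with the seed labels it was trained on. This is not a held-out evaluation, only a sanity check; the lists below are hypothetical values, and `seed_agreement` is an illustrative helper name.

```python
# Hypothetical seed labels (1 = high, -1 = low, 0 = unlabeled) and GCN outputs (1 or -1)
seed_labels = [1, -1, 0, 0, 1, -1, 0, -1]
gcn_labels = [1, -1, 1, -1, 1, 1, -1, -1]

def seed_agreement(seed_labels, gcn_labels):
    """Fraction of seed (labeled) events whose GCN label matches the seed label."""
    pairs = [(s, g) for s, g in zip(seed_labels, gcn_labels) if s != 0]
    if not pairs:
        raise ValueError("no seed labels to compare against")
    return sum(s == g for s, g in pairs) / len(pairs)

agreement = seed_agreement(seed_labels, gcn_labels)

# Same rule as the filter above: keep everything not classified as negative
kept = [i for i, g in enumerate(gcn_labels) if g != -1]
print(f"seed agreement: {agreement:.2f}, kept events: {kept}")
```

A low agreement value would suggest that the graph or the thresholds need adjusting before the filtered events are used downstream.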

Expected Outcomes

Refined severity labeling for dengue events for CLIMADA-BR

Labic-ICMC-USP / CLIMADA-BR

LLM-based Hazard Regularization using GNNs #11
