The PAN dataset is highly imbalanced: about 3% of its messages are predatory and 97% are non-predatory.
With so few positive examples, a model trained on the raw data tends to favor the majority class and detects predatory messages poorly. We therefore need an oversampling method such as SMOTE to generate additional synthetic examples of the predatory class.
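To make the oversampling step concrete, the following is a minimal sketch of the core SMOTE idea: each synthetic point is an interpolation between a minority sample and one of its nearest minority neighbours. It is a simplified illustration, not the pipeline used on PAN. The function name `smote`, its parameters, and the 2-D toy feature vectors are assumptions made for the example; real messages would first have to be converted to numeric features.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Simplified SMOTE: create n_synthetic points by interpolating between
    minority samples and their k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than samples

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k minority samples closest to `base` (excluding itself).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy data (hypothetical feature vectors): 3 "predatory" and 97 "non-predatory"
# samples, mirroring the roughly 3% / 97% split described above.
data_rng = random.Random(42)
predatory = [[data_rng.gauss(2.0, 0.3), data_rng.gauss(2.0, 0.3)] for _ in range(3)]
non_predatory = [[data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)] for _ in range(97)]

# Generate enough synthetic minority points to match the majority class size.
new_points = smote(predatory, n_synthetic=len(non_predatory) - len(predatory), k=2)
balanced_predatory = predatory + new_points
print(len(balanced_predatory), len(non_predatory))
```

In practice one would more likely rely on an existing implementation, such as the `SMOTE` class in the imbalanced-learn package, and apply it to the training split only, so that the evaluation data keeps the original class distribution.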