[Question]: Issue with NER Model: BIO-format Labels Not Recognized

SPVillacorta commented 1 year ago

Question

Hi Flair Community, I'm attempting to train a NER model using Flair, but my BIO-formatted labels are not recognised. I've converted my CSV annotations to CoNLL format and checked that they load correctly. This is the code I tried to use:

# Imports and other setup
import flair
import glob
import nltk
import os
import pandas as pd
import pdfplumber
from flair.data import Sentence, Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

nltk.download("punkt")

MODEL_DIR = "./model"
DATA_DIR = "./data"
PDF_DIR = "./pdfs"

# Function to convert CSV to CoNLL
def csv_to_conll(csv_file, conll_file):
    df = pd.read_csv(csv_file)

    with open(conll_file, 'w') as f:
        for index, row in df.iterrows():
            # Check if the row is entirely composed of NaN values
            if pd.isna(row['text']) and pd.isna(row['label']):
                f.write("\n")
                continue

            word = row['text']
            label = row['label']

            # This checks if either 'text' or 'label' is NaN, and skips that row with a warning
            if pd.isna(word) or pd.isna(label):
                print(f"Warning: Skipping row {index} due to NaN value.")
                continue

            f.write(f"{word}\t{label}\n")

# Convert CSV files to CoNLL format
csv_to_conll(f"{DATA_DIR}/train.csv", f"{DATA_DIR}/train.conll")
csv_to_conll(f"{DATA_DIR}/dev.csv", f"{DATA_DIR}/dev.conll")
csv_to_conll(f"{DATA_DIR}/test.csv", f"{DATA_DIR}/test.conll")

# Function to convert PDF to CoNLL
def pdf_to_conll(pdf_dir: str, data_dir: str):
    pdf_paths = glob.glob(os.path.join(pdf_dir, "*.pdf"))
    texts = []

    for pdf_path in pdf_paths:
        with pdfplumber.open(pdf_path) as pdf:
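            # extract_text() can return None for pages without a text layer
            # (depending on the pdfplumber version), so fall back to "".
            text = "\n".join([(page.extract_text() or "") for page in pdf.pages])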
            texts.append(text)

    with open(os.path.join(data_dir, "pdfs.conll"), "w") as f:
        for text in texts:
            sentences = nltk.sent_tokenize(text)
            for sentence in sentences:
                sentence = sentence.replace("\n", " ").replace("\t", " ")
                f.write(f"{sentence}\tO\n")
            f.write("\n")
    return texts

# Function to train the model
def train(data_dir: str, model_dir: str):
    pdf_to_conll(PDF_DIR, DATA_DIR)

    # Assuming CoNLL formatted CSV files are named as train.conll, dev.conll, test.conll
    columns = {0: 'text', 1: 'ner'}
    corpus: Corpus = ColumnCorpus(data_dir, columns,
                                  train_file='train.conll',
                                  dev_file='dev.conll',
                                  test_file='test.conll')

    label_type = 'ner'
    tag_dictionary = corpus.make_label_dictionary(label_type=label_type)

    embeddings: StackedEmbeddings = StackedEmbeddings(
        [
            FlairEmbeddings("mix-forward"),
            FlairEmbeddings("mix-backward"),
        ]
    )

    tagger: SequenceTagger = SequenceTagger(
        hidden_size=256,
        embeddings=embeddings,
        tag_dictionary=tag_dictionary,
        tag_type=label_type,
        use_crf=True,
    )

    trainer: ModelTrainer = ModelTrainer(tagger, corpus)
    trainer.train(
        model_dir,
        learning_rate=0.2,
        mini_batch_size=30,
        max_epochs=100,
    )

# Call your train function
train(DATA_DIR, MODEL_DIR)

When I run this, the F-score, precision, and recall are all zero. Any ideas on what could be going wrong?

nvenkat94 commented 1 year ago

I'm having the same issue.

alanakbik commented 1 year ago

Sorry for the late reply! @SPVillacorta did you solve the problem? If not, could you share a snippet of the dataset you are loading?

@nvenkat94 could you expand on your problem?

SPVillacorta commented 1 year ago

OK, "train.conll" looks like the following:

matching O
i.e. O
presumably O
from O
Mamba O
These O
since O
prospectivity I-PROCESS
fibrous O
base O
ore O
the O
20 O
based O
Andy O
simply O
martite B-MINERAL
Bungaroo B-PLACE
The O
on O
between O
250 O
the O
The O
the O
below O
are O
virtually O
oxides O
skin O
Gole O
to O
all O
published O
southern O
deposits B-ORE_DEPOSIT
sorted O
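A quick way to sanity-check a file like this before training is to count its sentences and tags. The following is only a rough sketch: `summarize_column_data` is a hypothetical helper (not part of Flair), and it assumes two whitespace-separated columns with blank lines separating sentences.

```python
from collections import Counter
from typing import Iterable, Tuple


def summarize_column_data(lines: Iterable[str]) -> Tuple[int, Counter]:
    """Return (number of sentences, tag counts) for token/tag lines."""
    sentence_count = 0
    in_sentence = False
    tag_counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            in_sentence = False
            continue
        if not in_sentence:
            sentence_count += 1
            in_sentence = True
        # The tag is assumed to be the last whitespace-separated column.
        tag_counts[line.split()[-1]] += 1
    return sentence_count, tag_counts


# Small illustrative sample; in practice, pass an open file handle instead.
sample = ["martite B-MINERAL", "Bungaroo B-PLACE", "", "deposits B-ORE_DEPOSIT", "sorted O"]
n_sentences, counts = summarize_column_data(sample)
print(n_sentences, dict(counts))
```

If the sentence count comes out as 1 for a large file, or an entity type only ever appears with an I- prefix, the input data is worth a closer look before training.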

nvenkat94 commented 1 year ago

Thanks for your valuable response, @alanakbik. My issue has been fixed. Earlier, my data had an "O" directly before an "I-" tag; after I revised the input data, the issue went away. @SPVillacorta, your input data has the same problem with the "I-" tag. If there is an "I-" tag, the previous tag should be a "B-" (or another "I-") of the same entity type.

Tag details: B- = beginning, I- = inside, O = outside

Your data should be in the following format:

`matching O
i.e. O
presumably O
from O
Mamba O
These O
since O
prospectivity B-PROCESS
fibrous O
base O
ore O
the O
20 O
based O
Andy O
simply O
martite B-MINERAL
Bungaroo B-PLACE
`
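To find such tags automatically, something along these lines might help. It is a minimal sketch, not a Flair function: `find_iob2_violations` is a hypothetical helper that takes the tags of a single sentence and reports every "I-" tag that does not directly follow a "B-" or "I-" tag of the same type.

```python
from typing import List, Tuple


def find_iob2_violations(tags: List[str]) -> List[Tuple[int, str]]:
    """Return (index, tag) for every I- tag that does not continue an entity."""
    violations = []
    previous = "O"  # a sentence start behaves like a preceding "O"
    for index, tag in enumerate(tags):
        if tag.startswith("I-"):
            entity_type = tag[2:]
            # Valid only directly after B-X or I-X with the same type X.
            if previous not in (f"B-{entity_type}", f"I-{entity_type}"):
                violations.append((index, tag))
        previous = tag
    return violations


tags = ["O", "O", "I-PROCESS", "O", "B-MINERAL", "B-PLACE"]
print(find_iob2_violations(tags))  # [(2, 'I-PROCESS')]
```

Run it once per sentence (not over the whole file), since an entity cannot continue across a sentence boundary.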

alanakbik commented 1 year ago

Thanks for sharing the info! Yes, in IOB2 the first tag of each entity should be a B-. @SPVillacorta does this fix your issue?
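If the annotations cannot easily be redone by hand, the tags could also be rewritten programmatically. The sketch below is an assumption about how one might do that, not something Flair provides: the hypothetical `to_iob2` helper turns any "I-" tag that does not continue an entity of the same type into a "B-" tag, and leaves everything else unchanged.

```python
from typing import List


def to_iob2(tags: List[str]) -> List[str]:
    """Rewrite one sentence's tags so every entity starts with a B- tag."""
    fixed = []
    previous = "O"  # a sentence start behaves like a preceding "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity_type = tag[2:]
            # An I- tag that does not continue the same entity starts a new one.
            if previous not in (f"B-{entity_type}", f"I-{entity_type}"):
                tag = f"B-{entity_type}"
        fixed.append(tag)
        previous = tag  # compare the next tag against the corrected one
    return fixed


print(to_iob2(["O", "I-PROCESS", "I-PROCESS", "O", "B-MINERAL"]))
# ['O', 'B-PROCESS', 'I-PROCESS', 'O', 'B-MINERAL']
```

This only repairs the tag scheme; it cannot tell whether the original annotation itself was correct, so spot-checking the result is still advisable.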

adambuttrick commented 9 months ago

I just ran into this issue attempting to load training data like so, based on an example I found elsewhere:

from flair.data import Corpus
from flair.datasets import ColumnCorpus
import torch

columns = {0: 'text', 1: 'ner'}
tag_type = 'ner'
corpus = ColumnCorpus('/content/drive/MyDrive/training_data/flair/', columns)
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary)

I then noticed the deprecation message about make_tag_dictionary being replaced with make_label_dictionary, and so switched to:

tag_dictionary = corpus.make_label_dictionary(label_type=tag_type)

...at which point the data loaded successfully.

The deprecation message makes it seem as if the deprecated make_tag_dictionary still works, especially if you don't check the tag dictionary itself, but it does not appear to. Just commenting to flag this in case anyone else comes across the issue and is looking to resolve it.

flairNLP / flair

[Question]: Issue with NER Model: BIO-format Labels Not Recognized #3317
