HazyResearch / bootleg

Self-Supervision for Named Entity Disambiguation at the Tail
http://hazyresearch.stanford.edu/bootleg
Apache License 2.0

Is there any way to replace the current NER? #107

Closed (coolcoder001 closed this issue 2 years ago)

coolcoder001 commented 2 years ago

Hi, thanks a lot for the project. It is indeed wonderful.

However, I would like to replace the NER engine. I want to use Flair instead of spaCy.

Can I do that?

lorr1 commented 2 years ago

Hi!

Yes, you can do this. I have a list of possible extractors here. If you want to implement your own extractor function and add it there, you should be able to trigger it being used via this argument here.

As long as you have the same inputs/outputs, it should be possible.
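
The pattern described above, a table of named extractor functions selected by an argument, can be sketched roughly as follows. This is only an illustration and not Bootleg's actual code: the names `EXTRACTORS`, `capitalized_word_extractor`, and `extract_mentions`, as well as the `(mention, start, end)` return shape, are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical span type: (mention text, char start, char end).
Span = Tuple[str, int, int]
Extractor = Callable[[str], List[Span]]


def capitalized_word_extractor(text: str) -> List[Span]:
    """Toy extractor: treat every capitalized word as a mention."""
    spans = []
    pos = 0
    for word in text.split():
        # Search from the end of the previous word to get correct offsets.
        start = text.index(word, pos)
        end = start + len(word)
        pos = end
        if word[0].isupper():
            spans.append((word, start, end))
    return spans


# Hypothetical registry mapping an option name to an extractor function.
EXTRACTORS: Dict[str, Extractor] = {
    "capitalized": capitalized_word_extractor,
}


def extract_mentions(text: str, method: str = "capitalized") -> List[Span]:
    """Look up the extractor by name and run it on the text."""
    if method not in EXTRACTORS:
        raise ValueError(f"Unknown extractor {method!r}; options: {sorted(EXTRACTORS)}")
    return EXTRACTORS[method](text)
```

In a design like this, a new extractor only needs to accept and return the same types as the existing ones and be registered under a new key.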

coolcoder001 commented 2 years ago

Hi, thanks a lot for the quick response. :) My extractor function, which uses Flair, takes a string as input and returns the extracted entities in a pandas DataFrame.

def entity_recognition(text):
    """Given a text document, run NER on it using Flair and return a DataFrame with the following columns
    text: actual raw text input
    entity: identified entity text
    entity_start: character start position of entity in raw text
    entity_end: character end position of entity in raw text
    """
    import pandas as pd
    from flair.data import Sentence
    from flair.models import SequenceTagger
    from tqdm import tqdm  # was used below but never imported

    # Only these entity types are kept; set membership replaces the
    # fragile substring checks such as `label in 'ORG'`.
    keep_labels = {'ORG', 'PERSON', 'GPE'}

    # Note: the tagger is loaded on every call, which is slow if this
    # function is called repeatedly.
    tagger_fast = SequenceTagger.load('ner-ontonotes-fast')
    sentence = Sentence(text)
    tagger_fast.predict(sentence, mini_batch_size=16)

    # Convert the tagged sentence once instead of once per field access.
    found = sentence.to_dict(tag_type='ner')['entities']
    entities = []
    for ent in tqdm(found):
        # As in the original code, the first token of str(label) is taken
        # as the label name.
        label = str(ent['labels'][0]).split()[0]
        if label in keep_labels:
            entities.append([str(ent['text']), ent['start_pos'], ent['end_pos']])

    entities = pd.DataFrame(entities, columns=['entity', 'entity_start', 'entity_end'])
    entities['text'] = text
    return entities

Can you please help me with the changes I need to make to this function so that it works with Bootleg?

Thanks in advance.

lorr1 commented 2 years ago

So I went ahead and added your function as an example in the branch here. If you use the annotator with the extract method set to custom, it should trigger your extractor. I haven't tested it, but it should get you started.
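
As a rough illustration of the kind of adaptation a custom extractor might need, the sketch below converts character offsets (like the `entity_start` and `entity_end` columns above) into word-index spans. This thread does not show the output format Bootleg actually expects, so the word-index, end-exclusive span format is an assumption, and `char_to_word_spans` is a hypothetical helper name.

```python
from typing import List, Tuple


def char_to_word_spans(
    text: str, char_spans: List[Tuple[int, int]]
) -> List[Tuple[int, int]]:
    """Map (char_start, char_end) offsets to end-exclusive
    (word_start, word_end) indices over whitespace-split words."""
    # Record the character range of each whitespace-separated word.
    word_ranges = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        pos = start + len(word)
        word_ranges.append((start, pos))

    word_spans = []
    for c_start, c_end in char_spans:
        # Indices of all words that overlap the character span.
        hits = [
            i
            for i, (w_start, w_end) in enumerate(word_ranges)
            if w_start < c_end and w_end > c_start
        ]
        # Spans that overlap no word (assumed to be bad offsets) are skipped.
        if hits:
            word_spans.append((hits[0], hits[-1] + 1))
    return word_spans
```

For example, `char_to_word_spans("Barack Obama visited Paris .", [(0, 12), (21, 26)])` maps the two character spans to the word spans `(0, 2)` and `(3, 4)`.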

coolcoder001 commented 2 years ago

Hi @lorr1, thanks a lot for your help. You are so nice and awesome :)

I am able to run this code using the Flair NER engine.

However, if I need to make more changes, can I push them directly to the branch you created, or do I need to raise a PR?

lorr1 commented 2 years ago

How about you raise PRs? I'll pretty much approve everything, but I'd like to keep track of what you're finding difficult/useful to implement.

Thanks!