NorskRegnesentral / skweak

skweak: A software toolkit for weak supervision applied to NLP tasks
MIT License

Converting .spacy files to conll format to train other models on it. #64

Closed by Akshay0799 2 years ago

Akshay0799 commented 2 years ago

Once I fit the aggregation model on the data, I used skweak's function to write the result to a DocBin file, which is saved as a .spacy file. How do I convert this into a standard CoNLL-format file? Are there any libraries or tools that can do that?

plison commented 2 years ago

I haven't tested it myself, but this library appears to do that: https://spacy.io/universe/project/spacy-conll

Akshay0799 commented 2 years ago

I'm getting a Unicode error when I use that library to convert the DocBin file to a CoNLL one.

azucker99 commented 2 years ago

@Akshay0799 it's pretty easy to do this manually. Here's what you can do to write the docs from a .spacy file to a .conll file:

import spacy
from spacy.tokens import DocBin

# Loading the docs from the Docbin object
nlp = spacy.blank('en')
doc_bin = DocBin().from_disk('path/to/spacy_file')
docs = doc_bin.get_docs(nlp.vocab)

# Lists to hold the tokens and tags
token_list = []
tag_list = []

# Adding the tokens and tags to the list
for doc in docs:
    for idx, token in enumerate(doc):
        if token.ent_iob_ != "O":
            if idx == 0 or doc[idx-1].ent_iob_ == 'O':
                tag_list.append("B-" + token.ent_type_)
            else:
                tag_list.append("I-"+token.ent_type_)
        else:
            tag_list.append("O")
        token_list.append(token.text)

# Writing the tokens and tags to a conll file
with open('data.conll', 'w', encoding='utf-8') as f:
    for token, tag in zip(token_list, tag_list):
        # print() already ends the line, so no extra '\r' is needed
        print(token + '\t' + tag, file=f)

Hope this was helpful.

tejacyentia commented 2 years ago

@azucker99 There's a minor issue in your code. In the section where you say

if idx == 0 or doc[idx-1].ent_iob_ == 'O':
    tag_list.append("B-" + token.ent_type_)

you want to say

if idx == 0 or doc[idx-1].ent_iob_ == 'O' or doc[idx-1].ent_type_ != doc[idx].ent_type_:
    tag_list.append("B-" + token.ent_type_)

Two entities with different tags can appear right after one another, so the second one has to start with its own B- tag. I know because I made that error when I wrote some very similar code.
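For illustration, here is a minimal sketch of tagging that handles back-to-back entities. It assumes each token is given as a (text, iob, ent_type) tuple, where iob is "B", "I", "O" or "" as spaCy's token.ent_iob_ reports it. The helper name to_conll_lines and the tuple layout are hypothetical, not part of skweak or spaCy. It starts a new entity whenever the token is marked "B" or the entity type changes, and it adds a blank line after each document, which is an assumption about what the downstream CoNLL reader expects.

```python
from typing import Iterable, List, Tuple

# One token as (text, iob, ent_type). With spaCy these values would come
# from token.text, token.ent_iob_ and token.ent_type_ (hypothetical layout).
TokenInfo = Tuple[str, str, str]


def to_conll_lines(docs: Iterable[Iterable[TokenInfo]]) -> List[str]:
    """Return 'token<TAB>tag' lines, with an empty line after each doc."""
    lines: List[str] = []
    for doc in docs:
        prev_type = ""  # entity type of the previous token, "" if none
        for text, iob, ent_type in doc:
            if iob not in ("B", "I") or not ent_type:
                # Outside any entity, or no annotation set on this token
                tag = "O"
                prev_type = ""
            elif iob == "B" or ent_type != prev_type:
                # Explicit entity start, or the entity type just changed
                tag = "B-" + ent_type
                prev_type = ent_type
            else:
                # Continuation of the entity started on an earlier token
                tag = "I-" + ent_type
            lines.append(f"{text}\t{tag}")
        lines.append("")  # blank line marks the document boundary
    return lines


if __name__ == "__main__":
    sample = [[
        ("John", "B", "PERSON"), ("Smith", "I", "PERSON"),
        ("Oslo", "B", "GPE"), ("rocks", "O", ""),
    ]]
    print("\n".join(to_conll_lines(sample)))
```

With spaCy docs, the input could be built as [(t.text, t.ent_iob_, t.ent_type_) for t in doc] for each doc, and the returned lines joined with "\n" and written to a file opened with encoding='utf-8'.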

@Akshay0799 Please take note of this if you decide to use @azucker99's code.

Akshay0799 commented 2 years ago

Thank you