Closed Akshay0799 closed 2 years ago
I haven't tested it myself, but it seems that this is what this library does: https://spacy.io/universe/project/spacy-conll#gatsby-noscript
I'm getting a Unicode error when I use that library to convert the DocBin file to a CoNLL one.
@Akshay0799 it's pretty easy to do this manually. Here's what you can do to write docs from a .spacy to .conll:
import spacy
from spacy.tokens import DocBin

# Loading the docs from the Docbin object
nlp = spacy.blank('en')
doc_bin = DocBin().from_disk('path/to/spacy_file')
docs = doc_bin.get_docs(nlp.vocab)

# Lists to hold the tokens and tags
token_list = []
tag_list = []

# Adding the tokens and tags to the list
for doc in docs:
    for idx, token in enumerate(doc):
        if token.ent_iob_ != "O":
            if idx == 0 or doc[idx-1].ent_iob_ == 'O':
                tag_list.append("B-" + token.ent_type_)
            else:
                tag_list.append("I-"+token.ent_type_)
        else:
            tag_list.append("O")
        token_list.append(token.text)

# Writing the tokens and tags to a conll file
with open('data.conll', 'w') as f:
    for token, tag in zip(token_list,tag_list):
        print(token+'\t'+tag+'\r', file = f)
Hope this was helpful.
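A side note on the writing step, as a rough sketch rather than a definitive fix: print already ends each line with a newline, so the extra '\r' is not needed, and opening the file with an explicit UTF-8 encoding avoids depending on the platform default. The sketch also writes a blank line after each sentence or document, which is the separator used in CoNLL-2003 style files. The helper name write_conll and the sample data are hypothetical; the (token, tag) pairs are assumed to come from a loop like the one above.

```python
import os
import tempfile


def write_conll(sentences, path):
    """Write tab-separated (token, tag) pairs, one per line.

    `sentences` is a list of lists of (token, tag) tuples; a blank line
    is written after each inner list as a separator.
    """
    # Explicit encoding so non-ASCII tokens do not depend on the OS default.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for sentence in sentences:
            for token, tag in sentence:
                f.write(f"{token}\t{tag}\n")
            f.write("\n")


# Hypothetical sample data standing in for tokens and tags taken from the docs.
sample = [
    [("Zürich", "B-GPE"), ("is", "O"), ("nice", "O")],
    [("Hello", "O")],
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "sample.conll")
    write_conll(sample, out_path)
    with open(out_path, encoding="utf-8") as f:
        written = f.read()
```

Keeping one inner list per doc (or per sentence) instead of two flat lists is what makes the blank-line separator possible.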
@azucker99 There's a minor issue in your code. In the section where you say
    if idx == 0 or doc[idx-1].ent_iob_ == 'O':
        tag_list.append("B-" + token.ent_type_)
you want to say
    if idx == 0 or doc[idx-1].ent_iob_ == 'O' or doc[idx-1].ent_type_ != token.ent_type_:
        tag_list.append("B-" + token.ent_type_)
Two entities with different tags can appear right after one another, and it's important to account for that. I know because I made that error when I wrote some very similar code.
@Akshay0799 Please take note if you decide to use @azucker99's code
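Another way to handle adjacent entities, sketched here under assumptions: spaCy stores "B", "I", "O" (or an empty string when no entity annotation is set) in token.ent_iob_, so the tag can be built from that value directly instead of comparing neighbouring tokens. If the DocBin kept those values, this also covers two adjacent entities of the same type. The function bio_tag and the sample tuples are hypothetical; each tuple stands for (token.text, token.ent_iob_, token.ent_type_).

```python
def bio_tag(ent_iob, ent_type):
    """Map spaCy-style (ent_iob_, ent_type_) values to a BIO tag string."""
    # Tokens outside an entity, or with no annotation at all, become "O".
    if ent_iob in ("B", "I") and ent_type:
        return f"{ent_iob}-{ent_type}"
    return "O"


# Hypothetical values standing in for (token.text, token.ent_iob_, token.ent_type_).
tokens = [
    ("Angela", "B", "PERSON"),
    ("Merkel", "I", "PERSON"),
    ("Berlin", "B", "GPE"),   # a new entity directly after another one
    ("today", "O", ""),
    ("untagged", "", ""),     # no entity annotation set
]

tags = [bio_tag(iob, ent_type) for _, iob, ent_type in tokens]
```

With real docs, the same call would be bio_tag(token.ent_iob_, token.ent_type_) inside the token loop.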
Thank you
Once I fit the aggregation model on the data, I used skweak's function to write the result as a DocBin file, which gets saved as a .spacy file. How do I convert this into a normal CoNLL-format file? Are there any libraries or tools that can do that?