Closed: laiviet closed this issue 2 years ago
I wrote the following script for my experiments. It may help you convert BIO format to the BART template format.
CorpusBIO.txt contains one token and label pair per line, with a blank line between sentences.
# example input
IBM B-ORG
is O
a O
...
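tokens = []
labels = []
# Read all lines and append an empty one so that the last sentence is
# flushed even when the file does not end with a blank line.
with open("../CorpusBIO.txt") as f:
    lines = f.readlines() + [""]
for line in lines:
    # ';' separates the two output fields, so drop it from the input.
    line = line.replace(';', '')
    if len(line.strip()) > 0:
        token, label = line.split()
        # Quotes would break the quoted sentence field in the output.
        token = token.replace('"', '')
        token = token.replace("'", "")
        tokens.append(token)
        labels.append(label)
    else:
        # Blank line: the sentence is complete, so emit one template per entity.
        buffer_token = ""
        buffer_label = ""
        first = " ".join(tokens)
        first = first.replace('"', '')
        first = first.replace(';', '')
        for l, t in zip(labels, tokens):
            # Any tag other than I-* closes the entity collected so far.
            if l.split("-")[0] != 'I' and buffer_token != "":
                print('"%s";%s is a %s entity.' % (first, buffer_token, buffer_label))
                buffer_token = ""
                buffer_label = ""
            # B-* starts a new entity.
            if l.split("-")[0] == 'B':
                buffer_token = t
                buffer_label = l.split("-")[1]
            # I-* extends the current entity.
            if l.split("-")[0] == 'I':
                buffer_token += " " + t
        # Emit an entity that runs to the end of the sentence.
        if buffer_token != "":
            print('"%s";%s is a %s entity.' % (first, buffer_token, buffer_label))
        tokens = []
        labels = []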
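If you want to reuse the same logic on sentences that are already in memory, here is a minimal sketch of the entity-grouping step as a function. The name bio_to_templates and the example sentence are only illustrative; they are not from the repository. Unlike the script above, this sketch ignores an I- tag that has no preceding B- tag, which is an assumption on my part about how to handle malformed input.

```python
def bio_to_templates(tokens, labels):
    """Return (sentence, template) pairs for one BIO-tagged sentence.

    Hypothetical helper: it mirrors the script's '<span> is a <TYPE> entity.'
    template, but returns the pairs instead of printing them.
    """
    sentence = " ".join(tokens)
    pairs = []
    span, span_label = [], None
    for token, label in zip(tokens, labels):
        # "B-ORG" -> ("B", "-", "ORG"); "O" -> ("O", "", "")
        prefix, _, entity_type = label.partition("-")
        # Any tag other than I-* closes the entity collected so far.
        if prefix != "I" and span:
            pairs.append((sentence, "%s is a %s entity." % (" ".join(span), span_label)))
            span, span_label = [], None
        if prefix == "B":
            span, span_label = [token], entity_type
        elif prefix == "I" and span:
            # Assumption: an I-* tag without an open entity is skipped.
            span.append(token)
    # Emit an entity that runs to the end of the sentence.
    if span:
        pairs.append((sentence, "%s is a %s entity." % (" ".join(span), span_label)))
    return pairs


example_tokens = ["IBM", "is", "based", "in", "New", "York"]
example_labels = ["B-ORG", "O", "O", "O", "B-LOC", "I-LOC"]
for sentence, template in bio_to_templates(example_tokens, example_labels):
    # Same two-field layout the script prints: quoted sentence, ';', template.
    print('"%s";%s' % (sentence, template))
```

Returning pairs instead of printing makes it easier to write the rows with whatever delimiter or quoting the training code expects.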
Thanks!
Can you share the format of the input CSV files? Thank you, Viet