Open cooper12121 opened 2 years ago
Hello! I also ran into problems when running msra2mrc.py. Do you know how to deal with the invalid-input problem? I would be very grateful for a reply!
The input data comes in several different formats, but none of them matched the paper's data-processing script. After several attempts I gave up, so sorry, I can't help you.
Anyway, thank you for your reply!
import json

# NOTE: bmes_decode is the repository's own helper; it is not shown in this thread.


def convert_file(input_file, output_file, tag2query_file):
    """
    Convert MSRA raw data to MRC format
    """
    origin_count = 0
    new_count = 1
    with open(tag2query_file, encoding="utf-8") as fquery:
        tag2query = json.load(fquery)
    mrc_samples = []
    contexts, labels = [], []
    with open(input_file, encoding="utf-8") as fin:
        for line in fin:
            line = line.strip()
            if line:
                # Each non-empty line is expected to hold exactly one "char label" pair.
                context, label = line.split(" ")
                contexts.append(context)
                labels.append(label)
            else:
                # A blank line ends the current sentence. Caveat: if the file does not
                # end with a blank line, the last sentence is never emitted.
                tags = bmes_decode(char_label_list=[(char, label) for char, label in zip(contexts, labels)])
                # tags = bmes_decode(char_label_list=[(char, label) for char, label in zip(src.split(), labels.split())])
                for label, query in tag2query.items():
                    start_position = [tag.begin for tag in tags if tag.tag == label]
                    end_position = [tag.end-1 for tag in tags if tag.tag == label]
                    mrc_samples.append(
                        {
                            "qas_id": "{}.{}".format(origin_count, new_count),
                            "context": " ".join(contexts),
                            "start_position": start_position,
                            "end_position": end_position,
                            "query": query,
                            "impossible": False if start_position and end_position else True,
                            "entity_label": label,
                            "span_position": [f"{start};{end}" for start, end in zip(start_position, end_position)]
                        }
                    )
                    new_count += 1
                # Reset the buffers and counters for the next sentence.
                contexts, labels = [], []
                origin_count += 1
                new_count = 1
    with open(output_file, "w", encoding="utf-8") as fout:
        json.dump(mrc_samples, fout, ensure_ascii=False, sort_keys=True, indent=2)
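The function above depends on bmes_decode, which is not shown in this thread. For readers following along, here is a minimal sketch of what such a decoder might look like. It assumes labels of the form B-X / M-X / E-X / S-X plus O, and it assumes each returned tag exposes `tag`, `begin`, and an exclusive `end` (which is what `tag.end-1` above suggests). The names `Tag` and `bmes_decode_sketch` are hypothetical, and the repository's real helper may behave differently.

```python
from typing import List, NamedTuple, Tuple


class Tag(NamedTuple):
    term: str   # surface text of the entity
    tag: str    # entity label, e.g. "NS"
    begin: int  # index of the first character
    end: int    # index one past the last character (exclusive)


def bmes_decode_sketch(char_label_list: List[Tuple[str, str]]) -> List[Tag]:
    """Collect entity spans from (char, label) pairs under an assumed BMES scheme."""
    tags = []
    start = None  # index where the currently open span began, if any
    for idx, (char, label) in enumerate(char_label_list):
        prefix, _, name = label.partition("-")
        if prefix == "S":
            # Single-character entity.
            tags.append(Tag(char, name, idx, idx + 1))
            start = None
        elif prefix == "B":
            start = idx
        elif prefix == "E" and start is not None:
            term = "".join(c for c, _ in char_label_list[start:idx + 1])
            tags.append(Tag(term, name, start, idx + 1))
            start = None
        elif prefix == "O":
            start = None
        # "M" leaves the current span open. For simplicity, this sketch does
        # not check that the B and E labels carry the same entity type.
    return tags


pairs = [("我", "O"), ("在", "O"), ("北", "B-NS"), ("京", "E-NS"), ("。", "O")]
decoded = bmes_decode_sketch(pairs)
```

With these pairs, `decoded` holds one tag covering characters 2 to 3, so convert_file would record start position 2 and end position 3 for the label NS.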
I think your script has a problem processing the raw dataset. For example, the raw data looks like this:
当 O 希 O 望 O 工 O 程 O 救 O 助 O 的 O 百 O 万 O 儿 O 童 O 成 O 长 O 起 O 来 O , O 科 O 教 O 兴 O 国 O 蔚 O 然 O 成 O 风 O 时 O , O 今 O 天 O 有 O 收 O 藏 O 价 O 值 O 的 O 书 O 你 O 没 O 买 O , O 明 O 日 O 就 O 叫 O 你 O 悔 O 不 O 当 O 初 O ! O
藏 O 书 O 本 O 来 O 就 O 是 O 所 O 有 O 传 O 统 O 收 O 藏 O 门 O 类 O 中 O 的 O 第 O 一 O 大 O 户 O , O 只 O 是 O 我 O 们 O 结 O 束 O 温 O 饱 O 的 O 时 O 间 O 太 O 短 O 而 O 已 O 。 O
So it doesn't work on the raw dataset you provided, and its output also doesn't match the MRC-format dataset you provided.
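If the raw file really stores each sentence as a single line of alternating character and label tokens, as in the sample above, then `line.split(" ")` cannot unpack into two values, which may be the source of the invalid-input error. One possible workaround is to pair up the tokens per line, roughly as in the sketch below. The function name `parse_char_label_line` is hypothetical and not part of the repository, and the one-sentence-per-line layout is an assumption based only on the sample pasted here.

```python
from typing import List, Tuple


def parse_char_label_line(line: str) -> List[Tuple[str, str]]:
    """Split 'c1 l1 c2 l2 ...' into [(c1, l1), (c2, l2), ...]."""
    tokens = line.split()  # splits on any whitespace, including tabs
    if len(tokens) % 2 != 0:
        raise ValueError(
            "expected alternating char/label tokens, got {} tokens".format(len(tokens))
        )
    # Even positions are characters, odd positions are their labels.
    return list(zip(tokens[0::2], tokens[1::2]))


sample = "藏 O 书 O 本 O 来 O"
pairs = parse_char_label_line(sample)
contexts = [char for char, _ in pairs]
labels = [label for _, label in pairs]
```

The resulting `pairs` list could then be passed as `char_label_list`, with each input line treated as one complete sentence instead of waiting for a blank line.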