the msra2src.py may have some problems

cooper12121 commented 2 years ago

I think your .py file has a problem processing the raw dataset. For example, the raw data looks like this: 当 O 希 O 望 O 工 O 程 O 救 O 助 O 的 O 百 O 万 O 儿 O 童 O 成 O 长 O 起 O 来 O ， O 科 O 教 O 兴 O 国 O 蔚 O 然 O 成 O 风 O 时 O ， O 今 O 天 O 有 O 收 O 藏 O 价 O 值 O 的 O 书 O 你 O 没 O 买 O ， O 明 O 日 O 就 O 叫 O 你 O 悔 O 不 O 当 O 初 O ！ O

藏 O 书 O 本 O 来 O 就 O 是 O 所 O 有 O 传 O 统 O 收 O 藏 O 门 O 类 O 中 O 的 O 第 O 一 O 大 O 户 O ， O 只 O 是 O 我 O 们 O 结 O 束 O 温 O 饱 O 的 O 时 O 间 O 太 O 短 O 而 O 已 O 。 O

So the script doesn't work on the raw dataset you provided, and its output also doesn't match the MRC-format dataset you provided.
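For reference, here is a minimal sketch of how a flattened line like the ones above could be split back into (character, label) pairs. It assumes each sentence sits on a single line as alternating whitespace-separated character and label tokens, which may not hold for every copy of the dataset. `parse_flat_line` is a hypothetical helper, not part of the repository.

```python
def parse_flat_line(line):
    """Split 'c1 l1 c2 l2 ...' into ([c1, c2, ...], [l1, l2, ...]).

    Assumption: tokens alternate strictly between a character and its label.
    """
    tokens = line.split()
    if len(tokens) % 2 != 0:
        raise ValueError(
            "expected an even number of tokens, got {}".format(len(tokens))
        )
    chars = tokens[0::2]   # tokens at even positions are characters
    labels = tokens[1::2]  # tokens at odd positions are labels
    return chars, labels


chars, labels = parse_flat_line("藏 O 书 O 本 O 来 O")
print("".join(chars))
print(labels)
```

The resulting lists could then be written out as one "char label" pair per line, with a blank line between sentences, if a converter expects that layout.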

dsuanya commented 1 year ago

Hello! I also ran into some problems when running msra2mrc.py. Do you know how to deal with the invalid-input problem? I would really appreciate a reply!

cooper12121 commented 1 year ago

There are several different formats of this input data, but none of them matched the paper's data-processing script. After some attempts I gave up, so sorry, I can't help you.

dsuanya commented 1 year ago

Anyway, thank you for your reply!

algorithm007 commented 1 year ago

import json

# Assumed import path, matching the repository's other ner2mrc scripts.
from utils.bmes_decode import bmes_decode


def convert_file(input_file, output_file, tag2query_file):
    """
    Convert MSRA raw data to MRC format.

    Expects one "char label" pair per line, with a blank line between sentences.
    """
    with open(tag2query_file, encoding="utf-8") as fquery:
        tag2query = json.load(fquery)

    with open(input_file, encoding="utf-8") as fin:
        lines = [line.strip() for line in fin]
    # Sentinel blank line so the last sentence is emitted even if the file
    # does not end with a blank line.
    lines.append("")

    origin_count = 0
    mrc_samples = []
    contexts, labels = [], []
    for line in lines:
        if line:
            context, label = line.split()
            contexts.append(context)
            labels.append(label)
            continue
        if not contexts:
            # Skip repeated blank lines instead of emitting empty samples.
            continue

        tags = bmes_decode(char_label_list=list(zip(contexts, labels)))
        # One sample per entity type; new_count restarts at 1 for each sentence.
        for new_count, (label, query) in enumerate(tag2query.items(), start=1):
            start_position = [tag.begin for tag in tags if tag.tag == label]
            end_position = [tag.end - 1 for tag in tags if tag.tag == label]
            mrc_samples.append(
                {
                    "qas_id": "{}.{}".format(origin_count, new_count),
                    "context": " ".join(contexts),
                    "start_position": start_position,
                    "end_position": end_position,
                    "query": query,
                    "impossible": not (start_position and end_position),
                    "entity_label": label,
                    "span_position": [f"{start};{end}" for start, end in zip(start_position, end_position)]
                }
            )

        contexts, labels = [], []
        origin_count += 1

    with open(output_file, "w", encoding="utf-8") as fout:
        json.dump(mrc_samples, fout, ensure_ascii=False, sort_keys=True, indent=2)
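
The function above appears to rely on `bmes_decode` returning spans that carry a `tag`, a `begin` index, and an exclusive `end` index, which would explain why the end position subtracts 1. Below is a minimal stand-in sketch of that behaviour, assuming labels of the form `B-XX`, `M-XX`, `E-XX`, `S-XX`, and `O`. `Span` and `decode_bmes` are hypothetical names, and this is not the repository's implementation.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    term: str   # surface text of the entity
    tag: str    # entity type, e.g. "NS"
    begin: int  # index of the first character
    end: int    # index one past the last character (exclusive)


def decode_bmes(char_label_list: List[Tuple[str, str]]) -> List[Span]:
    """Collect entity spans from (char, label) pairs tagged with B/M/E/S/O.

    Simplification: does not check that B, M, and E share the same type.
    """
    spans = []
    start = None  # index where the currently open entity began
    for idx, (char, label) in enumerate(char_label_list):
        prefix, _, tag = label.partition("-")
        if prefix == "S":
            # Single-character entity.
            spans.append(Span(char, tag, idx, idx + 1))
            start = None
        elif prefix == "B":
            start = idx
        elif prefix == "E" and start is not None:
            term = "".join(c for c, _ in char_label_list[start:idx + 1])
            spans.append(Span(term, tag, start, idx + 1))
            start = None
        elif prefix == "O":
            start = None
        # "M" leaves the current entity open.
    return spans


pairs = [("北", "B-NS"), ("京", "E-NS"), ("很", "O"), ("大", "O")]
for span in decode_bmes(pairs):
    # end - 1 gives the inclusive end index, as in end_position above.
    print(span.term, span.tag, span.begin, span.end - 1)
```

With spans shaped like this, `start_position` and `end_position` are both inclusive indices into the space-joined context.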

ShannonAI / mrc-for-flat-nested-ner

the msra2src.py may have some problems #120