for index in index_set:
    covered_indexes.add(index)
    masked_token = None
    # 80% of the time, replace with [MASK]
    if rng.random() < 0.8:
        masked_token = "[MASK]"
    else:
        # 10% of the time, keep original
        if rng.random() < 0.5:
            # non_chinese == False means the text is Chinese, so try to
            # remove the "##" prefix that was added earlier
            if FLAGS.non_chinese == False:
                # strip the leading "##"
                masked_token = tokens[index][2:] if len(re.findall('##[\u4E00-\u9FA5]', tokens[index])) > 0 else tokens[index]
            else:
                masked_token = tokens[index]
        # 10% of the time, replace with random word
        else:
            masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]
The code above is your implementation for masking Chinese words.
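index_set holds all the token indices of a single Chinese word, but inside this loop the replacement is drawn at random separately for each Chinese character, which looks the same as the masking strategy in the original BERT. Isn't WWM supposed to treat a word as a unit, so that it is either masked entirely or not masked at all? With this implementation, a complete Chinese word can end up only partially masked. Am I misunderstanding something?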
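
To make the question concrete, here is a minimal sketch of the all-or-nothing behavior I expected. It is not the repository's code. The helper name `mask_whole_word`, the sample tokens, and the toy vocabulary are all made up for illustration, and the "##" stripping step is left out for brevity. The only change from the loop above is that the 80/10/10 draw happens once per word instead of once per character.

```python
import random


def mask_whole_word(tokens, index_set, vocab_words, rng):
    """Hypothetical helper: pick ONE replacement strategy for the whole word."""
    # Single draw shared by every character of the word.
    draw = rng.random()
    replacements = {}
    for index in index_set:
        if draw < 0.8:
            # 80% of the time, replace every character with [MASK]
            replacements[index] = "[MASK]"
        elif draw < 0.9:
            # 10% of the time, keep every character unchanged
            replacements[index] = tokens[index]
        else:
            # 10% of the time, replace every character with a random word
            replacements[index] = vocab_words[rng.randint(0, len(vocab_words) - 1)]
    return replacements


# Illustrative data only: three two-character words after "##" prefixing.
tokens = ["使", "##用", "语", "##言", "模", "##型"]
vocab_words = ["的", "是", "在", "有"]

rng = random.Random(12345)
# Indices 4 and 5 form the word "模型".
result = mask_whole_word(tokens, [4, 5], vocab_words, rng)
```

With a single draw, the two characters of a word always receive the same kind of treatment: both become [MASK], both are kept, or both are swapped for random words. In the quoted loop the draw is repeated per index, so one character can become [MASK] while its neighbor is kept.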