What do char span and token span refer to? - Githubissues

131250208 / TPlinker-joint-extraction

What do char span and token span refer to? #53

Open macheng6 opened 3 years ago

LimKim commented 3 years ago

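A char span is the entity's start and end positions counted in characters. A token span is the entity's new start and end positions after the text has been run through the BERT tokenizer.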
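
To make the distinction concrete, here is a minimal sketch in Python. It uses a toy regex tokenizer as a stand-in for the BERT tokenizer (an assumption, chosen so the example runs without extra dependencies), and the helper names `tokenize_with_offsets` and `char_span_to_tok_span` are hypothetical, not functions from the TPLinker repository. Spans are treated as half-open [start, end) ranges, which is also an assumption of this sketch.

```python
import re

# Toy tokenizer (an assumption, NOT the BERT tokenizer): each run of ASCII
# letters/digits is one token; every other non-space character is its own token.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|\S")


def tokenize_with_offsets(text):
    """Return (tokens, offsets); offsets[i] is the [start, end) char range of token i."""
    tokens, offsets = [], []
    for match in TOKEN_RE.finditer(text):
        tokens.append(match.group())
        offsets.append((match.start(), match.end()))
    return tokens, offsets


def char_span_to_tok_span(char_span, offsets):
    """Map a [start, end) char span to a [start, end) token span.

    Returns None when no token overlaps the char span.
    """
    start, end = char_span
    covered = [i for i, (s, e) in enumerate(offsets) if s < end and e > start]
    if not covered:
        return None
    return (covered[0], covered[-1] + 1)


text = "我喜欢TPLinker模型"
tokens, offsets = tokenize_with_offsets(text)

entity = "TPLinker"
char_span = (text.index(entity), text.index(entity) + len(entity))
tok_span = char_span_to_tok_span(char_span, offsets)
print(tokens, char_span, tok_span)
```

In this example the entity "TPLinker" has char span (3, 11) but token span (3, 4), because the eight Latin characters collapse into a single token. Everything after it shifts as well: "模型" moves from char span (11, 13) to token span (4, 6). A real WordPiece tokenizer would split the text differently, but the two kinds of span diverge for the same reason.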

macheng6 commented 3 years ago

For Chinese, shouldn't the two be the same? The Chinese BertTokenizer is also character-based, isn't it?

LimKim commented 3 years ago

For text that is entirely Chinese, they are nearly the same. But you don't need to worry about this: set the char span correctly, and the BuildData code in preprocess will generate the tok span for you automatically.

macheng6 commented 3 years ago

I can't follow the code, and just running it without understanding it doesn't feel right.

131250208 commented 3 years ago

@macheng6 English words also appear in Chinese text.

macheng6 commented 3 years ago

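Yes, I just took a closer look and that is indeed the case: there is a three-level structure of char, token, and ent (also called span). I also found a small bug: if the first span of a sentence is an ent and the ent's first character is a space, the tokens that follow end up with -1. It is in the function at line 310 of utils.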

131250208 commented 3 years ago

@macheng6 Just strip the leading and trailing spaces from sentences and entities during preprocessing. Those spaces are not valid.
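
One way to apply this advice is sketched below. It is an illustration only: the function name `strip_text_and_spans` is hypothetical, and the record layout (entities as dicts with `text` and `char_span` keys, half-open spans) is an assumption that should be checked against your own data before use.

```python
def strip_text_and_spans(text, entities):
    """Strip boundary whitespace from the sentence and from each entity.

    `entities` is assumed to be a list of dicts with a "char_span" key holding
    a half-open [start, end) range into `text`. Returns the stripped sentence
    and new entity dicts whose spans index into the stripped sentence.
    """
    lead = len(text) - len(text.lstrip())  # chars removed from the front
    clean_text = text.strip()
    cleaned = []
    for ent in entities:
        start, end = ent["char_span"]
        surface = text[start:end]
        # Shrink the span so it no longer begins or ends on whitespace.
        start += len(surface) - len(surface.lstrip())
        end -= len(surface) - len(surface.rstrip())
        if start >= end:
            continue  # the entity was whitespace only; drop it
        new_ent = dict(ent)
        new_ent["text"] = text[start:end]
        # Shift by the leading whitespace removed from the sentence.
        new_ent["char_span"] = [start - lead, end - lead]
        cleaned.append(new_ent)
    return clean_text, cleaned


raw_text = " Paris is in France "
raw_entities = [
    {"text": " Paris", "char_span": [0, 6]},
    {"text": "France ", "char_span": [13, 20]},
]
clean_text, clean_entities = strip_text_and_spans(raw_text, raw_entities)
print(clean_text, clean_entities)
```

After cleaning, every entity's `char_span` should slice its own `text` out of the cleaned sentence, which is a cheap check to run before generating token spans.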

Wonderson-wpp commented 2 years ago

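How do I obtain the token_span? I want to apply the model to a small dataset I annotated myself, but annotation only gives me the char span directly, and I don't know how to derive the token span from it. What I mean is: in the training data under ori_data, do I need to mark the tok_span for the entities and relations? If so, how do I get it?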

131250208 commented 2 years ago

@Wonderson-wpp If you had read the discussion in this issue carefully, you would already know the answer. It is in @LimKim's reply.

xxllp commented 2 years ago

Why do I still get a lot of -1 values when I run it locally?

lzh1998-jansen commented 1 year ago

How do I convert a Chinese dataset into a TPLinker-format dataset? Could you give an example? I got an error when I ran builddata.py on the Baidu relation extraction competition dataset.