different preprocessor for text_left and text_right

sth4k commented 5 years ago

Hi, I found a problem when using the duet model. After basic_preprocessor, the text_left and text_right vectors are not the same for two identical texts (an exact match). I read through the code and found that, in basic_preprocessor.py, the filter_unit is applied only to text_right: https://github.com/NTMC-Community/MatchZoo/blob/2547f9d1b302d0f166508ba39fa659dfa210a276/matchzoo/preprocessors/basic_preprocessor.py#L129-L130

As a result, some long-tail words are removed from text_right but not from text_left, so when the vocab_unit is applied, those long-tail words get index "0" in text_left. As a dummy example, the vector for text_left would be [12,3,0,12,0,0] while the vector for text_right would be [12,3,12,0,0,0]. May I know why the filter_unit is applied to text_right only? My understanding is that if the model is meant for search, where the query (text_left) is usually short and informative, there is no need to apply the filter_unit to it. However, if this library is meant for general text matching, this may be a problem: I do observe that exact-match accuracy is poor in my experiments.
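To make the dummy example concrete, here is a minimal pure-Python sketch of the effect. It does not call MatchZoo; build_filter, encode, the tiny corpus, and the vocab indices are all hypothetical stand-ins, and it assumes that index 0 is shared by padding and out-of-vocabulary terms.

```python
from collections import Counter

PAD_OOV = 0  # assumption: one index serves as both padding and OOV


def build_filter(docs, min_df):
    """Return the terms that occur in at least min_df documents (hypothetical helper)."""
    df = Counter(term for doc in docs for term in set(doc))
    return {term for term, count in df.items() if count >= min_df}


def encode(tokens, vocab, length, keep=None):
    """Optionally drop filtered terms, map terms to indices, then pad/truncate."""
    if keep is not None:
        tokens = [t for t in tokens if t in keep]
    ids = [vocab.get(t, PAD_OOV) for t in tokens]
    return (ids + [PAD_OOV] * length)[:length]


# Toy corpus: "rare" appears in one document only, so it is a long-tail term.
corpus = [["loan", "types", "rare", "loan"], ["loan", "types"], ["types", "loan"]]
keep = build_filter(corpus, min_df=2)
vocab = {"loan": 12, "types": 3}  # made-up indices matching the dummy example

text = ["loan", "types", "rare", "loan"]
left = encode(text, vocab, 6)              # no filtering, as for text_left
right = encode(text, vocab, 6, keep=keep)  # filtered, as for text_right

print(left)   # [12, 3, 0, 12, 0, 0]
print(right)  # [12, 3, 12, 0, 0, 0]
```

In this sketch, passing the same keep set to both calls makes the two vectors identical, which is the symmetric behavior I would expect for text matching.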

bwanglzu commented 5 years ago

@uduse @faneshion take a look?

uduse commented 5 years ago

@faneshion This might still be worth investigating.

ShenCastle commented 3 years ago

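I ran into the same problem. For example, take the sentence “民间借贷方式有哪些” ("What forms of private lending are there?"). When the jieba segmentation result is used as text_left, it is “民间/借贷/方式/有/哪些”, but as text_right it becomes “有/哪些”. How should this be resolved?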

ShenCastle commented 3 years ago

@bwanglzu

ShenCastle commented 3 years ago

@faneshion Can you help me solve this problem?

different preprocessor for text_left and text_right #727