NELSONZHAO / zhihu

This repo contains the source code for my personal column (https://zhuanlan.zhihu.com/zhaoyeyu), implemented in Python 3.6. It includes Natural Language Processing and Computer Vision projects, such as text generation, machine translation, and a deep convolutional GAN, along with other hands-on code.
3.49k stars, 2.14k forks

skip_grams #25

Open brucexx opened 6 years ago

brucexx commented 6 years ago

I think there is a problem with the logic in this part:

words_count = Counter(words)
words = [w for w in words if words_count[w] > 50]

vocab = set(words)
vocab_to_int = {w: c for c, w in enumerate(vocab)}
int_to_vocab = {c: w for c, w in enumerate(vocab)}

print("total words: {}".format(len(words)))
print("unique words: {}".format(len(set(words))))

total words: 8623686
unique words: 6791

int_words = [vocab_to_int[w] for w in words]

As far as I can tell, vocab_to_int just maps each word to the position where it first appears.

t = 1e-5  # value of t
threshold = 0.9  # drop-probability threshold

And then this index is actually used to compute word frequencies?? Can anyone tell me what is going on here?

int_word_counts = Counter(int_words)
total_count = len(int_words)
word_freqs = {w: c/total_count for w, c in int_word_counts.items()}

prob_drop = {w: 1 - np.sqrt(t / word_freqs[w]) for w in int_word_counts}

Subsample the words:

train_words = [w for w in int_words if prob_drop[w] < threshold]
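For readers following the thread, here is a minimal sketch of the subsampling step quoted above, run on a toy corpus. The toy data and the values `t = 0.01` and `threshold = 0.8` are assumptions chosen so the numbers are easy to check by hand; they are not the values from the notebook. Like the quoted code, it drops a word deterministically whenever its drop probability reaches the threshold, rather than sampling at random.

```python
from collections import Counter
import math

# Toy corpus of integer word IDs (hypothetical data):
# ID 0 appears 90 times, ID 1 nine times, ID 2 once.
int_words = [0] * 90 + [1] * 9 + [2] * 1

t = 0.01         # assumed toy value; the notebook uses 1e-5
threshold = 0.8  # assumed toy value; the notebook uses 0.9

int_word_counts = Counter(int_words)
total_count = len(int_words)

# Relative frequency of each ID in the corpus.
word_freqs = {w: c / total_count for w, c in int_word_counts.items()}

# Same formula as in the quoted code: frequent words get a value close to 1.
# For very rare words (frequency below t) the value can go negative,
# which simply means the word is always kept.
prob_drop = {w: 1 - math.sqrt(t / word_freqs[w]) for w in int_word_counts}

# Keep only the occurrences whose drop probability is below the threshold.
train_words = [w for w in int_words if prob_drop[w] < threshold]

print(prob_drop)
print(len(int_words), "->", len(train_words))
```

In this toy setup only the most frequent ID (0) is removed, since its drop probability is roughly 0.89 while the other two are about 0.67 and 0.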

andrew-zzz commented 5 years ago

You didn't read the code carefully. vocab_to_int is built from set(words), and each word's index in that set is used as its one-hot identifier.
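The point can be checked with a small sketch: because every unique word gets exactly one integer ID, counting IDs gives the same counts as counting the words themselves. The sentence below is made-up toy data, and `counts_via_ids` is a name introduced here for illustration only.

```python
from collections import Counter

# Toy corpus (hypothetical data, not from the notebook).
words = "the cat sat on the mat the end".split()

# Same construction as in the quoted code: one ID per unique word.
# The exact ID assigned to each word depends on set iteration order,
# but the mapping is always one-to-one.
vocab = set(words)
vocab_to_int = {w: c for c, w in enumerate(vocab)}
int_to_vocab = {c: w for c, w in enumerate(vocab)}

int_words = [vocab_to_int[w] for w in words]

# Count the words directly, and count their integer IDs.
word_counts = Counter(words)
int_word_counts = Counter(int_words)

# Map the ID counts back to words; the result matches the direct count.
counts_via_ids = {int_to_vocab[i]: c for i, c in int_word_counts.items()}

print(counts_via_ids == dict(word_counts))
```

So the IDs are vocabulary indices, not positions in the text, and the frequencies computed from `int_words` are the ordinary word frequencies.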