Closed by soonchangAI 7 months ago
Thanks, @pbontrager. I have tried it, but a few words still differ. I think this is probably because there are 10 ground-truth answers available for each question. Here is my code:
```python
def word_tokenize(word, remove=None):
    # Lowercase, drop the characters in `remove`, and split off a trailing 's.
    if remove is None:
        remove = [",", "?"]
    word = word.lower()
    for item in remove:
        word = word.replace(item, "")
    word = word.replace("'s", " 's")
    return word.strip()


# `imdb` is loaded beforehand; the loop starts at index 1 and skips the first entry.
answers = []
for i in range(1, len(imdb)):
    # Only the first ground-truth answer of each entry is used here.
    words = imdb[i]['answers'][0].split()
    ans_word = []
    for word in words:
        ans_word.append(word_tokenize(word))
    clean_word = []
    for w in ans_word:
        clean_word += w.split()
    answers += clean_word

unique = list(set(answers))
print(len(unique))

# Count each unique word, sort ascending by count, and keep the 5000 most frequent.
word_count = {}
for word in unique:
    word_count[word] = answers.count(word)

sort_word_count = {k: v for k, v in sorted(word_count.items(), key=lambda item: item[1])}
freq_words = list(sort_word_count.keys())[-5000:]
```
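If the 10 ground-truth answers are indeed the cause, one variation to try is counting words over every answer rather than only `answers[0]`. The following is a minimal sketch under stated assumptions, not the script used to build the provided vocab: `count_answer_words` and `toy_imdb` are hypothetical names, the toy data is made up, and it assumes each entry's `answers` field is a list of strings with the first entry skipped as in the loop above.

```python
from collections import Counter


def word_tokenize(word, remove=(",", "?")):
    # Same cleanup as in the snippet above: lowercase, drop `remove`, split off 's.
    word = word.lower()
    for item in remove:
        word = word.replace(item, "")
    word = word.replace("'s", " 's")
    return word.strip()


def count_answer_words(imdb, start=1):
    # Hypothetical helper: count cleaned words over ALL answers of each entry.
    counts = Counter()
    for entry in imdb[start:]:
        for answer in entry["answers"]:
            for word in answer.split():
                counts.update(word_tokenize(word).split())
    return counts


# Made-up stand-in for the real imdb; the first entry is skipped, as above.
toy_imdb = [
    {"metadata": "header"},
    {"answers": ["Coca Cola", "coca cola", "Coke?"]},
    {"answers": ["Smucker's", "smucker's jam"]},
]

counts = count_answer_words(toy_imdb)
# Keep the 5000 most frequent words (here, all of them).
freq_words = [word for word, _ in counts.most_common(5000)]
print(counts)
```

`Counter` tallies every word in a single pass, which avoids calling `answers.count(word)` once per unique word. `most_common(5000)` returns the words in descending order of count, so it selects the same set as sorting ascending and slicing `[-5000:]`, apart from how ties at the cutoff are broken.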
❓ Questions and Help
Is there a script for extracting the vocabulary for M4C? In the M4C paper, it states
I tried to create my own vocab by using `split()`. However, compared with the provided vocab, my vocab contains some different words, such as `it.`, `oxygen?`, and `smucker's`. It seems I am missing some rules.
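To make the mismatch concrete, here is a small sketch showing how a plain `split()` leaves punctuation attached to words. The cleanup rules in `split_with_cleanup` (dropping `,` and `?`, separating `'s`) are an assumption for illustration only, not the confirmed M4C rules, and the sample string is made up from the words above.

```python
def split_only(text):
    # Plain whitespace split: punctuation stays attached to each word.
    return text.lower().split()


def split_with_cleanup(text, remove=(",", "?")):
    # Hypothetical cleanup: drop the characters in `remove`, then separate "'s".
    text = text.lower()
    for ch in remove:
        text = text.replace(ch, "")
    text = text.replace("'s", " 's")
    return text.split()


sample = "Oxygen? Smucker's it."  # made-up sample built from the words above
print(split_only(sample))
print(split_with_cleanup(sample))
```

With these assumed rules, `oxygen?` becomes `oxygen` and `smucker's` becomes `smucker` plus `'s`, but `it.` keeps its period, so any rule that handles trailing periods would have to come from somewhere else.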