facebookresearch / mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
https://mmf.sh/

How to extract vocabulary for TextVQA or ST-VQA for M4C model? #1323

Closed soonchangAI closed 7 months ago

soonchangAI commented 7 months ago

❓ Questions and Help

Is there a script for extracting the vocabulary for M4C? The M4C paper states:

We collect the top 5000 frequent words from the answers in the training set as our answer vocabulary.

I tried to create my own vocab using `split()`, but when I compare it to the provided vocab, mine contains some different words, such as `it.`, `oxygen?`, and `smucker's`. It seems I am missing some rules.
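A quick way to see which rules are missing is to diff the two vocabularies directly. The sketch below is only illustrative: `diff_vocab` is a hypothetical helper, and the sample word lists are toy stand-ins for the real vocab files, whose exact format is not shown in this thread.

```python
def diff_vocab(mine, reference):
    """Return (only_in_mine, only_in_reference) as sorted lists."""
    mine_set, ref_set = set(mine), set(reference)
    return sorted(mine_set - ref_set), sorted(ref_set - mine_set)


# Toy stand-ins for the two vocabularies (not real TextVQA data).
my_vocab = ["it.", "oxygen?", "smucker's", "coca", "cola"]
provided_vocab = ["it", "oxygen", "smucker", "'s", "coca", "cola"]

only_mine, only_provided = diff_vocab(my_vocab, provided_vocab)
print(only_mine)      # tokens that kept punctuation or were not split
print(only_provided)  # the cleaned forms the reference uses instead
```

Tokens that appear only on one side usually point at a normalization step (punctuation stripping, possessive splitting) that a plain `split()` does not perform.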

pbontrager commented 7 months ago

If you look in the TextVQA config, you can see that answers are processed with the "simple_word" tokenizer found here. Also the vocabulary should be in textvqa/defaults/extras/vocabs/fixed_answer_vocab_textvqa_5k.txt

soonchangAI commented 7 months ago

Thanks, @pbontrager. I have tried it, but several words still differ. I think it's probably because 10 ground-truth answers are available and I only use the first one. Here is my code:

```python
def word_tokenize(word, remove=None):
    if remove is None:
        remove = [",", "?"]
    word = word.lower()

    for item in remove:
        word = word.replace(item, "")
    word = word.replace("'s", " 's")

    return word.strip()


answers = []
for i in range(1, len(imdb)):
    words = imdb[i]['answers'][0].split()

    ans_word = []
    for word in words:
        ans_word.append(word_tokenize(word))
    clean_word = []
    for w in ans_word:
        clean_word += w.split()
    answers += clean_word

unique = list(set(answers))
print(len(unique))
word_count = {}
for word in unique:
    word_count[word] = answers.count(word)

sort_word_count = {k: v for k, v in sorted(word_count.items(), key=lambda item: item[1])}
freq_words = list(sort_word_count.keys())[-5000:]
```
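To test the guess that the gap comes from the 10 ground-truth answers, one option is to count tokens over every answer in each entry rather than only `answers[0]`. The following is a minimal sketch under stated assumptions, not a confirmed reproduction of how the provided vocab was built: `tokenize_answer` and `build_vocab` are hypothetical helpers, the tokenization rules simply mirror the code above, and the toy entries stand in for `imdb[1:]`.

```python
from collections import Counter


def tokenize_answer(answer, remove=(",", "?")):
    """Lowercase, drop the listed characters, split off 's, then split on whitespace."""
    answer = answer.lower()
    for item in remove:
        answer = answer.replace(item, "")
    return answer.replace("'s", " 's").split()


def build_vocab(entries, top_k=5000):
    """Count tokens over every ground-truth answer and keep the top_k most frequent."""
    counts = Counter()
    for entry in entries:
        # Assumption: each entry holds a list of answer strings under "answers".
        for answer in entry["answers"]:
            counts.update(tokenize_answer(answer))
    return [word for word, _ in counts.most_common(top_k)]


# Toy entries standing in for imdb[1:] (not real TextVQA data).
toy_entries = [
    {"answers": ["Smucker's", "smucker's jam", "jam"]},
    {"answers": ["oxygen?", "oxygen", "jam, oxygen"]},
]
vocab = build_vocab(toy_entries, top_k=3)
print(vocab)
```

`Counter` also avoids the repeated `answers.count(word)` scans, which become slow on a full training set. Words tied at the 5000th position may still differ from the provided vocab, since the order among equal counts is not specified in the paper.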