alasdairtran / transform-and-tell

[CVPR 2020] Transform and Tell: Entity-Aware News Image Captioning
https://transform-and-tell.ml/

Question about the "rare proper nouns" in the paper. #44

Open Neowyh opened 2 years ago

Neowyh commented 2 years ago

Hi Alasdair,

I'm curious about the "rare proper nouns" mentioned in your CVPR paper, which are described as "... nouns that appear in a test caption but not in any training caption." I was wondering if I could ask some questions:

1) Are the "rare proper nouns" proper nouns extracted by a toolkit, the way named entities are extracted with spacy? If so, how are the "rare proper nouns" extracted?
2) Is there any difference between the "rare proper nouns" and "named entities" except that the former is "rare"?
3) The "rare proper nouns" do not appear in any training caption, but is it possible for them to appear in training or testing news articles?

Thanks very much!

alasdairtran commented 2 years ago

How are the "rare proper nouns" extracted?

We first use spacy to extract proper nouns. You can see the actual get_proper_nouns function we use here. We then define a "rare proper noun" as a proper noun that appears in a test caption but not in any training caption (note that we only look at captions, not the article content).
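For readers who want to try this on their own data, here is a minimal sketch of the two steps described above. It is not the repository's get_proper_nouns function. The rule "keep tokens whose part-of-speech tag is PROPN", the en_core_web_sm model, and the helper names spacy_proper_nouns and rare_proper_nouns are all assumptions made for illustration.

```python
from functools import lru_cache
from typing import Callable, Iterable, List, Set


@lru_cache(maxsize=None)
def _load_nlp(model: str):
    # Imported lazily so the set logic below can be used without spaCy.
    import spacy

    return spacy.load(model)  # assumes the model has already been downloaded


def spacy_proper_nouns(text: str, model: str = "en_core_web_sm") -> List[str]:
    """Return the tokens that spaCy tags as PROPN (assumed extraction rule)."""
    return [token.text for token in _load_nlp(model)(text) if token.pos_ == "PROPN"]


def rare_proper_nouns(
    train_captions: Iterable[str],
    test_captions: Iterable[str],
    extract: Callable[[str], List[str]],
) -> Set[str]:
    """Proper nouns found in a test caption but in no training caption."""
    seen_in_training = {noun for cap in train_captions for noun in extract(cap)}
    seen_in_test = {noun for cap in test_captions for noun in extract(cap)}
    # Only captions are compared; article text is never consulted here.
    return seen_in_test - seen_in_training
```

With spaCy and a model installed, something like rare_proper_nouns(train_captions, test_captions, spacy_proper_nouns) should give the set of rare proper nouns under these assumptions.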

Is there any difference between the "rare proper nouns" and "named entities" except that the former is "rare"?

You can check out our get_entities function here. We use the NER from spacy. I believe that proper nouns and named entities are overlapping but not identical concepts. For example, "$1 billion" is a named entity (MONEY) but not a proper noun.
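The distinction can be checked directly with spaCy. The sketch below is not the repository's get_entities function. The helper names entity_records and entities_that_are_not_proper_nouns and the en_core_web_sm model are assumptions, and the actual tags depend on the model version.

```python
from typing import List, Tuple

# (entity text, entity label, whether any token in the entity is tagged PROPN)
EntityRecord = Tuple[str, str, bool]


def entity_records(text: str, model: str = "en_core_web_sm") -> List[EntityRecord]:
    """Run spaCy NER and note whether each entity contains a PROPN token."""
    import spacy  # lazy import; assumes spaCy and the model are installed

    doc = spacy.load(model)(text)
    return [
        (ent.text, ent.label_, any(tok.pos_ == "PROPN" for tok in ent))
        for ent in doc.ents
    ]


def entities_that_are_not_proper_nouns(
    records: List[EntityRecord],
) -> List[Tuple[str, str]]:
    """Keep the named entities that contain no proper-noun token."""
    return [(text, label) for text, label, has_propn in records if not has_propn]
```

Passing the output of entity_records to entities_that_are_not_proper_nouns should surface entities such as MONEY or DATE spans, which are named entities without being proper nouns.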

The "rare proper nouns" do not appear in any training caption, but is it possible for them to appear in training or testing news articles?

Yes. Since we only look at captions when classifying a proper noun as rare, a rare proper noun is absent from every training caption but may still have appeared in the text of a training article.
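If you want to measure how often that happens, a rough check could look like the sketch below. It is not part of the repository. The helper name rare_nouns_seen_in_articles is hypothetical, and whole-word regex matching is an assumed simplification that will miss tokenization edge cases.

```python
import re
from typing import Iterable, Set


def rare_nouns_seen_in_articles(
    rare_nouns: Iterable[str], train_articles: Iterable[str]
) -> Set[str]:
    """Return the rare proper nouns that occur in at least one training article."""
    articles = list(train_articles)
    seen = set()
    for noun in rare_nouns:
        # Whole-word, case-sensitive match; a simplification of real tokenization.
        pattern = re.compile(r"\b" + re.escape(noun) + r"\b")
        if any(pattern.search(article) for article in articles):
            seen.add(noun)
    return seen
```

The nouns returned here are rare with respect to captions only: the model could still have seen them in article text during training.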