IAB version of uploaded dataset

YipingNUS / contextual-eval-dataset

Evaluation Dataset for "Bootstrapping Large-Scale Fine-Grained Contextual Advertising Classifier from Wikipedia" (TextGraphs-15 Workshop@NAACL 2021)

GNU General Public License v2.0


IAB version of uploaded dataset #2

Open thefirebanks opened 2 years ago

thefirebanks commented 2 years ago

Hey @YipingNUS! Excellent work here. I was looking at the IAB content taxonomy website and I see that they have released up to version 3.0. When I look at older versions (say 2.0), there are around 35 tier 1 categories and around 560 tier 2 categories. However, in your paper (and in the data), there are only 23 tier 1 categories and 354 tier 2 categories.

Which version of the IAB taxonomy did you use for this data? 1.0? The website says that version 1.0 is deprecated. If you did use 1.0, do you know if there's a mapping between the categories in versions 1.0 and 2.0? Thanks!

thefirebanks commented 2 years ago

Also, just wanted to confirm a couple more things:

The size of the training dataset is supposed to be 1.16 million labeled documents. But when I collect the unique file names across all of the subdirectories, I get 871,069. Is this expected?
Is this file the evaluation dataset? https://github.com/YipingNUS/contextual-eval-dataset/blob/main/iab-tier-2-sample.tar.gz

YipingNUS commented 2 years ago

Hi @thefirebanks, it follows IAB v1. Migrating to v2 is a huge effort and we have a small team, so we abandoned it. As I remember, the v2 release CSV file included a (partial) mapping to v1 categories.

Some documents are assigned multiple labels so the total number of unique documents is smaller than 1.16M.
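The two counts can be compared directly. This is a minimal sketch, assuming one subdirectory per label and that a document carrying several labels appears under each of them with the same file name; the function name `count_documents` and that layout are assumptions for illustration, not something confirmed about the released data.

```python
from pathlib import Path


def count_documents(root):
    """Return (label_assignments, unique_documents) for a directory tree.

    Assumes (hypothetically) one subdirectory per label, with a
    multi-label document stored under each label with the same file name.
    """
    root = Path(root)
    # Every file found under the root counts as one (label, document) pair.
    names = [p.name for p in root.rglob("*") if p.is_file()]
    # Deduplicating by file name collapses multi-label documents to one.
    return len(names), len(set(names))
```

If the first number is close to 1.16M and the second is 871,069, the gap would be explained by multi-label documents alone.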

Yes, the tgz file is the sample evaluation dataset. Sorry, I can't provide the full eval set.

thefirebanks commented 2 years ago

Got it, thank you!!! I can confirm that there is a mapping in the second sheet of the v2 Taxonomy Excel file.
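For anyone who wants to relabel the v1 data, here is a rough sketch of how a partial mapping like this could be loaded, assuming the mapping sheet has been exported to CSV. The column names `v1_id` and `v2_id` and the function `load_v1_to_v2` are hypothetical; the real sheet's headers need to be checked and passed in.

```python
import csv
from collections import defaultdict


def load_v1_to_v2(csv_file, v1_col="v1_id", v2_col="v2_id"):
    """Build a v1 -> [v2, ...] lookup from an open CSV file.

    The column names are placeholders (assumptions), not the real headers.
    """
    v1_to_v2 = defaultdict(list)
    for row in csv.DictReader(csv_file):
        v1 = (row.get(v1_col) or "").strip()
        # The mapping is partial: rows without a v1 equivalent are skipped.
        if v1:
            v1_to_v2[v1].append(row[v2_col].strip())
    return dict(v1_to_v2)
```

A v1 category may map to several v2 categories, which is why each value is a list; v1 categories absent from the result have no mapping and would need manual handling.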