allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

HC4 #148

Closed eugene-yang closed 2 years ago

eugene-yang commented 2 years ago

Dataset Information:

The HC4 collection, which was accepted at ECIR 2022.

**Links to Resources:**

https://github.com/hltcoe/HC4/tree/main/resouces/hc4

**Dataset ID(s) & supported entities:**

- Dataset ID: hc4/{language id: zh, fa, ru}/{train, dev, test}
- Will have {Chinese, Farsi, Russian} documents, English queries (title/description/narrative), an English report associated with each topic, and qrels.

**Checklist**

Mark each task once completed. All should be checked prior to merging a new dataset.

- [x] Dataset definition (in `ir_datasets/datasets/[topid].py`)
- [x] Tests (in `tests/integration/[topid].py`)
- [x] Metadata generated (using `ir_datasets generate_metadata` command, should appear in `ir_datasets/etc/metadata.json`)
- [x] Documentation (in `ir_datasets/etc/[topid].yaml`)
  - [ ] Documentation generated in https://github.com/seanmacavaney/ir-datasets.com/
- [x] Downloadable content (in `ir_datasets/etc/downloads.json`)
- [x] Download verification action (in `.github/workflows/verify_downloads.yml`). Only one needed per `topid`.
- [x] ~~Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in `downloads.json`.~~

**Additional comments/concerns/ideas/etc.**

The document ids, qrels, and topics will be distributed through a public GitHub repository. Users need to download the actual documents through Common Crawl. A script for downloading and validating will be provided along with the doc ids. The structure will be very similar to the future NeuCLIR collection. Whether these two collections will be distributed through the same repository is TBD.

seanmacavaney commented 2 years ago

For the IDs, I suggest:

hc4 # empty top-level placeholder --- or a concatenation of all three corpora? (Would that be useful?)
hc4/{zh,fa,ru} # docs only (for the specified language)
hc4/{zh,fa,ru}/{train,dev,test} # docs (inherited from above), queries&qrels (for particular subset)
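
As a rough illustration of the hierarchy suggested above, the following sketch enumerates the dataset IDs it implies. The helper name `hc4_dataset_ids` is hypothetical and is not part of ir_datasets; it only expands the brace patterns listed above.

```python
from itertools import product

LANGS = ("zh", "fa", "ru")
SPLITS = ("train", "dev", "test")


def hc4_dataset_ids():
    """Hypothetical helper: list the IDs implied by the proposed layout."""
    ids = ["hc4"]  # top-level placeholder
    ids += [f"hc4/{lang}" for lang in LANGS]  # docs only
    # docs (inherited) plus queries and qrels for each subset
    ids += [f"hc4/{lang}/{split}" for lang, split in product(LANGS, SPLITS)]
    return ids


if __name__ == "__main__":
    for dataset_id in hc4_dataset_ids():
        print(dataset_id)
```

Under this layout there would be one placeholder, three document-only IDs, and nine language/subset IDs.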

What's the nature of the report that's associated with each query? (e.g., how long is it?)

eugene-yang commented 2 years ago

Thanks, Sean. This makes sense. The top level should be just a placeholder. At this point, I don't think combining all three languages makes sense. Even though some topics span languages (i.e., same title and description), the narratives are different, so they should be considered different queries.

Reports range from 1 to 5 paragraphs. Conceptually, they are written by the analysts prior to the search to reflect some background of the information need.

eugene-yang commented 2 years ago

Forgot to mention: we also have human and machine translations of the titles and descriptions. All titles and descriptions have MT in all 3 languages. Human translations are only available for the languages that we have qrels for. For example, if topic 1 is available for Chinese and Farsi but not Russian, its title and description would come with English, human-translated Chinese, human-translated Farsi, and MT Chinese, Farsi, and Russian.
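
To make the availability rule concrete, here is a minimal sketch that lists which title variants would exist for a topic, given the languages it has qrels for. The function name and the field-name pattern are illustrative assumptions, not an actual HC4 or ir_datasets API.

```python
ALL_LANGS = ("zh", "fa", "ru")


def available_title_fields(judged_langs):
    """Hypothetical helper: title variants available for one topic.

    Assumes the rule described above: MT exists for all three languages,
    human translation (HT) only for languages that have qrels.
    """
    fields = ["title_en"]
    fields += [f"ht_title_{lang}" for lang in ALL_LANGS if lang in judged_langs]
    fields += [f"mt_title_{lang}" for lang in ALL_LANGS]
    return fields


if __name__ == "__main__":
    # Topic judged in Chinese and Farsi but not Russian
    print(available_title_fields({"zh", "fa"}))
```

For a topic judged in Chinese and Farsi, this yields the English title, two human translations, and three machine translations.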

I am thinking about something like this, leaving fields that are not available as None. But this looks tedious and not really robust.

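from typing import NamedTuple, Optional

class HC4Query(NamedTuple):
    query_id: str
    title_en: str
    ht_title_zh: Optional[str]  # None when no human translation exists
    ht_title_fa: Optional[str]
    ht_title_ru: Optional[str]
    mt_title_zh: str
    mt_title_fa: str
    mt_title_ru: str
    # and other stuff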

Alternatively, we could make it another level in the dataset id. So something like this --

hc4 # empty top-level placeholder --- or a concatenation of all three corpora? (Would that be useful?)
hc4/{zh,fa,ru} # docs only (for the specified language)
hc4/{zh,fa,ru}/{train,dev,test} # docs (inherited from above), qrels (for particular subset)
hc4/{zh,fa,ru}/{train,dev,test}/{org,ht,mt} # queries with different languages/sources

But this structure will lose the ability to provide MT queries that don't come with qrels (i.e. the MT translated title/descriptions in Russian for topic 1 in the example above).

What do you think, @seanmacavaney?

seanmacavaney commented 2 years ago

There's always something new!

What would you think about having it a dict with the keys as the language code? E.g.,

from typing import Dict, NamedTuple

class HC4Query(NamedTuple):
    query_id: str
    title: str # can we omit en from the title? Maybe it can be assumed since it's a CLIR task
    ht_titles: Dict[str, str] # e.g., {"zh": "XXX", "fa": "YYY"}
    mt_titles: Dict[str, str]
    # and other stuff

Pros:

Cons:

But wait, isn't this already under a particular language? E.g., hc4/zh/train. What would be the reason to have an English-to-Farsi query translation under a Chinese corpus? Am I misunderstanding HC4 -- is it a combined corpus?

eugene-yang commented 2 years ago

I am personally leaning toward keeping the flat structure to allow more/easier adaptations. How about a middle ground --

from typing import Dict, NamedTuple

class HC4Query(NamedTuple):
    query_id: str
    title: str # can we omit en from the title? Maybe it can be assumed since it's a CLIR task
    ht_titles: Dict[str, str] # e.g., {"zh": "XXX", "fa": "YYY"}
    mt_titles: Dict[str, str]
    # and other stuff

    # Only called when normal attribute lookup fails, so real fields are unaffected
    def __getattr__(self, key):
        parts = key.split("_") # e.g. "ht_zh_titles" -> ["ht", "zh", "titles"]
        if parts[0] == 'ht':
            return self.ht_titles[parts[1]]
        elif parts[0] == 'mt':
            return self.mt_titles[parts[1]]
        else:
            raise AttributeError(key)

> What would be the reason to have an English-to-Farsi query translation under a Chinese corpus? Am I misunderstanding HC4 -- is it a combined corpus?

There probably aren't any strong use cases. Our original intention was to provide them so that if people really want to test Farsi-to-Chinese CLIR, there is at least a translated query that they can use. But this is really an edge case.

seanmacavaney commented 2 years ago

> There probably aren't any strong use cases. Our original intention was to provide them so that if people really want to test Farsi-to-Chinese CLIR, there is at least a translated query that they can use. But this is really an edge case.

If that's the case, I'd say:

hc4 # empty top-level placeholder --- or a concatenation of all three corpora? (Would that be useful?)
hc4/{zh,fa,ru} # docs only (for the specified language)
hc4/{zh,fa,ru}/{train,dev,test} # docs (inherited from above), qrels (for particular subset)

With:

from typing import NamedTuple

class HC4Query(NamedTuple):
    query_id: str
    title: str # en
    ht_title: str # whatever language we're currently under (e.g., if this is `hc4/zh/train`, query is in zh)
    mt_title: str # same language as ht_title, but machine translated
    # and other stuff

I think this would be the easiest to use in most cases. If you want total CLIR, use title. If you want total monolingual, use ht_title. If you want CLIR that assumes the query has already been translated, use mt_title. And it will work with whatever language you're currently using for docs.
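
A small sketch of how a user might pick the query text under this scheme. The `HC4Query` fields mirror the proposal above; the setting names, the `FIELD_FOR_SETTING` mapping, and `query_text` are hypothetical and only illustrate the three use cases described.

```python
from typing import NamedTuple


class HC4Query(NamedTuple):
    query_id: str
    title: str     # English
    ht_title: str  # human translation into the corpus language
    mt_title: str  # machine translation into the corpus language


# Hypothetical mapping from retrieval setting to query field
FIELD_FOR_SETTING = {
    "clir": "title",                # English query, non-English docs
    "monolingual": "ht_title",      # query and docs in the same language
    "translated_clir": "mt_title",  # query already machine translated
}


def query_text(query, setting):
    """Return the query text to use for the given setting."""
    return getattr(query, FIELD_FOR_SETTING[setting])


if __name__ == "__main__":
    q = HC4Query("1", "english title", "human translation", "machine translation")
    for setting in FIELD_FOR_SETTING:
        print(setting, "->", query_text(q, setting))
```

Because the field names do not encode a language, the same selection logic would apply unchanged to any of the per-language datasets.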

fa-to-zh (or similar) CLIR would still be achievable if you, say, use [ht|mt]_title from hc4/fa/train with docs and qrels from hc4/zh/train. Which isn't so bad, especially for an edge case that you'd really have to know what you were doing anyway for.

eugene-yang commented 2 years ago

> fa-to-zh (or similar) CLIR would still be achievable if you, say, use [ht|mt]_title from hc4/fa/train with docs and qrels from hc4/zh/train. Which isn't so bad, especially for an edge case that you'd really have to know what you were doing anyway for.

So the issue with this is that the set of topics in hc4/fa/train would be different from the set in hc4/zh/train. Let's say topic 2 is judged exclusively in the Chinese subset. It will only appear in hc4/zh/train and not in hc4/fa/train. But in the fa-to-zh CLIR setting, topic 2 with its MT queries in Farsi is exactly what we want.

seanmacavaney commented 2 years ago

So that means they'd also have to filter down to the intersection of the topics in hc4/fa/train and hc4/zh/train?

I think I could live with that, considering that we expect this use case to be rare and the simplicity it gets us (and the users!) in the topic class.

The documentation could mention that some of the topics are aligned across languages to enable this.

eugene-yang commented 2 years ago

If that can be achieved by a filter, I think that is good enough for us. But consider this case --

Topic 1 -- judged in Chinese
Topic 2 -- judged in Farsi
Topic 3 -- judged in both Chinese and Farsi

And in the above scheme, we will have

hc4/zh/train -- containing topic 1 and 3 with [ht|mt]_titles both Chinese
hc4/fa/train -- containing topic 2 and 3 with [ht|mt]_titles both Farsi

I don't think we want to have all topic 1, 2, and 3 appear in both the hc4/fa/train and hc4/zh/train query iterators.

In this example, if we want to do fa-to-zh (queries in fa and documents in zh), we want topics 1+3 rather than 2+3, because those are the topics with zh qrels. That is, we want the queries in hc4/zh/train with their Farsi MT, but under this scheme the mt_title fields there will contain Chinese MT.

seanmacavaney commented 2 years ago

Yeah, that's why I was saying in this edge case, the user would have to take the intersection of qids in hc4/fa/train and hc4/zh/train -- leaving just Topic 3. They'd also have to filter the qrels, true.
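
The filtering described here can be sketched with plain in-memory data. The function `align_fa_to_zh` and the toy inputs are hypothetical; with ir_datasets, the inputs would presumably be built from `queries_iter()` and `qrels_iter()` on the two datasets, assuming the IDs end up as proposed.

```python
def align_fa_to_zh(fa_queries, zh_queries, zh_qrels):
    """Keep only topics present in both languages.

    fa_queries / zh_queries: dict mapping query_id -> query text
    zh_qrels: list of (query_id, doc_id, relevance) tuples
    """
    shared = fa_queries.keys() & zh_queries.keys()
    queries = {qid: fa_queries[qid] for qid in sorted(shared)}
    qrels = [row for row in zh_qrels if row[0] in shared]
    return queries, qrels


if __name__ == "__main__":
    # Toy data following the example: topic 1 zh only, 2 fa only, 3 both
    fa_queries = {"2": "fa text for topic 2", "3": "fa text for topic 3"}
    zh_queries = {"1": "zh text for topic 1", "3": "zh text for topic 3"}
    zh_qrels = [("1", "zh_doc_a", 1), ("3", "zh_doc_b", 1), ("3", "zh_doc_c", 0)]

    queries, qrels = align_fa_to_zh(fa_queries, zh_queries, zh_qrels)
    print(queries)
    print(qrels)
```

With the toy data, only topic 3 survives, along with its two zh judgments. Topic 1 is dropped even though it has zh qrels, which is the limitation raised above.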

eugene-yang commented 2 years ago

Agreed. And for users who really need the full capability of fa-to-zh CLIR, they can also go back to the original source to extract the queries.