Closed: bpiwowar closed this pull request 1 month ago.
Great, thanks!
I won't have much time to contribute directly to this for the next week or so.
Regarding the tasks:
WAPO in v3 (2021) has no clearly associated duplicate file (as is the case in 2022), so we have two nearly identical collections. Is this right?
This is my understanding, yes. We can prefix each collection by year.
Decide where to put the file that computes the offsets (in case it is needed later)
My vote is to put it on huggingface, and potentially include a backup location too. I can add it under https://huggingface.co/irds.
OK this is done (uploaded offset files on HF).
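For context, an offset file of this kind can be produced by a simple byte-offset scan over the source file. This is a hypothetical sketch assuming one JSON document per line; the actual format of the uploaded offset files may differ:

```python
import json

def compute_offsets(path):
    """Record the starting byte offset of each line-delimited document.

    Illustrative only: assumes one JSON document per line.
    """
    offsets = []
    with open(path, "rb") as f:
        while True:
            offsets.append(f.tell())
            if not f.readline():
                offsets.pop()  # the last tell() was at EOF
                break
    return offsets

def read_doc(path, offsets, i):
    """Random access to document i using the precomputed offsets."""
    with open(path, "rb") as f:
        f.seek(offsets[i])
        return json.loads(f.readline())
```

With offsets precomputed once, any document can then be fetched with a single seek instead of a linear scan.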
I see you support Python 3.7; is that still intended, or are you planning to switch to 3.8?
Given that 3.7 has reached end of life, I think it's reasonable to bump the minimum Python version to 3.8, especially if there are features from the core library that you want to use from 3.8.
OK great – should I modify the CI myself, or will you do it in the main branch?
Remaining quick question (for now): should I move the generic classes outside of `trec_cast`, and if so, where (see the first post in the PR)?
I will test the PR a bit more thoroughly by testing my models on TREC CaST, so there is no urgency.
The generic classes can move outside the cast directory. I'm keen to apply them in other settings.
You can modify the CI in this branch, that's alright.
Can you give "write" access temporarily to https://huggingface.co/irds (so I can transfer the ownership), or otherwise copy the files from https://huggingface.co/datasets/bpiwowar/trec_cast_offsets/tree/main?
Done!
I'm having trouble running some of the tests locally:
/opt/miniconda3/lib/python3.10/importlib/__init__.py:126: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
test/util/docs/test_subset.py:2: in <module>
from .data import FakeDocs
E ModuleNotFoundError: No module named 'docs.data'
Is there a missing test/util/docs/data.py file?
Yep – pushed with the missing file.
Great, thanks!
Sorry this is taking a bit. Supporting lookups by index is a bit tricky with the prefixing approach.
Yes, but it avoids building docstores in those cases. Feel free to change this behavior if you think it is better (or maybe it could be optional?).
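The index-lookup wrinkle mentioned above can be handled without materialization by mapping a global index onto the sub-collections through cumulative counts. A hypothetical sketch (the class and method names are illustrative, not the PR's actual API):

```python
from bisect import bisect_right
from itertools import accumulate

class PrefixedUnion:
    """Illustrative union of sub-collections, each with a doc_id prefix.

    `parts` is a list of (prefix, docs) pairs, where `docs` is any
    sequence supporting len() and integer indexing.
    """
    def __init__(self, parts):
        self.parts = parts
        # Cumulative sizes: global index i belongs to the first part
        # whose cumulative count exceeds i.
        self.cumulative = list(accumulate(len(docs) for _, docs in parts))

    def __len__(self):
        return self.cumulative[-1] if self.cumulative else 0

    def lookup_index(self, i):
        """Map a global index to (prefixed doc_id, doc)."""
        if not 0 <= i < len(self):
            raise IndexError(i)
        part = bisect_right(self.cumulative, i)
        local = i - (self.cumulative[part - 1] if part else 0)
        prefix, docs = self.parts[part]
        doc = docs[local]
        return f"{prefix}{doc['id']}", doc
```

The binary search over cumulative counts keeps index lookups O(log k) in the number of sub-collections, with no combined store on disk.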
I'm a bit curious -- do you have a specific use case for breaking apart the 3 sub-datasets?
I see the advantage in letting them be reused across years of the task, saving some storage. But it also adds complexity compared to merging them all into one.
Not really – apart from the fact that all three datasets can be used in other experiments, and a soft merge allows re-using these docstores rather than creating new ones (and these are quite big datasets already).
Tell me whether you want the docstore for prefix-based unions to be materialized or not – I tend to favour "not" (especially since this is a shared class where we can allow for some complexity), but I would be happy with either (especially since most people will use the passage version anyway). Another possibility is a default setting that could be changed.
I think I'm leaning toward materializing the unions, on the basis of reducing the complexity.
If we're doing that, the dataset becomes a relatively routine one -- an iterator over the sources in the base case (performing passaging from the offset files), which is materialized as a combined docstore when the user performs a lookup by index or doc_id.
Although the storage cost is considerable (~80 GB), it's also not a crazy amount. The biggest drawback is that each year ends up with a variation of the corpus, which adds the storage cost for each version.
I modified the code so the LZ4 store is built (I kept the "on-the-fly" code though, since for simple cases where only a prefix is added it might still be useful).
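For the simple prefix-only case mentioned above, an on-the-fly doc_id lookup amounts to stripping a known prefix and delegating to the underlying store. A minimal sketch, assuming each sub-store maps un-prefixed doc_ids to documents (not the PR's actual classes):

```python
class PrefixRoutingStore:
    """Route doc_id lookups to sub-stores by their collection prefix.

    Illustrative only: each sub-store is modeled here as a plain dict
    from un-prefixed doc_id to document.
    """
    def __init__(self, stores):
        # stores: mapping from prefix (e.g. "2021:") to a sub-store
        self.stores = stores

    def get(self, doc_id):
        for prefix, store in self.stores.items():
            if doc_id.startswith(prefix):
                doc = store.get(doc_id[len(prefix):])
                if doc is not None:
                    return doc
        raise KeyError(doc_id)
```

Nothing is duplicated on disk: the existing per-collection docstores are reused, and only the prefix routing happens at lookup time.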
Thanks! I'm building the tests and such now :)
I think it's mostly there!
I'm only having some problems with v2. First, when I try to do a lookup, it fails in `DocsSubset.docs_count`, I think because `docs_count()` isn't known for a filtered dataset. Second, although I'm not super familiar with CAST, I think the v2 corpus returns individual passages instead of entire documents? [specification here] Otherwise the qrels won't align with the doc_ids returned.
Great!
For v2 (if I understood correctly), assessments are at the document level, even though there is an official passage corpus. I added a `v2/passages` that concatenates all the passages from the three datasets.
Can you give me the command for the lookup you are testing with?
Ran both the integration tests and the tests in `util/docs`.
I have 2 out of 3 `util/docs` tests failing:
FAILED test/util/docs/test_multiple.py::test_multiple_prefixes - NameError: name 'OtherDoc' is not defined
Self-explanatory: the `OtherDoc` being set as `docs_cls` on line 47 of `test_multiple.py` isn't defined anywhere in the test code.
FAILED test/util/docs/test_multiple.py::test_multiple_prefixes_inlined - AttributeError: 'NoneType' object has no attribute '_docs_cls'
The error is coming from line 74: `assert all_docs.docs_cls() == GenericDoc`
As for the integration tests:
FAILED test/integration/trec_cast.py::TestTrecCast::test_queries - AssertionError: '' != None
OK, the integration tests should be fixed now
The tests in `test_multiple.py` are passing now! But `TestTrecCast::test_queries` is still failing:
FAILED test/integration/trec_cast.py::TestTrecCast::test_queries - AssertionError: '' != None
This is the entire error trace:
test/integration/trec_cast.py:61:
test/integration/base.py:61: in _test_queries
self._assert_namedtuple(query, items[i])
test/integration/base.py:266: in _assert_namedtuple
self.assertEqual(v_a, v_b)
E AssertionError: '' != None
The tests should run now.
Thanks @bpiwowar and @andreaschari !
This pull request is for TREC CaST 2019 to 2022.
Generic classes
For the moment, it contains generic handler classes that might be moved to other places (and need to be tested):
- `PrefixedDocs`: allows using document collections that are merged by adding a collection-specific prefix to the ID of each document
- `DocsSubset` and `Dupes`: handle duplicates in any document collection

Cleanup before merge
I suggest the following steps before merging with master:
- `downloads.json` (so move them)