Closed: bpiwowar closed this pull request 1 month ago.
Great, thanks!
I won't have much time to contribute directly to this for the next week or so.
Regarding the tasks:
WAPO in v3 (2021) has no clearly associated duplicate file (as is the case in 2022), so we have two nearly identical collections. Is this right?
This is my understanding, yes. We can prefix each collection by year.
Decide where to put the file that computes the offsets (in case it is needed later)
My vote is to put it on huggingface, and potentially include a backup location too. I can add it under https://huggingface.co/irds.
OK this is done (uploaded offset files on HF).
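For context, an offset file of this kind can be produced by a simple byte-offset scan over the source file. This is a hypothetical sketch assuming one JSON document per line; the actual format of the uploaded offset files may differ:

```python
import json

def compute_offsets(path):
    """Record the starting byte offset of each line-delimited document.

    Illustrative only: assumes one JSON document per line.
    """
    offsets = []
    with open(path, "rb") as f:
        while True:
            offsets.append(f.tell())
            if not f.readline():
                offsets.pop()  # the last tell() was at EOF
                break
    return offsets

def read_doc(path, offsets, i):
    """Random access to document i using the precomputed offsets."""
    with open(path, "rb") as f:
        f.seek(offsets[i])
        return json.loads(f.readline())
```

With offsets precomputed once, any document can then be fetched with a single seek instead of a linear scan.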
I see you support Python 3.7; is that still intended, or are you planning to switch to 3.8?
Given that 3.7 has reached end of life, I think it's reasonable to bump the minimum Python version to 3.8, especially if there are features from the core library that you want to use from 3.8.
OK great – should I modify the CI myself, or will you do it in the main branch?
Remaining quick question (for now): should I move the generic classes outside of `trec_cast`, and if so, where (see the first post in the PR)?
I will test the PR a bit more thoroughly by testing my models on TREC CaST, so there is no urgency.
The generic classes can move outside the cast directory. I'm keen to apply them in other settings.
You can modify the CI in this branch, that's alright.
Can you give "write" access temporarily to https://huggingface.co/irds (so I can transfer the ownership), or otherwise copy the files from https://huggingface.co/datasets/bpiwowar/trec_cast_offsets/tree/main?
Done!
I'm having trouble running some of the tests locally:
/opt/miniconda3/lib/python3.10/importlib/__init__.py:126: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
test/util/docs/test_subset.py:2: in <module>
from .data import FakeDocs
E ModuleNotFoundError: No module named 'docs.data'
Is there a missing test/util/docs/data.py file?
Yep – pushed with the missing file.
Great, thanks!
Sorry this is taking a bit. Supporting lookups by index is a bit tricky with the prefixing approach.
Yes, but it avoids building docstores in those cases. Feel free to change this behavior if you think it is better (or maybe it could be optional?).
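The index-lookup wrinkle mentioned above can be handled without materialization by mapping a global index onto the sub-collections through cumulative counts. A hypothetical sketch (the class and method names are illustrative, not the PR's actual API):

```python
from bisect import bisect_right
from itertools import accumulate

class PrefixedUnion:
    """Illustrative union of sub-collections, each with a doc_id prefix.

    `parts` is a list of (prefix, docs) pairs, where `docs` is any
    sequence supporting len() and integer indexing.
    """
    def __init__(self, parts):
        self.parts = parts
        # Cumulative sizes: global index i belongs to the first part
        # whose cumulative count exceeds i.
        self.cumulative = list(accumulate(len(docs) for _, docs in parts))

    def __len__(self):
        return self.cumulative[-1] if self.cumulative else 0

    def lookup_index(self, i):
        """Map a global index to (prefixed doc_id, doc)."""
        if not 0 <= i < len(self):
            raise IndexError(i)
        part = bisect_right(self.cumulative, i)
        local = i - (self.cumulative[part - 1] if part else 0)
        prefix, docs = self.parts[part]
        doc = docs[local]
        return f"{prefix}{doc['id']}", doc
```

The binary search over cumulative counts keeps index lookups O(log k) in the number of sub-collections, with no combined store on disk.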
I'm a bit curious -- do you have a specific use case for breaking apart the 3 sub-datasets?
I see the advantage in letting them be reused across years of the task, saving some storage. But it also adds complexity compared to merging them all into one.
Not really – apart from the fact that all three datasets can be used in other experiments, and a soft merge allows re-using these docstores rather than creating new ones (and these are quite big datasets already).
Tell me whether you want the docstore for prefix-based unions to be materialized or not – I tend to favour "not" (especially since this is a shared class where we can allow for some complexity), but I would be happy with either (especially since most people will use the passage version anyway). Another possibility is a default setting that could be changed.
I think I'm leaning toward materializing the unions, on the basis of reducing the complexity.
If we're doing that, the dataset becomes a relatively routine one -- an iterator over the sources in the base case (performing passaging from the offset files), which is materialized as a combined docstore when the user performs a lookup by index or doc_id.
Although the storage cost is considerable (~80 GB), it's also not a crazy amount. The biggest drawback is that each year ends up with a variation of the corpus, which adds the storage cost for each version.
I modified the code so the LZ4 store is built (I kept the "on-the-fly" code though, since for simple cases where only a prefix is added it might still be useful).
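For the simple prefix-only case mentioned above, an on-the-fly doc_id lookup amounts to stripping a known prefix and delegating to the underlying store. A minimal sketch, assuming each sub-store maps un-prefixed doc_ids to documents (not the PR's actual classes):

```python
class PrefixRoutingStore:
    """Route doc_id lookups to sub-stores by their collection prefix.

    Illustrative only: each sub-store is modeled here as a plain dict
    from un-prefixed doc_id to document.
    """
    def __init__(self, stores):
        # stores: mapping from prefix (e.g. "2021:") to a sub-store
        self.stores = stores

    def get(self, doc_id):
        for prefix, store in self.stores.items():
            if doc_id.startswith(prefix):
                doc = store.get(doc_id[len(prefix):])
                if doc is not None:
                    return doc
        raise KeyError(doc_id)
```

Nothing is duplicated on disk: the existing per-collection docstores are reused, and only the prefix routing happens at lookup time.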
Thanks! I'm building the tests and such now :)
I think it's mostly there!
I'm only having some problems with v2. First, when I try to do a lookup, it fails in `DocsSubset.docs_count`, I think because `docs_count()` isn't known for a filtered dataset. Second, although I'm not super familiar with CAST, I think the v2 corpus returns individual passages instead of entire documents? [specification here] Otherwise the qrels won't align with the doc_ids returned.
Great!
For v2 (if I understood correctly), assessments are at the document level, even though there is an official passage corpus. I added a `v2/passages` that concatenates all the passages from the three datasets.
Can you give me the command for the lookup you are testing with?
Ran both the integration tests and the tests in `util/docs`.
I have 2 out of 3 `util/docs` tests failing:
FAILED test/util/docs/test_multiple.py::test_multiple_prefixes - NameError: name 'OtherDoc' is not defined
Self-explanatory: the `OtherDoc` being set as `docs_cls` on line 47 of `test_multiple.py` isn't defined anywhere in the test code.
FAILED test/util/docs/test_multiple.py::test_multiple_prefixes_inlined - AttributeError: 'NoneType' object has no attribute '_docs_cls'
The error is coming from line 74: `assert all_docs.docs_cls() == GenericDoc`
As for the integration tests:
FAILED test/integration/trec_cast.py::TestTrecCast::test_queries - AssertionError: '' != None
OK, the integration tests should be fixed now
The tests in `test_multiple.py` are passing now! But `TestTrecCast::test_queries` is still failing:
FAILED test/integration/trec_cast.py::TestTrecCast::test_queries - AssertionError: '' != None
This is the entire error trace:
test/integration/trec_cast.py:61:
test/integration/base.py:61: in _test_queries
self._assert_namedtuple(query, items[i])
test/integration/base.py:266: in _assert_namedtuple
self.assertEqual(v_a, v_b)
E AssertionError: '' != None
The tests should run now.
Thanks @bpiwowar and @andreaschari !
This pull request is for TREC CaST 2019 to 2022.
Generic classes
For the moment, it contains generic handler classes that might be moved to other places (and need to be tested):
- `PrefixedDocs`: allows using document collections that are merged by adding a collection-specific prefix to the ID of each document
- `DocsSubset` and `Dupes`: handle duplicates in any document collection

Cleanup before merge
I suggest the following steps before merging with master:
- `downloads.json` (so move them)