Closed janheinrichmerker closed 5 months ago
Wow-- thanks! Seems to be coming along nicely. The vdom structure is a bit complicated, but I guess it needs to be in order to properly represent the data.
Yep, I haven't started with the VDOM type yet, but will as soon as the documentation is up.
@seanmacavaney I think everything is ready now. We only have the category B subset, so we cannot test the A and L categories. I've therefore temporarily hidden those IDs until we get a chance to test them as well (the parsers for them are already in place).
Just applied some minor fixes for earlier Python versions.
Awesome, thanks! A few other nits:
Thinking a bit more about this... I sorta feel that the primary case will be the _Txt version. Might it make sense to have the alternate formats as separate datasets, like so:
Provides txt data
Provides html
Provides vdom
Provides links
I don't think this is so far off from what's there now. It should also be faster for the primary case, since it won't need to pull a bunch of data from multiple sources and collate it.
I disagree about text being the default use case. Users will probably expect the types listed in the table from the ClueWeb22 website and paper to be present in the ir_datasets records.
So, clueweb22/l should have text, clueweb22/a should have HTML, and clueweb22/b should have screenshots (at least once they're released).
And the Touché 2023 shared task is a good example here. Participants can use whatever they want from the B category.
In my opinion, having text documents as the default in ir_datasets would lead to confusion here.
But you're right about the performance, and that's exactly the use case I imagined for the "subset views". If I only need the text, then I can "view" only the text part of clueweb22/b, for instance.
What about renaming them like this:
clueweb22
clueweb22/b
clueweb22/b/de
clueweb22/b/en
clueweb22/b/zh
clueweb22/b/other-languages
clueweb22/b/no-png
clueweb22/b/no-png/no-html
clueweb22/a
clueweb22/a/de
clueweb22/a/en
clueweb22/a/zh
clueweb22/a/other-languages
clueweb22/a/no-html
clueweb22/l
clueweb22/l/de
clueweb22/l/en
clueweb22/l/zh
clueweb22/l/other-languages
I think this way it is more explicit that the clueweb22/b/no-png subset view would actually exclude data that is normally part of the category B subset.
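To make the naming proposal concrete, here is a minimal sketch of how such IDs could be resolved into a language filter and a set of formats to load. Everything in it is hypothetical: `resolve_view`, `CATEGORY_FORMATS`, and the per-category format sets are illustrative assumptions, not the actual ir_datasets implementation or the official ClueWeb22 category contents. The sketch only handles category-level IDs, not the bare clueweb22 ID.

```python
from typing import Optional, Set, Tuple

# Assumed format sets per category (illustrative only, not the official contents).
CATEGORY_FORMATS = {
    "l": {"txt"},
    "a": {"txt", "html"},
    "b": {"txt", "html", "png"},
}
LANGUAGES = {"de", "en", "zh", "other-languages"}


def resolve_view(dataset_id: str) -> Tuple[Optional[str], Set[str]]:
    """Return (language or None, formats to load) for a proposed dataset ID."""
    parts = dataset_id.split("/")
    if len(parts) < 2 or parts[0] != "clueweb22" or parts[1] not in CATEGORY_FORMATS:
        raise ValueError("Unknown dataset ID: " + dataset_id)
    # Start from everything the category provides, then narrow it down.
    formats = set(CATEGORY_FORMATS[parts[1]])
    language = None
    for part in parts[2:]:
        if part in LANGUAGES:
            language = part
        elif part.startswith("no-"):
            # A "no-<format>" suffix removes that format from the view.
            formats.discard(part[len("no-"):])
        else:
            raise ValueError("Unknown ID component: " + part)
    return language, formats


language, formats = resolve_view("clueweb22/b/no-png/no-html")
print(language, sorted(formats))
```

Under these assumptions, each "no-" suffix visibly subtracts from the category's full record, which is the point of the naming scheme.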
All right, I'll just backport the @cached_property bits then.
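For context, `functools.cached_property` only exists from Python 3.8 onwards, so older interpreters need a fallback. A minimal sketch of one possible backport follows; it is not necessarily what the PR does, and the `Corpus` class is a hypothetical example.

```python
try:
    from functools import cached_property  # Python 3.8+
except ImportError:
    class cached_property:
        """Simplified fallback: compute once, then store on the instance."""

        def __init__(self, func):
            self.func = func
            self.attrname = func.__name__
            self.__doc__ = func.__doc__

        def __get__(self, instance, owner=None):
            if instance is None:
                return self
            value = self.func(instance)
            # Because this is a non-data descriptor, the instance attribute
            # shadows it on every later lookup, so func runs only once.
            instance.__dict__[self.attrname] = value
            return value


class Corpus:  # hypothetical example class
    def __init__(self):
        self.loads = 0

    @cached_property
    def index(self):
        self.loads += 1
        return {"loaded": True}


corpus = Corpus()
corpus.index
corpus.index
print(corpus.loads)
```

Unlike the standard library version in some Python releases, this fallback does no locking, which is an accepted simplification here.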
And it should be fine to defer the version compatibility check (and hence expecting the directory) to creating the doc iterator.
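A rough sketch of what deferring the check could look like: the constructor only stores the path, and the directory and version are verified when the document iterator is created. The class name, the `VERSION` marker file, and the version string are all hypothetical and chosen purely for illustration.

```python
import os

SUPPORTED_VERSION = "1.0"  # hypothetical version string


class LazyCheckedDocs:
    def __init__(self, path):
        # Only remember the path; do not touch the file system yet.
        self.path = path

    def _check_corpus(self):
        if not os.path.isdir(self.path):
            raise FileNotFoundError("Corpus directory not found: " + self.path)
        version_file = os.path.join(self.path, "VERSION")  # hypothetical marker file
        with open(version_file) as f:
            version = f.read().strip()
        if version != SUPPORTED_VERSION:
            raise RuntimeError("Unsupported corpus version: " + version)

    def docs_iter(self):
        # The check runs here, when iteration is actually requested.
        self._check_corpus()
        names = sorted(n for n in os.listdir(self.path) if n != "VERSION")
        return iter(names)
```

With this shape, constructing the object (for example, when listing available dataset IDs) never fails just because the corpus directory is missing.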
Consider it done :wink:
I see your points and I think I agree with some of them. I could probably be convinced. However, let me make a more complete case in favor of a text-only default:
[...] wapo, cord19, etc. Lots of rich structured data, but it's almost always ignored in favour of a simple text format.
[...] clueweb22/vdom etc, so they're still there and easy for folks to use if they want them. To me, it seems more straightforward to ask for what data you want rather than what data you don't want.
[...] clueweb22/d2q namespace, rather than always loading it for every record.
cord19 is an example where we made a similar decision -- most folks work with the title+abstract text, which is easy and fast to load. There's a separate cord19/fulltext that includes the full article text, which is considerably more expensive to load (and would otherwise just be tossed out by most users).
To refute some of your points:
[...] cord19 and therefore read the documentation.
[...] clueweb22/b/touche-2023 ID wouldn't work because we don't want to restrict participants to just use plain text. The clueweb22/b/html/touche-2023 ID wouldn't work either because then participants couldn't use the VDOM etc.
Fixed the issues with version assertions and @cached_property.
Thanks!
Looks like there are still some py36 incompatibilities: ImportError: cannot import name 'Final' from 'typing'.
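`typing.Final` was added in Python 3.8, which explains the error on older interpreters. One common workaround is a guarded import, sketched below. The runtime fallback class and the `MAX_DOCS` constant are hypothetical; the fallback only keeps annotations evaluable and is not understood by type checkers.

```python
try:
    from typing import Final  # available from Python 3.8
except ImportError:
    try:
        from typing_extensions import Final  # backport package, if installed
    except ImportError:
        class _FinalFallback:
            """Runtime no-op stand-in so `Final[...]` annotations still evaluate."""

            def __getitem__(self, item):
                return item

        Final = _FinalFallback()

MAX_DOCS: Final[int] = 100000  # hypothetical constant
```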
My main hesitation remains that in my experience so far with the package, it seems that most users just care about having an easy way to get the text, even when loads of other nice structured data are available. So I'd like to make that case as easy and optimised as possible for folks. You make some reasonable counter-points, though, and I think I'm inclined to agree on the current path forward. But maybe it's worth getting some additional input before committing to it.
My main hesitation remains that in my experience so far with the package, it seems that most users just care about having an easy way to get the text, even when loads of other nice structured data are available. So I'd like to make that case as easy and optimised as possible for folks.
Well, with the current approach it is already easy (just use clueweb22/b/text instead of clueweb22/b) and optimized (for clueweb22/b/text we would only look at the text files, no WARC is touched).
So why is it a problem to have users explicitly choose clueweb22/b/text if they only care about the text?
I'm now going to test everything with a Python 3.6 interpreter, just to be sure.
I'd like to add that there are also datasets in ir_datasets where the derived datasets are a suffix to the original dataset:
argsme/2020-04-01/processed is derived from argsme/2020-04-01
clueweb12/touche-2022-task-2/expanded-doc-t5-query is derived from clueweb12/touche-2022-task-2
cord19 is derived from cord19/fulltext
So I don't see a general pattern for preferring shorter IDs for the "only text"-version.
That should have been the last few 3.7-incompatible things.
Awesome, thanks!
Maybe I'd feel a bit more comfortable if we had some performance benchmarks. E.g., how fast is it to iterate the first 100k documents for the combined vs text-only versions?
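A benchmark of this kind can be sketched in a few lines. The helper below is a hypothetical illustration that works on any iterator; in practice the argument would be the document iterator of the dataset under test.

```python
import time
from itertools import islice


def time_first_n(docs, n):
    """Consume up to n items from an iterator; return (count, seconds)."""
    start = time.perf_counter()
    count = sum(1 for _ in islice(docs, n))
    elapsed = time.perf_counter() - start
    return count, elapsed


# Stand-in iterator; replace with the dataset's document iterator.
count, seconds = time_first_n(iter(range(1_000_000)), 100_000)
print(f"{count} docs in {seconds:.2f}s")
```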
These might not be too accurate as I'm accessing the files remotely via CephFS but here you go:
[INFO] [starting] first 100k docs, just text
100000it [00:07, 12524.57it/s]
[INFO] [finished] first 100k docs, just text [8.06s]
[INFO] [starting] first 100k docs, with html, txt, vdom, inlink, outlink
[WARNING] URL hash mismatch for clueweb22-de0000-00-13406: txt URL hash was 9D5A53C6ACCB07B2C2319A4E5E44AB76 but html URL hash was B6956297B5EBBDFEAABF458F2FA5EADC
[WARNING] URL mismatch for clueweb22-de0000-00-13406: outlink URL was https://www.jovanovic.com/quotidien.htm but html URL was https://www.jovanna.de/
[WARNING] URL hash mismatch for clueweb22-de0000-00-13406: outlink URL hash was 9D5A53C6ACCB07B2C2319A4E5E44AB76 but html URL hash was B6956297B5EBBDFEAABF458F2FA5EADC
[WARNING] URL hash mismatch for clueweb22-de0000-01-14834: txt URL hash was 612691A107701D76AD36FD32F8608F3C but html URL hash was 825E120CE7F82C8B0268440A59107D04
[WARNING] URL mismatch for clueweb22-de0000-01-14834: inlink URL was https://simon.ccbcmd.edu/pls/PROD/bwskalog.p_disploginnew?in_id=&cpbl=&newid= but html URL was https://simon-transporte.com/
[WARNING] URL hash mismatch for clueweb22-de0000-01-14834: inlink URL hash was 612691A107701D76AD36FD32F8608F3C but html URL hash was 825E120CE7F82C8B0268440A59107D04
[WARNING] URL mismatch for clueweb22-de0000-01-14834: outlink URL was https://simon.ccbcmd.edu/pls/PROD/bwskalog.p_disploginnew?in_id=&cpbl=&newid= but html URL was https://simon-transporte.com/
[WARNING] URL hash mismatch for clueweb22-de0000-01-14834: outlink URL hash was 612691A107701D76AD36FD32F8608F3C but html URL hash was 825E120CE7F82C8B0268440A59107D04
100000it [03:04, 541.70it/s]
[INFO] [finished] first 100k docs, with html, txt, vdom, inlink, outlink [03:05]
As expected, parsing the WARC files is about 23x slower than just reading the JSONL file.
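The slowdown factor can be checked against the numbers in the log above. A quick sketch using the reported throughput and wall-clock times (variable names are mine):

```python
# Figures copied from the benchmark log.
text_only_rate = 12524.57       # it/s, text only
combined_rate = 541.70          # it/s, html + txt + vdom + inlink + outlink
text_only_seconds = 8.06        # reported as [8.06s]
combined_seconds = 3 * 60 + 5   # reported as [03:05]

rate_ratio = text_only_rate / combined_rate
time_ratio = combined_seconds / text_only_seconds
print(f"{rate_ratio:.1f}x by throughput, {time_ratio:.1f}x by wall-clock time")
```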
Great news, my copy of the CW22 drive arrived.
Great to hear that!
I've updated the branch to reflect upstream changes and added default_text() implementations.
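For readers unfamiliar with the convention, a default_text() method on a document record returns the text that should be used when no specific field is requested. A minimal sketch with a hypothetical record type follows; the actual ClueWeb22 records in this PR have different and more numerous fields.

```python
from typing import NamedTuple


class ExampleDoc(NamedTuple):
    """Hypothetical record type, for illustration only."""
    doc_id: str
    url: str
    text: str

    def default_text(self):
        # Assumption: the plain text is the sensible default representation.
        return self.text


doc = ExampleDoc("example-0001", "https://example.com/", "Hello, world.")
print(doc.default_text())
```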
Is anything still blocking the merge?
Sorry -- the only thing blocking is finding the time to run through the tests on my end.
Hey @seanmacavaney, have you found time to run the tests? Now that the ClueWeb22 is used in a number of research papers, I really think it would be worth it to add it to ir_datasets. If there is anything I can help with, please let me know.
Closing this PR in favor of the new ir-datasets-clueweb22 extension.
I'd like to keep this PR as a way of tracking progress of the ir_datasets integration for ClueWeb22. Of course, the implementation is far from finished (as you can see by the numerous TODOs :laughing:). But I figure that keeping the process open to other contributors might encourage valuable feedback and discussion.
And of course, this PR would close #210 :wink: