allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0
317 stars 42 forks source link

Clueweb22 #213

Closed janheinrichmerker closed 5 months ago

janheinrichmerker commented 1 year ago

I'd like to keep this PR as a way of tracking progress of the ir_datasets integration for ClueWeb22. Of course, the implementation is far from finished (as you can see by the numerous todo's :laughing:). But I figure that keeping the process open to other contributors might encourage valuable feedback and discussion.

And of course, this PR would close #210 :wink:

seanmacavaney commented 1 year ago

Wow-- thanks! Seems to be coming along nicely. The vdom structure is a bit complicated, but I guess it needs to be in order to properly represent the data.

janheinrichmerker commented 1 year ago

Yep, I haven't started with the VDOM type yet, but will as soon as the documentation is up.

janheinrichmerker commented 1 year ago

@seanmacavaney I think, now everything is ready. We just have the B category subset and therefore cannot test the A and L categories. Therefore, I've temporarily hidden those IDs, until we get a chance to test that as well (the parsers are ready for those already).

Added IDs:

janheinrichmerker commented 1 year ago

Just applied some minor fixes for earlier Python versions.

seanmacavaney commented 1 year ago

Awesome, thanks! A few other nits:

seanmacavaney commented 1 year ago

Thinking a bit more about this... I sorta feel that the primary case will be the _Txt version. Might it make sense to have the alternate formats as separate datasets, like so:

Provides txt data

Provides html

Provides vdom

Provides links

I don't think this is so far off from what's there now. It should also be faster for the primary case, since it won't need to pull a bunch of data from multiple sources and collate it.

janheinrichmerker commented 1 year ago

I disagree about text being the default use case. Users will probably expect the types listed in the table from the ClueWeb22 website and paper to be present in the ir_datasets records. So, clueweb22/l should have text, clueweb22/a should have HTML, and clueweb22/b should have screenshots (at least once they're released).

And the Touché 2023 shared task is a good example here. Participants can use whatever they want from the B category. In my opinion, having text documents as the default in ir_datasets would lead to confusions here.

But you're right about the performance and that's exactly the use case I imagined for the "subset views". If I only need the text, then I can "view" only the text part of clueweb22/b, for instance.

What about renaming them like this:

I think this way it is more explicit that the clueweb22/b/no-png subset view would actually exclude data that is normally part of the category B subset.

janheinrichmerker commented 1 year ago

Allright, I'll just backport the @cached_property bits then. And it should be fine to defer the version compatibility check (and hence expecting the directory) to creating the doc iterator. Consider it done :wink:

seanmacavaney commented 1 year ago

I see your points and I think I agree with some of them. I could probably be convinced. However, let me make a more complete case in favor of a text-only default:

  1. Most (existing) systems currently rely only on the text, so nearly any baseline will end up just tossing out all the other information.
  2. Let's say there's a new system that uses some of the other structured data -- let's say the vdom. This often relies on a format that's specific to the dataset itself. So a system would need to be built specifically for handling the CW22 format, which inherently won't transfer to other datasets. Not saying it won't happen, it's just less common. There's many of examples of this, e.g., wapo, cord19, etc. Lots of rich structured data, but it's almost always ignored in favour of a simple text format.
  3. There's overhead in pulling data from additional sources to create records. I don't yet have my copy, but I imagine that the text data will be much smaller than vdom/html/etc. So less data overall. Further, reading a single file sequentially from disk should be faster than reading from multiple files, which will need to jump around more.
  4. All 3 splits of CW22 include text, so it's a uniform default format across the splits.
  5. The other formats will be available through their respective clueweb22/vdom etc, so they're still there and easy for folks to use if they want them. To me, it seems more straightforward to ask for what data you want rather than what data you don't want.
  6. Alternative formats could be added more easily that fit nicely into this structure. E.g., if somebody runs a doc2query over all the documents, it could be added under a new clueweb22/d2q namespace, rather than always loading it for every record.
  7. Would probably simplify the code, if there's no need for composite records, merging, etc.

cord19 is an example where we made a similar decision -- most folks work with the title+abstract text, which is easy and fast to load. There's a separate cord19/fulltext that includes the full article text, which is considerably more expensive to load (and would otherwise just be tossed out by most users).

janheinrichmerker commented 1 year ago

To refute some of your points:

janheinrichmerker commented 1 year ago

Fixed the issues with version assertions and @cached_property.

seanmacavaney commented 1 year ago

Thanks!

Looks like there are still some py36 incompatibilities: ImportError: cannot import name 'Final' from 'typing'.

My main hesitation remains that in my experience so far with the package, it seems that most users just care about having an easy way to get the text, even when loads of other nice structured data are available. So I'd like to make that case as easy and optimised as possible for folks. You make some reasonable counter-points, though, and I think I'm inclined to agree on the current path forward. But maybe it's worth getting some additional input before committing to it.

janheinrichmerker commented 1 year ago

My main hesitation remains that in my experience so far with the package, it seems that most users just care about having an easy way to get the text, even when loads of other nice structured data are available. So I'd like to make that case as easy and optimised as possible for folks.

Well, with the current approach it is already easy (just use clueweb22/b/text instead of clueweb22/b) and optimized (for clueweb22/b/text we would only look at the text files, no WARC is touched).

So why is it a problem to have users explicitly choose clueweb22/b/text if they only care about the text?

I'm now going to test everything with a Python 3.6 interpreter, just to be sure.

janheinrichmerker commented 1 year ago

I'd like to add that there are also datasets in ir_datasets where the derived datasets are a suffix to the original dataset:

So I don't see a general pattern for preferring shorter IDs for the "only text"-version.

janheinrichmerker commented 1 year ago

That should have been the last few 3.7-incompatible things.

seanmacavaney commented 1 year ago

Awesome, thanks!

seanmacavaney commented 1 year ago

Maybe I'd feel a bit more comfortable if we had some performance benchmarks. E.g., how fast is it to iterate the first 100k documents for the combined vs text-only versions?

janheinrichmerker commented 1 year ago

These might not be too accurate as I'm accessing the files remotely via CephFS but here you go:

[INFO] [starting] first 100k docs, just text
100000it [00:07, 12524.57it/s]
[INFO] [finished] first 100k docs, just text [8.06s]
[INFO] [starting] first 100k docs, with html, txt, vdom, inlink, outlink
[WARNING] URL hash mismatch for clueweb22-de0000-00-13406: txt URL hash was 9D5A53C6ACCB07B2C2319A4E5E44AB76 but html URL hash was B6956297B5EBBDFEAABF458F2FA5EADC
[WARNING] URL mismatch for clueweb22-de0000-00-13406: outlink URL was https://www.jovanovic.com/quotidien.htm but html URL was https://www.jovanna.de/
[WARNING] URL hash mismatch for clueweb22-de0000-00-13406: outlink URL hash was 9D5A53C6ACCB07B2C2319A4E5E44AB76 but html URL hash was B6956297B5EBBDFEAABF458F2FA5EADC
[WARNING] URL hash mismatch for clueweb22-de0000-01-14834: txt URL hash was 612691A107701D76AD36FD32F8608F3C but html URL hash was 825E120CE7F82C8B0268440A59107D04
[WARNING] URL mismatch for clueweb22-de0000-01-14834: inlink URL was https://simon.ccbcmd.edu/pls/PROD/bwskalog.p_disploginnew?in_id=&cpbl=&newid= but html URL was https://simon-transporte.com/
[WARNING] URL hash mismatch for clueweb22-de0000-01-14834: inlink URL hash was 612691A107701D76AD36FD32F8608F3C but html URL hash was 825E120CE7F82C8B0268440A59107D04
[WARNING] URL mismatch for clueweb22-de0000-01-14834: outlink URL was https://simon.ccbcmd.edu/pls/PROD/bwskalog.p_disploginnew?in_id=&cpbl=&newid= but html URL was https://simon-transporte.com/
[WARNING] URL hash mismatch for clueweb22-de0000-01-14834: outlink URL hash was 612691A107701D76AD36FD32F8608F3C but html URL hash was 825E120CE7F82C8B0268440A59107D04
100000it [03:04, 541.70it/s]
[INFO] [finished] first 100k docs, with html, txt, vdom, inlink, outlink [03:05]

As expected parsing the WARC files is 22x slower than just reading the JSONL file.

seanmacavaney commented 1 year ago

Great news, my copy of the CW22 drive arrived.

janheinrichmerker commented 1 year ago

Great to hear that!

janheinrichmerker commented 1 year ago

I've updated the branch to reflect upstream changes and added default_text() implementations.

janheinrichmerker commented 1 year ago

Is anything still blocking the merge?

seanmacavaney commented 1 year ago

Sorry -- the only thing blocking is finding the time to run through the tests on my end.

janheinrichmerker commented 7 months ago

Hey @seanmacavaney, have you found time to run the tests? Now that the ClueWeb22 is used in a number of research papers, I really think it would be worth it to add it to ir_datasets. If there is anything I can help with, please let me know.

janheinrichmerker commented 5 months ago

Closing this PR in favor of the new ir-datasets-clueweb22 extension.