CC-News-En Support - Githubissues

Dataset Information:

A large (40M document) news corpus derived from CCNews, with associated query variations (UQVs).

Currently, the corpus is best used for efficiency work, since there are no "official" qrels from pooled systems.

Links to Resources:

All data: https://go.unimelb.edu.au/u3nj (or https://cloudstor.aarnet.edu.au/plus/s/M8BvXxe6faLZ4uE)
- Documents: See the warc directory and the following (janky) script: https://github.com/JMMackenzie/CC-News-Tools/blob/master/get_warc.sh
- Queries: See queries/cc-news-queries.json
- Qrels: See queries/cc-news-per-topic-qrels.txt
Overview paper: https://dl.acm.org/doi/10.1145/3340531.3412762

Dataset ID(s):

cc-news-en - Document corpus cc-news-en/queries -- Queries and qrels

Supported Entities

[x] docs
[x] queries
[x] qrels
[] scoreddocs
[] docpairs

Additional comments/concerns/ideas/etc.

One problem is that the data is currently stored on AARNET and can be slow or difficult to download; there's almost 1TB of data here -- Perhaps supporting the full collection is not the best way forward right now.
Note that the qrels are "pseudo qrels" since the UQV generation task was based on a re-finding simulation, so there is only one qrel per topic.
In the queries file, you probably want to just use the uid and the spellnorm_query as the identifier/query pairs. Since there are multiple queries per topic, it's easiest to just re-map the identifiers when doing evaluation.
We also provide crowdworker data and keystroke-level data, a pre-built CIFF index from Anserini with pre-normalized queries, etc. But those are probably not useful here.

I'm happy to provide any support you may need in making the dataset more visible or accessible to the community, so let me know how I can help. Great project, by the way!

Very cool Joel! I don't have the time to add this straight away, but I'll record some of my thoughts for future reference and/or discussion.

Dataset IDs

I was tempted to collapse the two proposed IDs into just cc-news-en (much like vaswani, cranfield, trec-robust04, etc.), but the corpus is definitely general-purpose enough that other query sets could be built off it. Maybe something a little more descriptive than cc-news-en/queries for the query collection though. How about cc-news-en/reddit?

Downloads

I was getting a download speed of around 8-10MB/s. That would put downloading the entire collection at around 40 hours. That's not crazy, as the tweets2013-ia collection also takes around as long. (That's mostly due to the slow downloads from archive.org, rather than an enormous size.)

Does downloading multiple files in parallel help?

This is by far the largest download we'd have. Do we need anything special, like a check that there's adequate disk space before downloading? At least a warning is warranted.

The corpus will be another case where that needs to be exempt from the weekly download checks because it'll time out the github action. I need to think about alternatives here. Maybe a special action for this dataset that splits it up by source file, and does it less frequently than weekly?

Do we want to change the compression of the files to lz4 as we download to improve decompression speeds later? It can probably be done over the download stream itself without adding much overhead. How does this relate to the creation of checkpoint files (below)? Could these just be created on the fly, rather than hosted somewhere?

How do we handle the case where the user already has a copy of the dataset? Give them the opportunity to copy/link? How does this relate to the re-encoding above?

Document Lookups

This is almost certainly a situation where we'll look up from source, rather than building a lz4docstore. The doc IDs (e.g., CC-NEWS-20160826124520-00000-3) match the file structure (e.g., CC-NEWS-20160826124520-00000_ENG.warc.gz), so that's not a concern. The above file is ~500GB and contains 23,970 documents, so we probably want to do the zlib checkpoint trick we do for the cluewebs for speeding up lookups. (These individual files are actually larger than the clueweb source files, further motivating the use of checkpoint files.) I wonder if Joel & company would be willing to host the checkpoints?

HTML parsing

If we follow the model from the cluewebs, extracting of the textual content from the html would be handled by a wrapper. However, based on how most tools so far are using ir_datasets (relying on the dataset ID and do not provide an easy way to apply wrappers), we may need to retire the wrapper approach. This dataset could be a good place to try the alternative approach of providing multiple versions of the corpus with different processing. E.g., cc-news-en and cc-news-en/extracted. I suppose the predominant case would be to want the extracted text instead of the raw HTML, so maybe cc-news-en and cc-news-en/html instead?

As an aside, we need a faster way to extract raw text from HTML; bs4 is just too slow. I need to look more into what's available that handles messy/invalid HTML as well as bs4 does.

Queries and Qrels

Your suggestion above for mapping the IDs looks good to me.

Image/Audio/Keystroke/CIFF Data

These seem like nice resources, but I think we can ignore them for now, as the focus of the project is on textual data.

Thanks for checking this out Sean! I'll add some responses in line. I'll also reiterate that this isn't a priority, I just wanted to flag it and "wishlist" it for a future release.

Dataset IDs

I was tempted to collapse the two proposed IDs into just cc-news-en (much like vaswani, cranfield, trec-robust04, etc.), but the corpus is definitely general-purpose enough that other query sets could be built off it. Maybe something a little more descriptive than cc-news-en/queries for the query collection though. How about cc-news-en/reddit?

Agreed, that is pretty reasonable considering the way the queries were built.

Downloads

I was getting a download speed of around 8-10MB/s. That would put downloading the entire collection at around 40 hours. That's not crazy, as the tweets2013-ia collection also takes around as long. (That's mostly due to the slow downloads from archive.org, rather than an enormous size.)

Okay, good to know, that's OK considering the data would be coming from Australia!

Does downloading multiple files in parallel help?

You can definitely parallelize it. There are some recommended tools too, although I am not sure whether they will be necessary or not (I guess this is your call to make :-) See here.

Do we want to change the compression of the files to lz4 as we download to improve decompression speeds later? It can probably be done over the download stream itself without adding much overhead. How does this relate to the creation of checkpoint files (below)? Could these just be created on the fly, rather than hosted somewhere?

How do we handle the case where the user already has a copy of the dataset? Give them the opportunity to copy/link? How does this relate to the re-encoding above?

Document Lookups

This is almost certainly a situation where we'll look up from source, rather than building a lz4docstore. The doc IDs (e.g., CC-NEWS-20160826124520-00000-3) match the file structure (e.g., CC-NEWS-20160826124520-00000_ENG.warc.gz), so that's not a concern. The above file is ~500GB and contains 23,970 documents, so we probably want to do the zlib checkpoint trick we do for the cluewebs for speeding up lookups. (These individual files are actually larger than the clueweb source files, further motivating the use of checkpoint files.) I wonder if Joel & company would be willing to host the checkpoints?

What do the checkpoints look like, and how large are they? We could probably arrange some storage for those, I doubt it would be a problem. We're almost capped on that CloudStor repo but we can make more space if need be.

HTML parsing

If we follow the model from the cluewebs, extracting of the textual content from the html would be handled by a wrapper. However, based on how most tools so far are using ir_datasets (relying on the dataset ID and do not provide an easy way to apply wrappers), we may need to retire the wrapper approach. This dataset could be a good place to try the alternative approach of providing multiple versions of the corpus with different processing. E.g., cc-news-en and cc-news-en/extracted. I suppose the predominant case would be to want the extracted text instead of the raw HTML, so maybe cc-news-en and cc-news-en/html instead?

This sounds interesting. I wonder (and this could be a totally different issue) whether we could work in a "structured extraction" where we can preserve a series of fields (say, a stream for body text, titles, inlinks, etc)? This could be very useful for these larger web-like collections.

What do the checkpoints look like, and how large are they?

documentation on the checkpoint files are found here https://github.com/allenai/ir_datasets/blob/master/docs/clueweb_warc_checkpoints.md

For ClueWeb, they ended up being around 0.1% the size of the source files. How frequently checkpoints are taken can be tuned. I believe I took one checkpoint every 8MB for ClueWeb. Except at extreme values, it's a pretty basic trade-off between size and speed.

whether we could work in a "structured extraction" where we can preserve a series of fields (say, a stream for body text, titles, inlinks, etc)?

There's some precedent for this type of thing, e.g., wapo/v2. But it was easier there because the structured data there was provided directly by the dataset. The "right" way to process HTML has a lot of room for debate, which is why I went with the wrapper approach to begin with.

I think you're right that this topic is better as a separate issue. We can follow the ClueWeb approach as a start.

Great, we'd be happy to store checkpoints on CloudStor with the remainder of the data in that case, no problem at all!

allenai / ir_datasets

CC-News-En Support #63