DavidNemeskey / cc_corpus

Tools for compiling corpora from Common Crawl
GNU Lesser General Public License v3.0

Download specific URLs #46

Closed · botondbarta closed this issue 1 year ago

botondbarta commented 1 year ago

The old script supported downloading index entries that matched certain queries. The current downloader, however, only downloads pages under a specific TLD. Would it be possible to modify it to allow downloading sites that match a given URL pattern, such as bbc.com/*?

I also found a bug: everything downloaded by get_indexfiles.py is saved under the name domain-hu-{collection}.gz.

Finally, making it possible to download every collection from Common Crawl in one run would be a good idea.

DavidNemeskey commented 1 year ago

@acheronw Let's formalize this:

  1. Let's allow arbitrary domains with an optional single wildcard at the beginning (which can simply be cut off). Let's forget about the path component for now; bbc.com/* can be served simply with the bbc.com domain pattern.
  2. Let's allow multiple patterns to be given to the script.
    1. Rename the command line argument, since it is not only TLDs that we accept anymore. Also allow it to be specified more than once.
    2. Add another option that lets the user specify a batch of patterns in a file (one per line). This should be mutually exclusive with the former one.
    3. This should still be a single download, so order the patterns alphabetically, assemble the files to download for each, and do the filtering for all of them at the same time (maybe by creating one big regex from all the patterns); see the sketch after this list.
  3. Naming bug (line 110): it should have been args.tld instead of hu, but now that we allow domains, it should be different again. Check out the old cdx-index-client.py around line 291.
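
A minimal sketch of how points 1 and 2 could look, assuming hypothetical option names (--pattern, --pattern-file) and helper functions that are not part of the current script:

```python
import argparse
import re


def parse_args():
    parser = argparse.ArgumentParser(
        description='Downloads index files for the given domain patterns.')
    group = parser.add_mutually_exclusive_group(required=True)
    # Repeatable single-pattern option (replaces the TLD-only argument)
    group.add_argument('--pattern', '-p', action='append', default=[],
                       help='a domain pattern, e.g. bbc.com or *.bbc.com; '
                            'can be specified more than once')
    group.add_argument('--pattern-file', '-P',
                       help='a file with one domain pattern per line')
    return parser.parse_args()


def collect_patterns(args):
    if args.pattern_file:
        with open(args.pattern_file, 'rt', encoding='utf-8') as inf:
            patterns = [line.strip() for line in inf if line.strip()]
    else:
        patterns = args.pattern
    # Cut off the optional leading wildcard and order the patterns
    # alphabetically so the index can be processed in a single pass
    return sorted(p.lstrip('*.') for p in patterns)


def build_filter(patterns):
    # One big regex: a line matches if its SURT starts with one of the
    # reversed domains (bbc.com -> com,bbc), followed by ',' or ')'
    surts = (','.join(reversed(p.split('.'))) for p in patterns)
    return re.compile('^(?:' + '|'.join(re.escape(s) for s in surts) + ')[,)]')
```

A single compiled regex keeps the per-line filtering cost roughly constant regardless of how many patterns are given.
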
DavidNemeskey commented 1 year ago

@Baaart25 I do not intend to support downloading every collection. However, there's find_new_dumps.py, which you can use to get the names of the collections (that you haven't downloaded already). All you need then is to add a loop around it (I do it with ipython + os.system).
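
A rough sketch of such a loop, assuming (hypothetically) that find_new_dumps.py prints one collection name per line and that the downloader takes the collection name as a command-line argument:

```python
import os
import subprocess

# Assumption: find_new_dumps.py prints the names of the not-yet-downloaded
# collections, one per line
result = subprocess.run(['python', 'find_new_dumps.py'],
                        capture_output=True, text=True, check=True)

for collection in result.stdout.split():
    # Hypothetical invocation; substitute the real downloader script and options
    os.system(f'python get_indexfiles.py --collection {collection}')
```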

acheronw commented 1 year ago

Houston, we have a problem.

I tested the current downloader on the newest CC batch, and it is not working because of changes in the format.

Instead of a line starting with the TLD followed by a comma, it now has a closing parenthesis and a slash: hu)/1029 20230322135339 {"url": "https://www3.hu/1029/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "SKNX4KCV5V6FN6VFN2EM27T4BLTF27BQ", "length": "28807", "offset": "1198745015", "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296943809.76/warc/CC-MAIN-20230322114226-20230322144226-00009.warc.gz", "charset": "UTF-8", "languages": "hun,eng"}

Because of this, the following line in the find_tld_in_index method in index.py does not yield the TLD of the current line: ctld = cluster.surt[:cluster.surt.find(',')]

and thus we never satisfy this condition: if ctld == tld:

So only the first cluster is found for any match.

Then line matching also fails, because in get_indexfiles.py#main we match for: re.compile(f'^{args.pattern},')

and that does not match lines in the latest format.

So we find only the first cluster and then we discard every line in it...
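
To make the failure concrete, here is a small snippet that mirrors the two quoted checks on the two line shapes from the example above (an illustration of the symptom, not code from the repository):

```python
import re

tld = 'hu'
pattern = re.compile(f'^{tld},')   # mirrors the filter in get_indexfiles.py#main

regular_line = 'hu,www3)/1029 20230322135339 {"url": "https://www3.hu/1029/"}'
problem_line = 'hu)/1029 20230322135339 {"url": "https://www3.hu/1029/"}'

for line in (regular_line, problem_line):
    surt = line.split(' ', 1)[0]
    # What find_tld_in_index does: take everything up to the first comma
    ctld = surt[:surt.find(',')]
    print(ctld == tld, bool(pattern.match(line)))

# regular_line: True True
# problem_line: False False -- surt.find(',') returns -1, so the slice chops
# off the last character instead, and the '^hu,' regex does not match either
```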

acheronw commented 1 year ago

Update:

The situation is not as bad as it seemed. Only certain lines in the Common Crawl index are messed up.

There is a Hungarian domain, "www3.hu", which confused Common Crawl's parser. They discarded the www3 part and kept only the hu part of the domain, put it right at the front of the *.hu list, and thus confused our parser as well:

Common Crawl parsed it like this:

hu)/1029 20230322135339 {"url": "https://www3.hu/1029/"

but it should have been parsed like this: hu,www3)/1029 20230322135339 {"url": "https://www3.hu/1029/"

So we will need a workaround for this.
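
One possible workaround (a sketch only, not necessarily the fix that was eventually implemented): treat the closing parenthesis as an alternative delimiter when extracting the first SURT label, so that the malformed hu)/1029 entry still yields hu:

```python
import re

# Split the SURT on whichever comes first: a comma or the closing parenthesis
SURT_DELIMITER = re.compile(r'[,)]')


def first_surt_label(surt: str) -> str:
    """Returns the first SURT label (the TLD), even for malformed entries."""
    return SURT_DELIMITER.split(surt, maxsplit=1)[0]


assert first_surt_label('hu,www3)/1029') == 'hu'   # well-formed entry
assert first_surt_label('hu)/1029') == 'hu'        # the malformed www3.hu entry
```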

Note: most likely this site has always been treated like this by the Common Crawl indexing, but this is the first time that one of its URLs got into a position where it became the first in a cluster of 3000 URLs, and thus appeared in the index.

DavidNemeskey commented 1 year ago

Solved by #47.

@Baaart25 Let us know if this solves your problem.