Closed botondbarta closed 1 year ago
@acheronw Let's formalize this: `bbc.com/*` can be served simply with the `bbc.com` domain pattern. Use `args.tld` instead of `hu`; but now that we are allowing domains, it should be different. Check out the old `cdx-index-client.py` around line 291.

@Baaart25 I do not intend to support downloading every collection. However, there's `find_new_dumps.py`, which you can use to get the names of the collections (that you haven't downloaded already). All you need then is to add a loop around it (I do it with ipython + `os.system`).
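A minimal sketch of such a loop, using `subprocess` instead of `os.system`. The file name `new_dumps.txt` and the way `get_indexfiles.py` takes its arguments are assumptions here; adapt them to how `find_new_dumps.py` actually emits collection names and to the flags your downloader expects.

```python
import subprocess

def read_collections(path):
    """Read one collection name per line (e.g. CC-MAIN-2023-14), skipping blanks."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def download_all(collections):
    """Run the downloader once per collection (hypothetical invocation)."""
    for collection in collections:
        subprocess.run(['python', 'get_indexfiles.py', collection], check=True)
```
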
Houston, we have a problem.

I tested the current downloader on the newest CC batch, and it's not working because of changes in the format. Instead of a line starting with the TLD followed by a comma, it now has the TLD followed directly by a closing parenthesis and a slash:
```
hu)/1029 20230322135339 {"url": "https://www3.hu/1029/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "SKNX4KCV5V6FN6VFN2EM27T4BLTF27BQ", "length": "28807", "offset": "1198745015", "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296943809.76/warc/CC-MAIN-20230322114226-20230322144226-00009.warc.gz", "charset": "UTF-8", "languages": "hun,eng"}
```
Because of this, the following line in the `index.py#find_tld_in_index` method does not yield the TLD of the current line:

```python
ctld = cluster.surt[:cluster.surt.find(',')]
```
and thus we never satisfy this condition:

```python
if ctld == tld:
```
So only the first cluster is found for any match.
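A quick way to see why the comma-based slice misbehaves on the new lines: `str.find` returns -1 when there is no comma, so the slice becomes `surt[:-1]` and silently drops only the last character instead of extracting the TLD. A small demonstration (the helper name is just for illustration):

```python
def extract_tld(surt):
    # Mirrors index.py's: ctld = cluster.surt[:cluster.surt.find(',')]
    return surt[:surt.find(',')]

# Old-style SURT key: works as intended.
assert extract_tld('hu,example)/index.html') == 'hu'

# Malformed new line: find() returns -1, so the slice is surt[:-1],
# i.e. everything except the last character -- not the TLD at all.
assert extract_tld('hu)/1029') == 'hu)/102'
```
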
Then line matching also fails, because in `get_indexfiles.py#main` we match against:

```python
re.compile(f'^{args.pattern},')
```

and that's not good for the latest format.
So we find only the first cluster and then we discard every line in it...
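The same failure is easy to reproduce for the regex: the pattern is anchored and requires a trailing comma, which the malformed lines lack. (Here `'hu'` stands in for `args.pattern`; the original code interpolates the pattern directly, as shown.)

```python
import re

pattern = 'hu'  # stands in for args.pattern
matcher = re.compile(f'^{pattern},')

assert matcher.match('hu,www3)/1029 20230322135339 ...')  # old format: matches
assert not matcher.match('hu)/1029 20230322135339 ...')   # malformed line: no comma, no match
```
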
Update:
The situation is not as bad as it seemed. Only certain lines in the Common Crawl index are messed up.

There is a Hungarian domain, `www3.hu`, which confused Common Crawl's parser. They discarded the `www3` part and kept only the `hu` part of the domain, putting it right at the front of the `*.hu` list, which in turn confused our parser as well.
Common Crawl parsed it like this:

```
hu)/1029 20230322135339 {"url": "https://www3.hu/1029/"
```
but they should have parsed it like this:

```
hu,www3)/1029 20230322135339 {"url": "https://www3.hu/1029/"
```
So we will need a workaround for this.
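One possible workaround, sketched below (the actual fix that landed in the repo may differ): treat either the first comma or the first closing parenthesis as the end of the TLD, so both well-formed and malformed SURT keys yield the right value. The function name is hypothetical.

```python
import re

# The TLD is everything before the first ',' OR ')' in the SURT key,
# so 'hu,www3)/...' and the malformed 'hu)/...' both yield 'hu'.
_TLD_RE = re.compile(r'^([^,)]+)[,)]')

def extract_tld_safe(surt):
    m = _TLD_RE.match(surt)
    return m.group(1) if m else None

assert extract_tld_safe('hu,www3)/1029') == 'hu'
assert extract_tld_safe('hu)/1029') == 'hu'
```
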
Note: most likely this site has always been indexed this way by Common Crawl, but this is the first time one of its URLs reached a position where it became the first in a cluster of 3000 URLs, and thus appeared in the index.
Solved by #47.
@Baaart25 let us know if this solves your problem.
The old script supported downloading indexes that matched certain queries. However, the current downloader only downloads pages matching a specific TLD. Would it be possible to modify it to allow downloading sites that match certain URL patterns, such as `bbc.com/*`?

Also found a bug: everything downloaded by `get_indexfiles.py` is saved under the name `domain-hu-{collection}.gz`.

Making it possible to download every collection from Common Crawl in one run would also be a good idea.