Thanks, @covuworie! I can confirm the issue. It's caused by two factors:

- no and os.hordaland.no are ICANN suffixes in the public suffix list, but hordaland.no is not
- in the input there is also os.hordaland.no (in reverse domain name notation):

334975755 no.hordaland
...
334975765 no.hordaland.os
334975766 no.hordaland.os.bibliotek
334975767 no.hordaland.oygarden
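For illustration (the library is not mentioned in the thread, so treat this as an assumption): the asymmetry can be reproduced with the tldextract Python package, which consults the same public suffix list:

```python
# Sketch: show how the public suffix list splits these names.
# tldextract downloads and caches the suffix list on first use.
import tldextract

# ICANN section only (the default): "os.hordaland.no" is a listed suffix,
# "hordaland.no" is not.
extract = tldextract.TLDExtract(include_psl_private_domains=False)

print(extract("oygarden.hordaland.no").registered_domain)
# expected: "hordaland.no" (matched suffix is "no")
print(extract("bibliotek.os.hordaland.no").registered_domain)
# expected: "bibliotek.os.hordaland.no" (matched suffix is "os.hordaland.no")
```

So in reversed notation, hosts under hordaland.no fold to no.hordaland, except those under the os.hordaland.no suffix, which keep one more label.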
The assumption is that a sorted list of host names in reversed domain name notation can be "folded" to the level of registered domains with constant memory, only remembering the latest domain name (no need to keep all 90 million domain names in memory). Unfortunately, switching from hordaland.no to bibliotek.os.hordaland.no and back is an edge case that needs to be handled, as the sketch below shows.
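A minimal sketch of that folding logic, where reg_domain() is a hypothetical stand-in for the real public-suffix lookup and only the two relevant suffixes are listed:

```python
# Reversed ICANN suffixes relevant to this example (assumption: a tiny subset).
SUFFIXES = {"no", "no.hordaland.os"}

def reg_domain(host):
    """Reversed registered domain: longest listed suffix plus one more label."""
    labels = host.split(".")
    for i in range(len(labels), 0, -1):
        if ".".join(labels[:i]) in SUFFIXES:
            return ".".join(labels[:min(i + 1, len(labels))])
    return host

def fold(sorted_hosts):
    """Fold sorted reversed host names to domains, keeping only the latest one."""
    last = None
    for host in sorted_hosts:
        domain = reg_domain(host)
        if domain != last:  # misses a duplicate when the same domain reappears later
            yield domain
            last = domain

hosts = ["no.hordaland", "no.hordaland.os",
         "no.hordaland.os.bibliotek", "no.hordaland.oygarden"]
print(list(fold(hosts)))
# ['no.hordaland', 'no.hordaland.os', 'no.hordaland.os.bibliotek', 'no.hordaland']
# 'no.hordaland' is emitted twice: the duplicate vertex.
```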
I'll fix this in the code that folds the host-level graph into the domain-level one (but likely not for existing graphs). And yes, I'll eventually also add a verification step that checks whether there are duplicated vertices. Thanks again!
Thanks, @sebastian-nagel, for your quick response on this issue! I look forward to seeing the updates.
Fixed in 6b4be52 and f663b5c: it is now ensured that domain names are strictly sorted lexicographically; strict sorting does not allow for duplicates. This can be verified by running:
zcat cc-main-2022-may-jun-aug-domain-vertices.txt.gz | cut -f2 | LC_ALL=C sort -uc
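For reference, a rough Python equivalent of this check (an illustration, not part of the thread), streaming the file with constant memory; byte-wise comparison mirrors LC_ALL=C:

```python
# Verify that the second column (reversed domain names) is strictly increasing.
import gzip

last = b""
with gzip.open("cc-main-2022-may-jun-aug-domain-vertices.txt.gz", "rb") as f:
    for line in f:
        name = line.rstrip(b"\n").split(b"\t")[1]
        assert name > last, f"not strictly sorted at {name!r}"
        last = name
```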
Hi,
I noticed duplicate rows in the Common Crawl Oct/Nov/Jan 2021-2022 domain-level webgraph. You can find them by unzipping the file and running the command:
This leads to the following output:
This is the only duplicate I found in the entire file.
You probably want to add a sanity check for this in whatever your preferred language is. This is how I found the issue in the first place! I used Python with pandas:
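A minimal sketch of such a check (the exact snippet from the report isn't shown here; the file name and the two-column, tab-separated layout are assumptions):

```python
import pandas as pd

# Vertex file: one row per vertex, tab-separated: <id> <reversed domain name>
vertices = pd.read_csv(
    "cc-main-2021-22-oct-nov-jan-domain-vertices.txt.gz",
    sep="\t", header=None, names=["id", "name"],
)

# Any name occurring more than once indicates a duplicate vertex.
duplicates = vertices[vertices["name"].duplicated(keep=False)]
print(duplicates)
```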
Pandas takes several minutes for some of these operations because it loads the full file into memory. You may want to try vaex or something similar if using Python; I'm sure there are similar tools in Java and other languages. You could of course do the check in other ways, but I thought I'd provide the code in case it's useful.
Note that I haven't yet checked whether these duplicate nodes are present in the other webgraph files, but I suspect they probably are, so you probably want to run similar sanity checks on those. This should improve the quality of what is an excellent dataset. Thanks very much for making this available!