commoncrawl / cc-webgraph

Tools to construct and process webgraphs from Common Crawl data
Apache License 2.0

Duplicate nodes in domain-level webgraph #3

Closed covuworie closed 2 years ago

covuworie commented 2 years ago

Hi,

I noticed duplicate rows in the Common Crawl Oct/Nov/Jan 2021-2022 domain-level webgraph. You can find it by unzipping the file and running the command:

grep -e $'\tno\.hordaland\t' cc-main-2021-22-oct-nov-jan-domain-ranks.txt

This leads to the following output:

720700  1.4451489E7     273019  1.4683511019933418E-7   no.hordaland    1
46550260        1.0575547E7     79735653        4.063660522732166E-9    no.hordaland    1

This is the only duplicate I found in the entire file.

You probably want to add a sanity check for this in your preferred language. This is how I found the issue in the first place! I used Python with pandas:

import pandas as pd

df = pd.read_csv(
    "./dataset/cc-main-2021-22-oct-nov-jan-domain-ranks.txt.gz",
    compression="gzip",
    sep="\t")

# is_unique is a property, not a method, so calling it raises a TypeError
# print(df["#host_rev"].is_unique())

# using unique instead works
print(len(df) == len(df["#host_rev"].unique()))

# to see only the non-unique values
print(df["#host_rev"].value_counts().loc[lambda s: s > 1])

Pandas takes several minutes for some of these operations because it loads the full file into memory. You may want to try vaex or something similar if using Python; I'm sure there are similar tools in Java and other languages. You could of course do the check in other ways, but I thought I'd provide the code in case it's useful.
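If memory is a concern, a streaming check avoids pandas entirely. This is only a sketch: the column index assumes the tab-separated layout shown in the grep output above, with the reversed domain name in the fifth field.

```python
import gzip

def find_duplicate_names(path, name_col=4):
    """Stream a gzipped ranks file and collect duplicated names.

    Assumes a tab-separated layout with the reversed domain name
    in column `name_col` (0-based), as in the grep output above.
    """
    seen = set()
    dupes = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        next(f)  # skip the header line
        for line in f:
            name = line.rstrip("\n").split("\t")[name_col]
            if name in seen:
                dupes.append(name)
            else:
                seen.add(name)
    return dupes
```

This still keeps one set of all names in memory, but skips the per-column parsing and dataframe overhead, so it should be considerably lighter than loading the whole table.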

Note that I haven't yet checked whether these duplicate nodes are present in the other webgraph files, but I suspect they are, so you probably want similar sanity checks there. This should improve the quality of what is an excellent dataset. Thanks very much for making it available!

sebastian-nagel commented 2 years ago

Thanks, @covuworie! I can confirm the issue. It's caused by two factors:

  1. no and os.hordaland.no are ICANN suffixes in the public suffix list, but hordaland.no is not

  2. the input also contains os.hordaland.no and hosts below it (listed in reversed domain name notation):

    334975755   no.hordaland
    ...
    334975765   no.hordaland.os
    334975766   no.hordaland.os.bibliotek
    334975767   no.hordaland.oygarden

The assumption is that a sorted list of host names in reversed domain name notation can be "folded" to the level of registered domains with constant memory, remembering only the most recently emitted domain name (no need to keep all 90 million domain names in memory). Unfortunately, switching from hordaland.no to bibliotek.os.hordaland.no and back again is an edge case that must be handled.
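The folding logic and its edge case can be sketched in Python. This is a toy illustration, not the project's code: the `SUFFIXES` set and `registered_domain` helper stand in for the real public-suffix lookup.

```python
# Toy suffix set in reversed notation; the real code consults the
# public suffix list, where "no" and "os.hordaland.no" are suffixes
# but "hordaland.no" is not.
SUFFIXES = {"no", "no.hordaland.os"}

def registered_domain(host_rev):
    """Longest matching public suffix plus one label (reversed notation)."""
    labels = host_rev.split(".")
    best = None
    for i in range(1, len(labels)):
        if ".".join(labels[:i]) in SUFFIXES:
            best = ".".join(labels[:i + 1])
    return best or host_rev

def fold_constant_memory(sorted_hosts):
    """Fold sorted hosts to registered domains, remembering only the
    most recently emitted domain (the assumption described above)."""
    out, last = [], None
    for host in sorted_hosts:
        domain = registered_domain(host)
        if domain != last:
            out.append(domain)
            last = domain
    return out

hosts = ["no.hordaland", "no.hordaland.os.bibliotek", "no.hordaland.oygarden"]
# The switch to the os.hordaland.no suffix and back defeats the
# single-item memory: "no.hordaland" is emitted twice.
```

Deduplicating correctly either requires remembering all emitted domains or, as in the eventual fix, enforcing a strict sort on the output so duplicates are detected there.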

I'll fix this in the code that folds the host-level graph into the domain-level one (but likely not for existing graphs), and eventually also add a verification step that checks for duplicated vertices. Thanks again!

covuworie commented 2 years ago

Thanks @sebastian-nagel for your quick response on this issue! I look forward to seeing the updates.

sebastian-nagel commented 2 years ago

Fixed in 6b4be52 and f663b5c: the code now ensures that domain names are strictly sorted lexicographically, and strict sorting does not allow duplicates. This can be verified by running:

zcat cc-main-2022-may-jun-aug-domain-vertices.txt.gz | cut -f2 | LC_ALL=C sort -uc
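The same strict-ordering check can be sketched in Python for readers without the GNU toolchain (a hypothetical helper; it assumes the vertices file is tab-separated with the domain name in the second field, as the `cut -f2` above implies):

```python
import gzip

def is_strictly_sorted(path, name_col=1):
    """Check that names in column `name_col` of a gzipped, tab-separated
    file are strictly increasing byte-wise (the LC_ALL=C sort order).
    Strict ordering also rules out duplicates, mirroring `sort -uc`."""
    prev = None
    with gzip.open(path, "rb") as f:
        for line in f:
            name = line.rstrip(b"\n").split(b"\t")[name_col]
            if prev is not None and name <= prev:
                return False
            prev = name
    return True
```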