You can't compare this with tldextract. tldextract extracts more data correctly, while this domain scanner can't

vihaanmody1 commented 2 years ago

tldextract extracts IPs and handles URLs that include an HTTP scheme, while this extractor can't. Speed doesn't matter in this case. What matters is the correctness of the data scraped.

wavenator commented 2 years ago

Thank you for your comment, @Vihaanmody21. TLDextract is a reliable library that performs well. Additionally, it supports a few more use cases than this library. We extract billions of domains a day for our internal use case, so performance is crucial: our security product has to take action in a timely manner.

In our library, we aim to decompose domains into their constituent parts, and nothing else.

To address the correctness argument, I would love to get as much data as possible that points to the issues. I will do everything I can to resolve it.

Thank you!

vihaanmody1 commented 1 year ago

Hello @wavenator

PyDomainExtractor is a great tool for extracting domains. But if a URL has a scheme in it, it won't work, while TLDextract can extract from URLs with schemes and from IPs.

wavenator commented 1 year ago

I completely agree with your viewpoint. TLDExtract is an excellent library designed to extract domains from diverse sources and data formats, which is not within our scope to address.

Regarding the schemes and IPs, could you please provide some examples of the ones you would like us to extract but are currently not supported? This way, we can keep track of them and consider incorporating them in the future.

elliotwutingfeng commented 9 months ago

Here is a workaround wrapper function that can handle URLs with and without scheme, with and without port/path (at the cost of slower execution time), similar to tldextract.

import pydomainextractor
import tldextract  # for benchmarks later

pde = pydomainextractor.DomainExtractor()

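def extract(s):
    # Fast path: a bare domain needs no URL parsing.
    if pde.is_valid_domain(s):
        return pde.extract(s)
    try:
        # URL with a scheme, e.g. "https://host:port/path".
        return pde.extract_from_url(s)
    except ValueError:
        # Schemeless input: prefix "//" and parse a second time.
        return pde.extract_from_url("//" + s)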

print(extract("https://a.b.c.example.com.sg"))
print(extract("https://a.b.c.example.com.sg:5000"))
print(extract("https://a.b.c.example.com.sg/path"))
print(extract("https://a.b.c.example.com.sg:5000/path"))

print(extract("a.b.c.example.com.sg"))
print(extract("a.b.c.example.com.sg:5000"))
print(extract("a.b.c.example.com.sg/path"))
print(extract("a.b.c.example.com.sg:5000/path"))

# {'suffix': 'com.sg', 'domain': 'example', 'subdomain': 'a.b.c'}
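
The wrapper above falls back to a second extract_from_url call whenever the first one raises ValueError. A minimal sketch of an alternative is to decide up front whether the input starts with a scheme, so that only one parse is attempted. This assumes extract_from_url accepts "//"-prefixed input in the same way the wrapper's fallback does. `with_netloc_marker` is a hypothetical helper name, not part of pydomainextractor, and whether this is actually faster has not been measured here.

```python
import re

# A URI scheme (letter, then letters/digits/"+"/"."/"-") followed by "://",
# anchored at the start so a "://" inside a query string does not count.
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")


def with_netloc_marker(s):
    """Return s unchanged if it starts with a scheme, else prefix it with '//'."""
    return s if _SCHEME_RE.match(s) else "//" + s


# Hypothetical usage with the extractor from the wrapper above:
#     pde.extract_from_url(with_netloc_marker(s))
```

The check is anchored at the start of the string so that a schemeless input such as example.com/?u=http://x.y is still treated as schemeless.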

Benchmarks

A Rust-based parser vastly outperforms a pure Python parser in CPython. However, for handling ambiguous input, tldextract running on PyPy can match the speed of pydomainextractor running on CPython.
It should be noted that the wrapper function attempts to parse a.b.c.example.com.sg:5000/a/b/c (schemeless, but with path) twice, hence the abnormally slow timing of 631 ns. Further optimization on the Rust side could possibly eliminate this bottleneck.
pydomainextractor does not work with IPv4 or IPv6 addresses, while tldextract handles both.
pydomainextractor doesn't perform well on PyPy: the same three calls take roughly 3.6 to 3.8 µs there, versus 277 to 631 ns on CPython.
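
Since pydomainextractor does not handle IP addresses, one option is to detect them before calling the extractor. The following is a minimal sketch using only the Python standard library (`ipaddress` and `urllib.parse`). `host_as_ip` is a hypothetical helper name, and the sketch makes no claim about how tldextract detects IPs internally.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Scheme followed by "://", anchored at the start of the string.
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")


def host_as_ip(s):
    """Return an ipaddress object if the host part of s is an IP literal, else None."""
    # A bare address such as "10.0.0.1" or "2001:db8::1" needs no URL parsing.
    try:
        return ipaddress.ip_address(s)
    except ValueError:
        pass
    try:
        # urlsplit only recognises a host after "//", so add it for schemeless input.
        host = urlsplit(s if _SCHEME_RE.match(s) else "//" + s).hostname
    except ValueError:
        # Raised for malformed input such as an unbalanced IPv6 bracket.
        return None
    if not host:
        return None
    try:
        return ipaddress.ip_address(host)
    except ValueError:
        return None
```

A caller could route the input to a separate IP-handling path when `host_as_ip(s)` returns something other than None, and pass it to the extractor otherwise.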

pydomainextractor

%timeit extract("https://a.b.c.example.com.sg:5000/a/b/c")
%timeit extract("a.b.c.example.com.sg:5000/a/b/c")
%timeit extract("a.b.c.example.com.sg")

# CPython 3.11.7
# 277 ns ± 1.56 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
# 631 ns ± 8.84 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
# 411 ns ± 2.02 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

# PyPy 3.9.18
# 3.63 µs ± 203 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
# 3.84 µs ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
# 3.56 µs ± 258 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

tldextract

%timeit tldextract.extract("https://a.b.c.example.com.sg:5000/a/b/c")
%timeit tldextract.extract("a.b.c.example.com.sg:5000/a/b/c")
%timeit tldextract.extract("a.b.c.example.com.sg")

# CPython 3.11.7
# 2.13 µs ± 8.73 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
# 1.79 µs ± 12.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
# 1.78 µs ± 7.46 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

# PyPy 3.9.18
# 338 ns ± 1.49 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
# 231 ns ± 0.647 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
# 232 ns ± 2.52 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
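
%timeit is an IPython magic, so the figures above cannot be reproduced as-is in a plain Python script. A rough sketch of an equivalent using the standard-library `timeit` module follows. `bench` and `stand_in` are hypothetical names; `stand_in` is only a placeholder workload so the sketch runs without either library installed, and the lambda adds a small call overhead that %timeit does not have.

```python
import statistics
import timeit


def bench(func, arg, repeat=7, number=100_000):
    """Return (mean, stdev) of the per-call time in nanoseconds over `repeat` runs."""
    totals = timeit.repeat(lambda: func(arg), repeat=repeat, number=number)
    per_call_ns = [t / number * 1e9 for t in totals]
    return statistics.mean(per_call_ns), statistics.stdev(per_call_ns)


def stand_in(s):
    # Placeholder workload; swap in the wrapper `extract` or
    # `tldextract.extract` to measure the real libraries.
    return s.rpartition(".")


mean_ns, stdev_ns = bench(stand_in, "a.b.c.example.com.sg", number=10_000)
print(f"{mean_ns:.0f} ns ± {stdev_ns:.2f} ns per loop")
```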
