Intsights / PyDomainExtractor

A blazingly fast domain extraction library written in Rust
MIT License

You can't compare this with tldextract. tldextract extracts more data correctly, while this domain scanner can't #16

Open vihaanmody1 opened 2 years ago

vihaanmody1 commented 2 years ago

tldextract handles IP addresses and URLs that include an HTTP scheme, while this extractor can't. Speed doesn't matter in this case. What matters is the correctness of the extracted data.

wavenator commented 2 years ago

Thank you for your comment, @Vihaanmody21. tldextract is a reliable library that performs well, and it supports a few more use cases than this library. We extract billions of domains a day for our internal use case, so performance is crucial: our security product has to act on the results in a timely manner.

In our library, we aim to decompose domains into their constituent parts (subdomain, domain, and suffix), and nothing else.

To address the correctness argument, I would love to get as much data as possible that points to the issues. I will do everything I can to resolve it.

Thank you!

vihaanmody1 commented 1 year ago

Hello @wavenator

PyDomainExtractor is a great tool for extracting domains. But if a URL has a scheme in it, it won't work, whereas tldextract can handle URLs with schemes as well as IPs.

wavenator commented 1 year ago

I completely agree with your viewpoint. TLDExtract is an excellent library designed to extract domains from diverse sources and data formats, which is not within our scope to address.

Regarding the schemes and IPs, could you please provide some examples of the ones you would like us to extract but are currently not supported? This way, we can keep track of them and consider incorporating them in the future.

elliotwutingfeng commented 7 months ago

Here is a workaround wrapper function that can handle URLs with and without scheme, with and without port/path (at the cost of slower execution time), similar to tldextract.

import pydomainextractor
import tldextract  # for benchmarks later

pde = pydomainextractor.DomainExtractor()

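def extract(s):
    # Fast path: the input is already a bare domain
    if pde.is_valid_domain(s):
        return pde.extract(s)
    try:
        # Input with a scheme, e.g. "https://host:port/path"
        return pde.extract_from_url(s)
    except ValueError:
        # No scheme: retry with a "//" prefix so the input is parsed as a URL
        return pde.extract_from_url("//" + s)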

print(extract("https://a.b.c.example.com.sg"))
print(extract("https://a.b.c.example.com.sg:5000"))
print(extract("https://a.b.c.example.com.sg/path"))
print(extract("https://a.b.c.example.com.sg:5000/path"))

print(extract("a.b.c.example.com.sg"))
print(extract("a.b.c.example.com.sg:5000"))
print(extract("a.b.c.example.com.sg/path"))
print(extract("a.b.c.example.com.sg:5000/path"))

# Output reported for the calls above:
# {'suffix': 'com.sg', 'domain': 'example', 'subdomain': 'a.b.c'}
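
For readers wondering why the `"//" + s` fallback helps, here is a minimal sketch using only Python's standard `urllib.parse`. It is not how PyDomainExtractor parses URLs internally (that parser is written in Rust and may behave differently), and `host_of` is a hypothetical helper name used only for illustration. The idea it shows is that a URL parser looks for a host only after `//`, so scheme-less input needs that prefix before a host can be found.

```python
from urllib.parse import urlsplit


def host_of(s):
    """Return the hostname of s, whether or not s carries a scheme.

    Hypothetical helper for illustration; uses urllib.parse, not PyDomainExtractor.
    """
    if "://" not in s:
        # No scheme: prefix "//" so the text is parsed as a network location
        s = "//" + s
    return urlsplit(s).hostname


# Without the "//" prefix, the standard parser finds no host at all
print(urlsplit("a.b.c.example.com.sg:5000/path").hostname)

# With the helper, port and path are stripped and the host is returned
for s in (
    "https://a.b.c.example.com.sg:5000/path",
    "a.b.c.example.com.sg:5000/path",
    "a.b.c.example.com.sg",
):
    print(host_of(s))
```

The resulting hostname could then be passed to a suffix-splitting call such as `pde.extract`, which is roughly the division of labour the wrapper above relies on.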

Benchmarks

pydomainextractor
%timeit extract("https://a.b.c.example.com.sg:5000/a/b/c")
%timeit extract("a.b.c.example.com.sg:5000/a/b/c")
%timeit extract("a.b.c.example.com.sg")

# CPython 3.11.7
# 277 ns ± 1.56 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
# 631 ns ± 8.84 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
# 411 ns ± 2.02 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

# PyPy 3.9.18
# 3.63 µs ± 203 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
# 3.84 µs ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
# 3.56 µs ± 258 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

tldextract
%timeit tldextract.extract("https://a.b.c.example.com.sg:5000/a/b/c")
%timeit tldextract.extract("a.b.c.example.com.sg:5000/a/b/c")
%timeit tldextract.extract("a.b.c.example.com.sg")

# CPython 3.11.7
# 2.13 µs ± 8.73 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
# 1.79 µs ± 12.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
# 1.78 µs ± 7.46 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

# PyPy 3.9.18
# 338 ns ± 1.49 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
# 231 ns ± 0.647 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
# 232 ns ± 2.52 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
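
The `%timeit` lines above are IPython magics. To repeat the measurements from a plain Python script, something like the following sketch with the standard `timeit` module could be used. `bench` is a hypothetical helper, and `str.lower` is only a stand-in so the snippet runs without third-party packages; replace it with `extract` or `tldextract.extract` to measure the real functions. The default loop counts are assumptions chosen to resemble the runs shown above.

```python
import statistics
import timeit


def bench(func, arg, number=100_000, repeat=7):
    """Time func(arg) and return (mean, std dev) per call in nanoseconds.

    Hypothetical helper: runs `repeat` batches of `number` calls each.
    """
    totals = timeit.repeat(lambda: func(arg), number=number, repeat=repeat)
    per_call_ns = [t / number * 1e9 for t in totals]
    return statistics.mean(per_call_ns), statistics.stdev(per_call_ns)


# Stand-in workload; swap in the extraction function you want to measure
mean_ns, std_ns = bench(str.lower, "A.B.C.EXAMPLE.COM.SG")
print(f"{mean_ns:.0f} ns ± {std_ns:.1f} ns per loop")
```

The lambda adds a small constant call overhead to every measurement, so the numbers are best used to compare functions measured the same way, not as absolute costs.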