Lookyloo / lookyloo

Lookyloo is a web interface that allows users to capture a website page and then display a tree of domains that call each other.
https://www.lookyloo.eu

[Feature] CERT PL phishing truncated hash HTML structure #905

Closed Rafiot closed 5 months ago

Rafiot commented 5 months ago

Is your feature request related to a problem? Please describe.

It would be useful to have some kind of fuzzy hash of the HTML content of a rendered page.

Describe the solution you'd like

Hash generation used by CERT PL:

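import logging
from hashlib import sha256
from typing import Optional, Union, cast

from bs4 import BeautifulSoup  # type: ignore

# Assumption: the snippet does not show where `log` is defined,
# so a module-level logger is used here.
log = logging.getLogger(__name__)

def sha256_128(data: bytes) -> str:
    # Keep the first 32 hex characters, i.e. 128 bits of the SHA-256 digest.
    return sha256(data).hexdigest()[:32]

def extract_tags(html: str) -> str:
    # Only tag names are kept, in document order; text and attributes are ignored.
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all()
    return "|".join(t.name for t in tags)

def html_hash(content: Union[str, bytes]) -> Optional[str]:
    try:
        if isinstance(content, bytes):
            html_content = content.decode()
        else:
            html_content = cast(str, content)

        tag_string = extract_tags(html_content)
        return sha256_128(tag_string.encode())

    except Exception as e:
        # Decoding or parsing failures are logged and reported as "no hash".
        log.warning("Error while calculating html hash: %s", e)
        return None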
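
To illustrate what this hash does and does not capture, here is a minimal sketch that uses only the standard library. It swaps BeautifulSoup for `html.parser.HTMLParser`, so it is not the CERT PL implementation and is not guaranteed to produce identical hashes on malformed HTML. The names `TagCollector` and `structure_hash` and the sample pages are hypothetical, made up for this example.

```python
from hashlib import sha256
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the name of every opening tag, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Attributes and text are deliberately ignored: only structure counts.
        self.tags.append(tag)


def structure_hash(html: str) -> str:
    """Hypothetical stdlib-only analogue of html_hash (assumption, not CERT PL code)."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    tag_string = "|".join(collector.tags)
    # Same truncation as sha256_128: first 32 hex characters (128 bits).
    return sha256(tag_string.encode()).hexdigest()[:32]


# Made-up sample pages: a and b differ only in attribute values,
# c has a different tag sequence.
page_a = "<html><body><form><input name='user'><input name='pw'></form></body></html>"
page_b = "<html><body><form><input name='login'><input name='secret'></form></body></html>"
page_c = "<html><body><form><input name='user'></form><p>Hi</p></body></html>"

hash_a = structure_hash(page_a)
hash_b = structure_hash(page_b)
hash_c = structure_hash(page_c)

print(hash_a == hash_b)  # same tag sequence, different attributes
print(hash_a == hash_c)  # different tag sequence
```

In this sketch, pages that share the same sequence of tags get the same hash regardless of text or attribute values, while adding or removing a single tag changes the hash completely. The hash therefore groups pages with an identical tag structure rather than measuring how similar two structures are.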

Describe alternatives you've considered

No response

Additional context

No response