DataDog / guarddog

:snake: :mag: GuardDog is a CLI tool to Identify malicious PyPI and npm packages
https://securitylabs.datadoghq.com/articles/guarddog-identify-malicious-pypi-packages/
Apache License 2.0
614 stars 44 forks source link

False positive: Typosquatting #71

Closed HugooB closed 1 year ago

HugooB commented 1 year ago

Hi there, first off all: awesome tools you guys made! :tada: Second, I encountered the following output when scanning a requirements.txt file:

Found 2 potentially malicious indicators in ruamel-yaml version 0.17.21

typosquatting: This package closely ressembles the following package names, and might be a typosquatting attempt: ruamel-yaml, ruamel-yaml

code-execution: found 1 source code matches
  * setup.py file executing code at ruamel.yaml-0.17.21/setup.py:955
        subprocess.check_output(cmd)

I do get why the second indicator is found, but the first one confuses me:

The package name (also installed on my machine) is ruamel.yaml. There is no package named ruamel-yaml in either my requirements nor on PyPi. Did something went from with the dots in the package name? Or is it because this package is listed in your typosquatting list as ruamel-yaml?

Thanks!

christophetd commented 1 year ago

Hi Hugo,

Thanks for reporting! It does sound abnormal, likely a confusion with the - and the . indeed. We'll look into fixing it

maciejstromich commented 1 year ago

same applies to

Found 1 potentially malicious indicators in pdfminer.six version 20221105

typosquatting: This package closely ressembles the following package names, and might be a typosquatting attempt: pdfminer-six, pdfminer-six
HugooB commented 1 year ago

Looked a bit into it: the problem arises outside this project. In the used PyPi top list, these false positives are mentioned with a - instead of a . (see ruamel.yaml). You might fix that by making changes to either the original list or by including a separated list of known false positives..

christophetd commented 1 year ago

Thanks! I think the easiest way is to normalize the package name passed as an input, and consider that scanning a.b.c is actually going to scan a-b-c. Sounds good to you?

HugooB commented 1 year ago

Seems like the easiest way to fix it indeed! However, would that not increase the number of missed typosquats? E.g., with pdfminer.six published on PyPi, a malicious actor can publish pdfminer-six. With your proposed fix, this malicious package will not be detected, right?

QuinceyJames commented 1 year ago

Hi! I don't think that should be an issue since PEP 503 states:

This PEP references the concept of a “normalized” project name. As per PEP 426 the only valid characters in a name are the ASCII alphabet, ASCII numbers, ., -, and . The name should be lowercased with all runs of the characters ., -, or replaced with a single - character.

Additionally:

Repositories MAY redirect unnormalized URLs to the canonical normalized URL (e.g. /Foobar/ may redirect to /foobar/), however clients MUST NOT rely on this redirection and MUST request the normalized URL.

You'll notice PyPi does redirect all of these to the same site:

But more importantly, PIP does not rely on this redirection and gets the normalized URL thanks to pip._internal.utils.packaging. These utilities can be reused by importing packaging and normalizing the name before running the typosquat algorithm.

HugooB commented 1 year ago

Great job @QuinceyJames , thanks for the detailed information!

christophetd commented 1 year ago

Will be released shortly, thanks all!

christophetd commented 1 year ago

Released as part of v0.1.10 https://github.com/DataDog/guarddog/releases/tag/v0.1.10