DataDog / guarddog

GuardDog is a CLI tool to identify malicious PyPI and npm packages
https://securitylabs.datadoghq.com/articles/guarddog-identify-malicious-pypi-packages/
Apache License 2.0

GuardDog


GuardDog is a CLI tool that identifies malicious PyPI and npm packages or Go modules. It runs a set of heuristics on the package source code (through Semgrep rules) and on the package metadata.

GuardDog can be used to scan local or remote PyPI and npm packages or Go modules using any of the available heuristics.


Getting started

Installation

pip install guarddog

Or use the Docker image:

docker pull ghcr.io/datadog/guarddog
alias guarddog='docker run --rm ghcr.io/datadog/guarddog'

Note: On Windows, the only supported installation method is Docker.

Sample usage

# Scan the most recent version of the 'requests' package
guarddog pypi scan requests

# Scan a specific version of the 'requests' package
guarddog pypi scan requests --version 2.28.1

# Scan the 'requests' package using 2 specific heuristics
guarddog pypi scan requests --rules exec-base64 --rules code-execution

# Scan the 'requests' package using all rules but one
guarddog pypi scan requests --exclude-rules exec-base64

# Scan a local package archive
guarddog pypi scan /tmp/triage.tar.gz

# Scan a local package directory
guarddog pypi scan /tmp/triage/

# Scan every package referenced in a requirements.txt file of a local folder
guarddog pypi verify workspace/guarddog/requirements.txt

# Scan every package referenced in a requirements.txt file and output a SARIF file (works only for verify)
guarddog pypi verify --output-format=sarif workspace/guarddog/requirements.txt

# Output JSON to standard output - works for every command
guarddog pypi scan requests --output-format=json

# All the commands also work on npm or go
guarddog npm scan express

# Run in debug mode
guarddog --log-level debug npm scan express
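
When scans are scripted, the command lines above can be assembled programmatically. The following is a minimal Python sketch, not part of GuardDog: the helper name `build_scan_command` is hypothetical, and it only uses the subcommands and flags shown in the samples above.

```python
from typing import List, Optional, Sequence


def build_scan_command(
    ecosystem: str,
    target: str,
    version: Optional[str] = None,
    rules: Sequence[str] = (),
    exclude_rules: Sequence[str] = (),
    output_format: Optional[str] = None,
) -> List[str]:
    """Build a `guarddog <ecosystem> scan` argument list (hypothetical helper)."""
    if ecosystem not in ("pypi", "npm", "go"):
        raise ValueError(f"unsupported ecosystem: {ecosystem}")
    cmd = ["guarddog", ecosystem, "scan", target]
    if version is not None:
        cmd += ["--version", version]
    for rule in rules:
        cmd += ["--rules", rule]  # the flag is repeated once per rule
    for rule in exclude_rules:
        cmd += ["--exclude-rules", rule]
    if output_format is not None:
        cmd.append(f"--output-format={output_format}")
    return cmd


if __name__ == "__main__":
    # Only prints the command; nothing is executed here.
    print(" ".join(build_scan_command("pypi", "requests", version="2.28.1")))
```

The returned list can be passed to `subprocess.run` as-is, which avoids shell quoting issues when package names or paths come from untrusted input.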

Heuristics

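GuardDog comes with 2 types of heuristics: source code heuristics, which run against the package source code, and metadata heuristics, which run against the package metadata. They are listed below per ecosystem.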

PyPI

Source code heuristics:

Heuristic Description
shady-links Identify when a package contains a URL to a domain with a suspicious extension
obfuscation Identify when a package uses a common obfuscation method often used by malware
clipboard-access Identify when a package reads data from or writes data to the clipboard
exfiltrate-sensitive-data Identify when a package reads and exfiltrates sensitive data from the local system
download-executable Identify when a package downloads and makes executable a remote binary
exec-base64 Identify when a package dynamically executes base64-encoded code
silent-process-execution Identify when a package silently executes an executable
dll-hijacking Identifies when a malicious package manipulates a trusted application into loading a malicious DLL
bidirectional-characters Identify when a package contains bidirectional characters, which can be used to display source code differently than its actual execution. See more at https://trojansource.codes/
steganography Identify when a package retrieves hidden data from an image and executes it
code-execution Identify when an OS command is executed in the setup.py file
cmd-overwrite Identify when the 'install' command is overwritten in setup.py, indicating a piece of code automatically running when the package is installed
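
To make the bidirectional-characters heuristic more concrete, here is a small Python sketch of what such a check can look like. It is an illustration only and not GuardDog's rule: the function name `find_bidi_characters` is hypothetical, and the character set is the list of Unicode embedding, override, and isolate controls described at https://trojansource.codes/.

```python
from typing import List, Tuple

# Unicode bidirectional control characters (embeddings, overrides, isolates).
BIDI_CONTROL_CHARS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}


def find_bidi_characters(source: str) -> List[Tuple[int, int, str]]:
    """Return (line, column, codepoint) for each bidi control character found."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROL_CHARS:
                findings.append((line_no, col, f"U+{ord(ch):04X}"))
    return findings
```

Line and column numbers are 1-based, so a finding can be pointed at directly in an editor.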

Metadata heuristics:

Heuristic Description
empty_information Identify packages with an empty description field
release_zero Identify packages with a release version that's 0.0 or 0.0.0
typosquatting Identify packages whose names closely resemble a highly popular package
potentially_compromised_email_domain Identify when a package maintainer e-mail domain (and therefore package manager account) might have been compromised
unclaimed_maintainer_email_domain Identify when a package maintainer e-mail domain (and therefore package manager account) is unclaimed and can be registered by an attacker
repository_integrity_mismatch Identify packages with a linked GitHub repository where the package has extra unexpected files
single_python_file Identify packages that have only a single Python file
bundled_binary Identify packages bundling binaries
deceptive_author This heuristic detects when an author is using a disposable email
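
Two of the metadata heuristics above, release_zero and typosquatting, are easy to illustrate in a few lines. The Python sketch below is a simplified approximation under stated assumptions, not GuardDog's implementation: the function names are hypothetical, "named closely" is approximated with a plain edit distance, and the list of popular packages is supplied by the caller.

```python
from typing import Iterable, List


def is_release_zero(version: str) -> bool:
    """True when the release version is exactly 0.0 or 0.0.0."""
    return version.strip() in {"0.0", "0.0.0"}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def possible_typosquat_targets(
    name: str, popular: Iterable[str], max_distance: int = 1
) -> List[str]:
    """Popular package names within max_distance edits of name (assumed threshold)."""
    name = name.lower()
    return [p for p in popular if p != name and edit_distance(name, p) <= max_distance]
```

A package named `reqests` would be reported as close to `requests`, while `requests` itself is never flagged against its own name.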

npm

Source code heuristics:

Heuristic Description
npm-serialize-environment Identify when a package serializes 'process.env' to exfiltrate environment variables
npm-obfuscation Identify when a package uses a common obfuscation method often used by malware
npm-silent-process-execution Identify when a package silently executes an executable
shady-links Identify when a package contains a URL to a domain with a suspicious extension
npm-exec-base64 Identify when a package dynamically executes code through 'eval'
npm-install-script Identify when a package has a pre or post-install script automatically running commands
npm-steganography Identify when a package retrieves hidden data from an image and executes it
bidirectional-characters Identify when a package contains bidirectional characters, which can be used to display source code differently than its actual execution. See more at https://trojansource.codes/
npm-dll-hijacking Identifies when a malicious package manipulates a trusted application into loading a malicious DLL
npm-exfiltrate-sensitive-data Identify when a package reads and exfiltrates sensitive data from the local system

Metadata heuristics:

Heuristic Description
empty_information Identify packages with an empty description field
release_zero Identify packages with a release version that's 0.0 or 0.0.0
potentially_compromised_email_domain Identify when a package maintainer e-mail domain (and therefore package manager account) might have been compromised. Note that npm's API may not provide accurate information regarding the maintainer's email, so this detector may cause false positives for npm packages. See https://www.theregister.com/2022/05/10/security_npm_email/
unclaimed_maintainer_email_domain Identify when a package maintainer e-mail domain (and therefore npm account) is unclaimed and can be registered by an attacker. Note that npm's API may not provide accurate information regarding the maintainer's email, so this detector may cause false positives for npm packages. See https://www.theregister.com/2022/05/10/security_npm_email/
typosquatting Identify packages whose names closely resemble a highly popular package
direct_url_dependency Identify packages with direct URL dependencies. Dependencies fetched this way are not immutable and can be used to inject untrusted code or reduce the likelihood of a reproducible install.
npm_metadata_mismatch Identify packages which have mismatches between the npm package manifest and the package info for some critical fields
bundled_binary Identify packages bundling binaries
deceptive_author This heuristic detects when an author is using a disposable email
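
As an illustration of the direct_url_dependency heuristic above, the following Python sketch flags dependencies in a parsed package.json whose version specifier is a URL rather than a registry version range. It is a rough approximation and not GuardDog's detector: the function name is hypothetical, and the set of prefixes treated as "direct URLs" is an assumption.

```python
from typing import Dict

# Assumed prefixes for URL-style dependency specifiers.
URL_PREFIXES = ("http://", "https://", "git://", "git+")


def direct_url_dependencies(manifest: dict) -> Dict[str, str]:
    """Map dependency name -> specifier for specifiers that look like URLs."""
    flagged = {}
    for section in ("dependencies", "devDependencies", "optionalDependencies"):
        for name, spec in manifest.get(section, {}).items():
            if isinstance(spec, str) and spec.startswith(URL_PREFIXES):
                flagged[name] = spec
    return flagged
```

The manifest argument is the dictionary obtained by loading a package.json file with `json.load`.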

go

Source code heuristics:

Heuristic Description
shady-links Identify when a package contains a URL to a domain with a suspicious extension

Custom Rules

GuardDog lets you implement custom source code rules. Source code rules live under the guarddog/analyzer/sourcecode directory, and the supported formats are Semgrep and Yara.

You can write your own rule and drop it into that directory. GuardDog then lets you select or exclude it like any built-in rule, and appends its findings to the output.

For example, you can create the following Semgrep rule:

rules:
  - id: sample-rule 
    languages:
      - python
    message: Output message when rule matches
    metadata:
      description: Description used in the CLI help
    patterns:
        YOUR RULE HEURISTICS GO HERE  
    severity: WARNING

Then save it as sample-rule.yml. Note that the id must match the filename.

In the case of Yara, you can create the following rule:

rule sample_rule
{
  meta:
    description = "Description used in the output message"
    target_entity = "file"
  strings:
    $exec = "exec"
  condition:
    1 of them
}

Then save it as sample_rule.yar (Yara rule identifiers may only contain letters, digits, and underscores).

Note that in both cases, the rule id must match the filename.
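
Because a mismatch between rule id and filename is easy to introduce, a quick pre-check can help before dropping a rule into the directory. The Python sketch below is a hypothetical helper, not something GuardDog ships: it extracts the first `id:` entry from a .yml file or the first `rule <name>` declaration from a .yar file with simple regular expressions, and compares it to the filename without its extension. It does not handle Yara rules declared with modifiers such as `private`.

```python
import re
from pathlib import Path
from typing import Optional

# First "- id: <value>" entry in a Semgrep rule file.
SEMGREP_ID = re.compile(r"^\s*-\s*id:\s*(\S+)", re.MULTILINE)
# First "rule <name>" declaration in a Yara rule file.
YARA_NAME = re.compile(r"^\s*rule\s+([\w-]+)", re.MULTILINE)


def declared_rule_id(filename: str, content: str) -> Optional[str]:
    """Return the rule id declared in the file content, or None if not found."""
    suffix = Path(filename).suffix
    if suffix == ".yml":
        match = SEMGREP_ID.search(content)
    elif suffix == ".yar":
        match = YARA_NAME.search(content)
    else:
        return None
    return match.group(1) if match else None


def id_matches_filename(filename: str, content: str) -> bool:
    """True when the declared rule id equals the filename without extension."""
    return declared_rule_id(filename, content) == Path(filename).stem
```

Running `id_matches_filename` over every file in the rules directory gives a cheap sanity check before invoking GuardDog with a custom rule.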

Running GuardDog in a GitHub Action

The easiest way to integrate GuardDog into your CI pipeline is to generate its output in SARIF format and upload the resulting file to GitHub's code scanning feature.

Using this, you get:

Sample GitHub Action using GuardDog:

name: GuardDog

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

permissions:
  contents: read

jobs:
  guarddog:
    permissions:
      contents: read # for actions/checkout to fetch code
      security-events: write # for github/codeql-action/upload-sarif to upload SARIF results
    name: Scan dependencies
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"

      - name: Install GuardDog
        run: pip install guarddog

      - run: guarddog pypi verify requirements.txt --output-format sarif --exclude-rules repository_integrity_mismatch > guarddog.sarif

      - name: Upload SARIF file to GitHub
        uses: github/codeql-action/upload-sarif@v3
        with:
          category: guarddog-builtin
          sarif_file: guarddog.sarif
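
Outside of GitHub's UI, the generated file can also be inspected locally. The Python sketch below counts findings per rule, assuming only that guarddog.sarif follows the standard SARIF 2.1.0 layout (a top-level `runs` array whose entries hold a `results` array with a `ruleId` per result). The function name is hypothetical and the exact fields GuardDog emits are not verified here.

```python
from collections import Counter


def count_results_by_rule(sarif: dict) -> Counter:
    """Count SARIF results per ruleId across all runs."""
    counts: Counter = Counter()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # Results without a ruleId are grouped under a placeholder key.
            counts[result.get("ruleId", "<unknown>")] += 1
    return counts


# Usage (assumes guarddog.sarif exists in the current directory):
#   import json
#   with open("guarddog.sarif", encoding="utf-8") as fh:
#       print(count_results_by_rule(json.load(fh)).most_common())
```

A summary like this can be used to fail a build only when specific rules fire, rather than on any finding.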

Development

Running a local version of GuardDog

Using pip

Using poetry

Unit tests

Running all unit tests: make test

Running unit tests against Semgrep rules: make test-semgrep-rules. These use the standard methodology for testing Semgrep rules.

Running unit tests against package metadata heuristics: make test-metadata-rules.

Benchmarking

You can run GuardDog on legitimate and malicious packages to determine false positives and false negatives. See ./tests/samples

Code quality checks

Run the type checker with

mypy --install-types --non-interactive guarddog

and the linter with

flake8 guarddog --count --select=E9,F63,F7,F82 --show-source --statistics --exclude tests/analyzer/sourcecode,tests/analyzer/metadata/resources,evaluator/data
flake8 guarddog --count --max-line-length=120 --statistics --exclude tests/analyzer/sourcecode,tests/analyzer/metadata/resources,evaluator/data --ignore=E203,W503

Maintainers

Authors:

Acknowledgments

Inspiration: