anchore / syft

CLI tool and library for generating a Software Bill of Materials from container images and filesystems
Apache License 2.0
6.24k stars 574 forks

pip cataloger should support repository url #680

Open sambhav opened 2 years ago

sambhav commented 2 years ago

What would you like to be added:

When pip packages are installed from a non-default index (that is, anything other than PyPI), we should store the pip repository URL in the SBOM.

Why is this needed: it is useful to know the origin of a package.

Additional context:

wagoodman commented 2 years ago

This is a great idea. Do you happen to know if this information is stored for each installed package? That is, when looking at a site-packages directory with several installations, is it possible to conclude locally which package was pulled from which pip index?

sambhav commented 2 years ago

@wagoodman - sadly it looks like this information is not available. cc: @pradyunsg if you have any more details.

pradyunsg commented 2 years ago

This information is not stored in the metadata by pip.
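To see what an installed distribution does record, the `.dist-info` files can be listed with `importlib.metadata`. This is a minimal sketch, not how Syft catalogs packages: `recorded_origins` is a hypothetical helper name, and the only origin-related files it looks at are `INSTALLER` and `direct_url.json`. The latter is written per PEP 610 only for direct URL installs (VCS, local path, archive URL), so for a package resolved through an index it is absent and no index URL appears anywhere.

```python
import json
from importlib import metadata


def recorded_origins(paths=None):
    """List the origin-related metadata each installed distribution records.

    `paths` is a list of directories to scan (for example a site-packages
    directory); when omitted, the current interpreter's sys.path is used.
    """
    if paths is None:
        dists = metadata.distributions()
    else:
        dists = metadata.distributions(path=paths)

    rows = []
    for dist in dists:
        # read_text returns None when the file is not present in .dist-info
        installer = dist.read_text("INSTALLER")
        direct_url = dist.read_text("direct_url.json")
        rows.append(
            {
                "name": dist.metadata["Name"],
                "version": dist.version,
                "installer": installer.strip() if installer else None,
                # Present only for direct URL installs (PEP 610),
                # never for installs resolved through an index.
                "direct_url": json.loads(direct_url) if direct_url else None,
            }
        )
    return rows
```

Calling `recorded_origins(["/path/to/site-packages"])` on a typical environment would be expected to show `direct_url` as `None` for every index-installed package, which is the gap this issue describes.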

The only way to get this is by controlling the pip install call and checking which index URL it is using (likely via pip config and PIP_INDEX_URL).
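A rough sketch of that check, assuming the caller controls the environment in which `pip install` runs. The helper names `pip_config_get` and `resolve_index_url` are hypothetical. The sketch only considers the `PIP_INDEX_URL` environment variable, the `global.index-url` config key, and pip's default index; it ignores `--index-url` passed on the command line, extra index URLs, and find-links, so it reports what pip is configured to use now, not where an already-installed package came from.

```python
import os
import subprocess
import sys

# pip's default index when nothing else is configured
DEFAULT_INDEX_URL = "https://pypi.org/simple"


def pip_config_get(key):
    """Return the value `pip config get <key>` prints, or None if unset."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "config", "get", key],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        # pip exits non-zero when the key does not exist
        return None
    return result.stdout.strip() or None


def resolve_index_url(environ=None, config_get=pip_config_get):
    """Pick the index URL: environment variable, then config, then default."""
    environ = os.environ if environ is None else environ
    return (
        environ.get("PIP_INDEX_URL")
        or config_get("global.index-url")
        or DEFAULT_INDEX_URL
    )
```

Passing `environ` and `config_get` explicitly keeps the precedence logic testable without invoking pip.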

sambhav commented 2 years ago

@pradyunsg that might be tricky though, right? pip might have installed it from one of the extra index URLs or via find-links, some of which may be project-specific configuration rather than a global pip option. Would it be worth opening an issue for pip to store this metadata? The rationale is that different index servers might host different copies of the same package/version, and we might want to identify the origin in the output SBOM/vuln analysis.

pradyunsg commented 2 years ago

You might want to check with pip-audit folks, who are generating SBOMs. If that doesn't go anywhere, filing an issue against pip seems reasonable to me!

sambhav commented 2 years ago

Created https://github.com/pypa/pip/issues/10736

luhring commented 2 years ago

IIUC, we should consider this issue "blocked" until the data is available for Syft to observe in the scan target. If I'm wrong here, just let me know! 😄