aboutcode-org / purldb-data

A dataset of purl for offline lookup and verification usage. This project is sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase/ and nexB for https://www.aboutcode.org/ Chat is at https://gitter.im/aboutcode-org/discuss

Wrong dump data for Debian packages #1

Open armijnhemel opened 1 year ago

armijnhemel commented 1 year ago

Not sure if this should go here or another repository, so feel free to move.

I just looked at deb-purls-aa.json.zst and saw this line:

{"purl":"pkg:deb/0ad@0.0.23.1-4","download_url":"http://ftp.debian.org/debian/pool/main/0/0ad/0ad_0.0.23.1.orig.tar.xz"}

The package version and the referenced source code file do not match: the file in download_url is the original upstream tarball, which is the same for multiple Debian revisions. The version only becomes -4 after applying the Debian-specific patches, so these should probably be included as well. The patches for -4 are no longer available via the Debian FTP, but those for -5 are.
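To make the mismatch concrete, here is a minimal Python sketch that splits the Debian version from the purl into its upstream part and Debian revision, then compares both with the file name in download_url. The helper names are hypothetical, and the purl handling is a deliberately naive string split rather than a full purl parser.

```python
import posixpath
from urllib.parse import urlparse


def split_debian_version(version):
    """Split '[epoch:]upstream[-revision]' into (epoch, upstream, revision)."""
    epoch, sep, rest = version.partition(":")
    if not sep:
        # No epoch present: the whole string is upstream[-revision].
        epoch, rest = "", version
    # The Debian revision is whatever follows the last hyphen, if any.
    upstream, sep, revision = rest.rpartition("-")
    if not sep:
        upstream, revision = rest, ""
    return epoch, upstream, revision


def purl_version(purl):
    """Naive version extraction; assumes no qualifiers or subpath in the purl."""
    return purl.rpartition("@")[2]


# The record quoted from deb-purls-aa.json.zst above.
record = {
    "purl": "pkg:deb/0ad@0.0.23.1-4",
    "download_url": "http://ftp.debian.org/debian/pool/main/0/0ad/0ad_0.0.23.1.orig.tar.xz",
}

version = purl_version(record["purl"])
_, upstream, revision = split_debian_version(version)
filename = posixpath.basename(urlparse(record["download_url"]).path)

# The file name carries only the upstream version, not the Debian revision.
print(upstream in filename)  # True
print(version in filename)   # False
```

A check along these lines could flag records whose purl carries a Debian revision while the download URL points only at the .orig tarball.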

The .dsc file for -5 says:

Files:
 4fa111410ea55de7a013406ac1013668 31922812 0ad_0.0.23.1.orig.tar.xz
 43a5bf77192a8eebdbe763cdd1d72fa3 73620 0ad_0.0.23.1-5.debian.tar.xz

So possibly you should not have this as a single download URL, but as a list of download URLs.
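As a rough illustration of how such a list could be derived, the sketch below reads the Files field quoted above and builds one URL per listed file. The function name is hypothetical, and it assumes that every file listed in the .dsc sits in the same pool directory as the .orig tarball from the dump, which may not hold once files move to the archive.

```python
# The "Files:" field from the .dsc for -5, as quoted above.
DSC_TEXT = """\
Files:
 4fa111410ea55de7a013406ac1013668 31922812 0ad_0.0.23.1.orig.tar.xz
 43a5bf77192a8eebdbe763cdd1d72fa3 73620 0ad_0.0.23.1-5.debian.tar.xz
"""

# Assumption: same pool directory as the download_url in the dump.
POOL_DIR = "http://ftp.debian.org/debian/pool/main/0/0ad/"


def dsc_files(dsc_text):
    """Return (md5, size, filename) tuples from the 'Files:' field of a .dsc."""
    entries = []
    in_files = False
    for line in dsc_text.splitlines():
        if line.startswith("Files:"):
            in_files = True
        elif in_files and line.startswith(" "):
            # Continuation lines look like: " <md5> <size> <filename>"
            md5, size, name = line.split()
            entries.append((md5, int(size), name))
        else:
            # Any other line ends the field.
            in_files = False
    return entries


download_urls = [POOL_DIR + name for _, _, name in dsc_files(DSC_TEXT)]
for url in download_urls:
    print(url)
```

Keeping the checksum and size next to each file name would also allow verifying a download after a URL has moved.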

Also, with Debian these URLs tend to get moved (granted, after many years) to their archive. It might be good to take a closer look at https://github.com/nexB/fetchcode/issues/82

pombredanne commented 1 year ago

@armijnhemel you have eagle eyes! Thanks for the report. I do not yet have a good, mostly universal solution for cases where multiple download URLs exist for a single package, like the one you found, where patches and sources are combined into a binary.

pombredanne commented 1 year ago

The point is that, for now, the model is one download URL == one record in the purldb. We can, however, track multiple purls for the related source packages, though we do not have the proper DB models and relationships yet.

armijnhemel commented 1 year ago

> The point is that, for now, the model is one download URL == one record in the purldb. We can, however, track multiple purls for the related source packages, though we do not have the proper DB models and relationships yet.

Having thought a bit about this there are some other issues as well, which can possibly interfere (not in this particular case, but in general).

First of all, there is the situation where there are multiple files/download URLs that point to the same package. For example, let's look at GNU binutils: https://ftp.gnu.org/gnu/binutils/

For 2.30 there are four distinct downloads: a .tar.bz2, a .tar.gz, a .tar.lz and a .tar.xz. These are all equivalent and should map to the same package URL and possibly back as well.
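One way to picture the "equivalent downloads" case is to group URLs by file name with the compression suffix removed. The sketch below does that in Python. The function names are hypothetical, and the file names assume the usual binutils-2.30.tar.* naming in that directory.

```python
from collections import defaultdict

# The four compression formats mentioned above.
ARCHIVE_SUFFIXES = (".tar.bz2", ".tar.gz", ".tar.lz", ".tar.xz")


def strip_archive_suffix(filename):
    """Drop a known archive suffix so equivalent downloads share one key."""
    for suffix in ARCHIVE_SUFFIXES:
        if filename.endswith(suffix):
            return filename[: -len(suffix)]
    return filename


def group_by_release(urls):
    """Map 'name-version' keys to every download URL that carries that release."""
    groups = defaultdict(list)
    for url in urls:
        filename = url.rsplit("/", 1)[-1]
        groups[strip_archive_suffix(filename)].append(url)
    return dict(groups)


BASE = "https://ftp.gnu.org/gnu/binutils/"
# Assumed file names, following the usual <name>-<version>.tar.<ext> pattern.
urls = [BASE + "binutils-2.30" + suffix for suffix in ARCHIVE_SUFFIXES]

groups = group_by_release(urls)
print(list(groups))                 # a single key for all four downloads
print(len(groups["binutils-2.30"]))
```

The same grouping, read in reverse, gives the "and possibly back as well" direction: one release key resolves to all of its download URLs.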

Then there is the situation where multiple components/sources are used in a certain configuration (like in the Debian example). So what I could envision is that download_url for a version would be something like this:

download_url = [
    [url1, patch1, patch2],
    [url2, patch1, patch2]
]

Or something like that.
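A minimal Python sketch of how that nested structure could be modeled follows. The class and field names are hypothetical and are not part of the current purldb model; url1, url2, patch1 and patch2 are placeholders, as in the outline above.

```python
from dataclasses import dataclass, field


@dataclass
class SourceSet:
    """One way to obtain the sources: a base archive plus patches applied on top."""
    base_url: str
    patch_urls: list = field(default_factory=list)


@dataclass
class PackageRecord:
    """A purl with several alternative source sets instead of a single URL."""
    purl: str
    source_sets: list = field(default_factory=list)

    def all_urls(self):
        """Every distinct URL referenced by this record, in first-seen order."""
        seen = []
        for source_set in self.source_sets:
            for url in [source_set.base_url, *source_set.patch_urls]:
                if url not in seen:
                    seen.append(url)
        return seen


record = PackageRecord(
    purl="pkg:deb/0ad@0.0.23.1-4",
    source_sets=[
        SourceSet("url1", ["patch1", "patch2"]),
        SourceSet("url2", ["patch1", "patch2"]),
    ],
)
print(record.all_urls())
```

Each inner list from the outline becomes one SourceSet, so equivalent base archives (such as different compression formats) and shared patches can both be represented.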

armijnhemel commented 1 year ago

Some more thoughts: Debian typically renames the original files (to something like foo_bar-1.0.orig.tar.gz if the original is called foo_bar-1.0.tar.gz). It also lowercases the files and replaces - with _.

A question: when encountering these (without patches or other files, just standalone), should they be mapped to the original package or to the Debian package? There is something to say for both.