aboutcode-org / purldb

Tools to create and expose a database of purls (Package URLs). This project is sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase/ and nexB for https://www.aboutcode.org/ Chat is at https://gitter.im/aboutcode-org/discuss
https://purldb.readthedocs.io/

Suggestion: add wikidata information for a package, if available #39

Open armijnhemel opened 1 year ago

armijnhemel commented 1 year ago

For quite a few open source packages there is a Wikidata identifier, for example for GNU bash: https://www.wikidata.org/wiki/Q189248

It could be interesting to add this information where available, so that other tools could use the identifier to, for example, look up or display data from Wikipedia. I am not suggesting that the contents of Wikipedia pages be indexed, only that Wikidata identifiers be recorded if available.

daredevil3435 commented 1 year ago

can you explain more? I would like to work on this.

armijnhemel commented 1 year ago

re: "can you explain more? I would like to work on this."

(Tagging @pombredanne for additional background information)

The way I would envision it is that purldb would have an optional field that could hold the Wikidata identifier. Wikidata identifiers are short alphanumeric identifiers starting with 'Q'. The one for bash is 'Q189248' (as linked above). So if I queried purldb for bash, or otherwise got results for bash (not sure how that will work, @pombredanne can probably clarify), and there is a known Wikidata identifier, it would be returned. I suppose the data model of purldb allows for this kind of extra data.
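To make the idea of an optional field concrete, here is a minimal sketch. `PackageRecord` and its `wikidata_id` field are hypothetical stand-ins and not the actual purldb model, and the purl is only an illustrative value. The check relies on Wikidata item identifiers being 'Q' followed by a positive integer.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Wikidata item identifiers are "Q" followed by a positive integer, e.g. "Q189248".
WIKIDATA_ITEM_PATTERN = re.compile(r"Q[1-9][0-9]*")


def is_wikidata_item_id(value: str) -> bool:
    """Return True if value looks like a Wikidata item identifier."""
    return WIKIDATA_ITEM_PATTERN.fullmatch(value) is not None


@dataclass
class PackageRecord:
    # Hypothetical stand-in for a purldb package entry, NOT the real model.
    purl: str
    # Optional: most packages are expected to have no Wikidata item.
    wikidata_id: Optional[str] = None

    def __post_init__(self):
        if self.wikidata_id is not None and not is_wikidata_item_id(self.wikidata_id):
            raise ValueError(f"not a Wikidata item identifier: {self.wikidata_id!r}")


# Illustrative purl; the identifier is the one for GNU bash linked above.
bash = PackageRecord(purl="pkg:generic/bash", wikidata_id="Q189248")
```

A package without a known identifier simply leaves `wikidata_id` as `None`, so a query result would only include it when it is set.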

On the indexing side things are a little bit murkier. Not every package out there will have a Wikidata item associated with it. In fact, I am expecting it to be fairly rare. There are a few methods I can think of:

  1. manual collection of identifiers and keeping a mapping, perhaps done via some visual editing
  2. querying the Wikidata API using SPARQL, see https://w.wiki/6NDk for an example (click "Run query" at the bottom of the page), grabbing the results in CSV, TSV or JSON, and then post-processing them. There seems to be a property "source code repository" that can be searched for. This would not be complete (there could be entries without this property), but it is probably a really good start.
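The second method could be scripted roughly as follows. This is a sketch and not necessarily the same query as the short link above: it assumes that `P1324` is the Wikidata property for the source code repository, and the variable names `?item` and `?repo` are my own choice. It only builds the request URL for the public SPARQL endpoint and does not send it.

```python
from urllib.parse import urlencode

# Public Wikidata SPARQL endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumption: P1324 is the "source code repository" property.
SOURCE_REPO_PROPERTY = "P1324"


def build_query(limit=None):
    """Build a SPARQL query selecting items that have a source code repository."""
    query = f"SELECT ?item ?repo WHERE {{ ?item wdt:{SOURCE_REPO_PROPERTY} ?repo . }}"
    if limit is not None:
        # Useful while experimenting, to avoid fetching every entry.
        query += f" LIMIT {int(limit)}"
    return query


def build_request_url(query):
    """Return the GET URL that asks the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


url = build_request_url(build_query(limit=10))
```

The resulting URL can be fetched with any HTTP client. Wikimedia asks clients to send a descriptive User-Agent header, so a real script should set one.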

It looks like there are currently a bit over 15,000 entries in Wikidata that have the property "source code repository". After extracting the Wikidata identifier from the search results, the data for the entry itself can be queried and cross-correlated with any existing data in purldb to enhance it.
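A rough sketch of the extraction and cross-correlation step, under several assumptions: the results are in the standard SPARQL JSON results layout with variables named `item` and `repo`, packages are plain dicts, and the `vcs_url` and `wikidata_id` keys are hypothetical names rather than confirmed purldb fields. The sample data is made up for illustration, and the URL normalization is deliberately naive.

```python
from urllib.parse import urlsplit


def qid_from_entity_uri(uri):
    """Return the trailing identifier of an entity URI such as .../entity/Q189248."""
    return uri.rsplit("/", 1)[-1]


def normalize_repo_url(url):
    """Reduce a repository URL to host + path so small variations still match.

    Simplification: the scheme is dropped, a trailing "/" or ".git" is removed,
    and everything is lowercased (not correct for case-sensitive hosts).
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    return parts.netloc.lower() + path.lower()


def index_by_repo(sparql_json):
    """Map normalized repository URL -> Wikidata identifier."""
    index = {}
    for row in sparql_json["results"]["bindings"]:
        repo = normalize_repo_url(row["repo"]["value"])
        index[repo] = qid_from_entity_uri(row["item"]["value"])
    return index


def attach_wikidata_ids(packages, index):
    """Set the hypothetical 'wikidata_id' key on packages whose repository matches."""
    for package in packages:
        repo = package.get("vcs_url")  # hypothetical key name
        if repo:
            qid = index.get(normalize_repo_url(repo))
            if qid:
                package["wikidata_id"] = qid
    return packages


# Made-up sample input; the repository URLs are placeholders, not real data.
sample_results = {
    "results": {
        "bindings": [
            {
                "item": {"value": "http://www.wikidata.org/entity/Q189248"},
                "repo": {"value": "https://example.org/git/bash.git"},
            }
        ]
    }
}
packages = [
    {"purl": "pkg:generic/bash", "vcs_url": "git+https://Example.org/git/bash/"},
    {"purl": "pkg:generic/other"},
]
attach_wikidata_ids(packages, index_by_repo(sample_results))
```

Matching on the repository URL is only one possible key; homepage URLs or package names could be tried as well, at the cost of more false matches.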

armijnhemel commented 1 year ago

wikidata.zip

I manually ran a query for the property "source code repository" in Wikidata and downloaded the results as JSON. I have attached them here. This would just be a first step.

@pombredanne would this work best as an "improver"?

pombredanne commented 1 year ago

re: "would this work best as an "improver"?"

That's going to be a regular visitor/mapper here IMHO

armijnhemel commented 1 year ago

There is more available in the Wikidata data that could be useful for cross-referencing. Looking at, for example, https://www.wikidata.org/wiki/Q11246433 there is:

and perhaps a few other useful things.