clearlydefined / service

The service side of clearlydefined.io
MIT License

Include release version information from github #743

Open vbasem opened 4 years ago

vbasem commented 4 years ago

Currently, the GitHub information associated with a component does not include that component's release version.

It is becoming a rather common pattern to handle versioning and releases within GitHub itself, usually based on tags; log4j is one example.

One approach would be to use the releases API for every repository being scanned and store the version information beside the commit hash.

It would be very helpful to be able to look up source code on GitHub using a component's name and version in that fashion.
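As a rough illustration of that idea, the sketch below joins release data with tag data to get a version-to-hash mapping. It assumes the payload shapes of GitHub's REST endpoints for listing releases (`tag_name`) and listing tags (`name`, `commit.sha`); the function name and the sample payloads are hypothetical, and nothing here reflects how the ClearlyDefined crawler actually works.

```python
from typing import Dict, List


def map_releases_to_commits(releases: List[dict], tags: List[dict]) -> Dict[str, str]:
    """Return {release tag name: commit SHA} for releases whose tag is known.

    `releases` and `tags` are assumed to be the decoded JSON lists from
    GET /repos/{owner}/{repo}/releases and GET /repos/{owner}/{repo}/tags.
    """
    sha_by_tag = {tag["name"]: tag["commit"]["sha"] for tag in tags}
    return {
        release["tag_name"]: sha_by_tag[release["tag_name"]]
        for release in releases
        if release["tag_name"] in sha_by_tag
    }


# Invented sample payloads, trimmed to the fields used above.
sample_releases = [{"tag_name": "v1.2.0"}, {"tag_name": "v1.1.0"}]
sample_tags = [
    {"name": "v1.2.0", "commit": {"sha": "24b81a50a92b6c2a4d4d8e40c52ba27653f3f07b"}},
    {"name": "v1.0.0", "commit": {"sha": "0" * 40}},
]

version_to_sha = map_releases_to_commits(sample_releases, sample_tags)
```

Releases without a matching tag entry (here `v1.1.0`) are skipped rather than guessed at, so the stored version always points at a concrete commit.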

Please let me know if this is something worth considering and pursuing.

Basem Vaseghi basem.vaseghi@daimler.com, Daimler TSS GmbH, Impressum

jeffmendoza commented 4 years ago

Interestingly, CD can harvest GitHub components using a tag, but that is not how we queue them. I believe the reasoning is that the hash is immutable. When the harvesters harvest a package, they look up the hash corresponding to a tag that matches the package's revision, and queue for that hash.

Currently, queueing harvests based on tag duplicates work, which we don't want to do. Ideally, we would need a way for both the tag and the hash to reference a single definition.
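One way to picture "both the tag and the hash reference a single definition" is an alias table in front of the definition store. The following is only a sketch under that assumption: `DefinitionIndex`, its methods, and the sample repository name are hypothetical and are not part of the ClearlyDefined service.

```python
from typing import Dict, Iterable, Optional, Tuple


class DefinitionIndex:
    """Stores one definition per (repo, commit hash); tags are aliases."""

    def __init__(self) -> None:
        self._definitions: Dict[Tuple[str, str], dict] = {}
        self._tag_to_sha: Dict[Tuple[str, str], str] = {}

    def add(self, repo: str, sha: str, definition: dict, tags: Iterable[str] = ()) -> None:
        # The definition is stored once, keyed by the immutable hash.
        self._definitions[(repo, sha)] = definition
        for tag in tags:
            self._tag_to_sha[(repo, tag)] = sha

    def get(self, repo: str, revision: str) -> Optional[dict]:
        # Resolve a tag to its hash; anything else is treated as a hash.
        sha = self._tag_to_sha.get((repo, revision), revision)
        return self._definitions.get((repo, sha))


sha = "24b81a50a92b6c2a4d4d8e40c52ba27653f3f07b"
index = DefinitionIndex()
index.add("example/repo", sha, {"files": 55}, tags=["v1.2.0"])

by_tag = index.get("example/repo", "v1.2.0")
by_hash = index.get("example/repo", sha)
```

Because the tag only points at the hash, a harvest queued by hash is never repeated for the tag; both lookups return the same stored object.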

vbasem commented 4 years ago

As far as I understand, GitHub releases are tag based. Tags, being read-only branches in git, refer to an immutable commit hash. Hence, having a connection between a hash and a tag/release version is a reasonable thing. It is just uncommon for artifacts to be published with their git hashes as the version. We don't have examples of that in Go before go mod and a better versioning and release concept were introduced. So, in the spirit of license checking, I believe it is more likely that the search will be performed using released artifacts. Is that agreeable?

So, from my understanding, the harvester doesn't look up the tag info as it goes through the queue. Are we able to differentiate between hashes and release versions in the data structure? From what I recognize in the data:

  revision: 24b81a50a92b6c2a4d4d8e40c52ba27653f3f07b
  type: git
described:
  files: 55
  hashes:
    gitSha: 24b81a50a92b6c2a4d4d8e40c52ba27653f3f07b

the hash is being set as both revision and gitSha. So if a search is performed using a git hash, I would instinctively expect us to look at the gitSha field rather than revision. Which leaves the question: does it make sense to replace revision with the release information when available?
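To make the question concrete, here is a minimal sketch of a lookup that tells a hash from a release version and picks the field accordingly. It assumes full 40-character SHA-1 hashes and a flattened record with just `revision` and `gitSha` keys; the function names and the sample record are hypothetical, and a tag that happens to look like a 40-character hex string would be misclassified.

```python
import re
from typing import List, Optional

# Assumption: commit hashes are full-length SHA-1 values (40 hex digits).
FULL_SHA1 = re.compile(r"[0-9a-f]{40}")


def looks_like_commit_hash(revision: str) -> bool:
    return FULL_SHA1.fullmatch(revision.lower()) is not None


def find_definition(definitions: List[dict], revision: str) -> Optional[dict]:
    """Match hashes against gitSha and anything else against revision."""
    field = "gitSha" if looks_like_commit_hash(revision) else "revision"
    wanted = revision.lower() if field == "gitSha" else revision
    for definition in definitions:
        if definition.get(field) == wanted:
            return definition
    return None


# Hypothetical record in which revision holds the release version.
records = [
    {"revision": "v1.2.0", "gitSha": "24b81a50a92b6c2a4d4d8e40c52ba27653f3f07b"},
]

found_by_hash = find_definition(records, "24b81a50a92b6c2a4d4d8e40c52ba27653f3f07b")
found_by_version = find_definition(records, "v1.2.0")
```

With revision holding the release version and gitSha holding the hash, both searches land on the same record.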