clearlydefined / service

The service side of clearlydefined.io
MIT License

Version Control data as a schema attribute #760

Open byjrack opened 3 years ago

byjrack commented 3 years ago

I am getting asked more often for contextual data on a given 3rd party work that is tied to the version control system. This may be more suited for the umbrella efforts like ClearlySecure, but helping to pull together all these pointers for a given work would be helpful. Maybe ClearlyMaintained aligned with the work done in CHAOSS?

Solutioning for this is complicated, I get it. Some package managers have attributes that could be harvested, but that is very hit or miss. It feels similar to the problem of connecting a Component to a CPE in order to tie it to the NVD.

jeffmcaffer commented 3 years ago

@byjrack are you talking about the source location for a given package? ClearlyDefined has that now and we do our best to populate it as accurately as possible. It is also curatable. Can you give some examples of additional data you'd like to see?

byjrack commented 3 years ago

@jeffmcaffer it may have been the samples I was using, as they were heavily biased toward the Java ecosystem and some R "archeology" I was doing back then. A common problem I face is, given a JAR name (or hopefully a GAV), knowing where it is being maintained. NPM and more modern ecosystems are more tightly coupled to their version control, so the problem is smaller there. I find the ClearlyDefined source to be a bit ambiguous in use: sometimes it is a registry and sometimes the version control, if there is anything at all.

For example, given "geronimo-stax-api_1.0_spec" I could map it to https://clearlydefined.io/definitions/maven/mavencentral/org.apache.geronimo.specs/geronimo-stax-api_1.0_spec/1.0.1 as a fuzzy match. But the source just leads me back to central.maven (which needs to be redone for search.maven), https://search.maven.org/artifact/org.apache.geronimo.specs/geronimo-stax-api_1.0_spec/1.0.1/bundle, which in this case does include the scm tag.
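To make the mapping in this example concrete, here is a minimal sketch that turns a Maven GAV into a definition URL of the shape quoted above. The function names `parse_gav` and `definition_url` are hypothetical, and the URL pattern is inferred only from the single link in this comment, so treat it as an assumption rather than a documented contract.

```python
from urllib.parse import quote

# Base taken from the definition link quoted in the comment above.
DEFINITIONS_BASE = "https://clearlydefined.io/definitions"


def parse_gav(gav: str) -> tuple[str, str, str]:
    """Split a 'groupId:artifactId:version' string into its three parts."""
    parts = gav.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected groupId:artifactId:version, got {gav!r}")
    return parts[0], parts[1], parts[2]


def definition_url(group_id: str, artifact_id: str, version: str,
                   provider: str = "mavencentral") -> str:
    """Build a definition URL for a Maven component (pattern is an assumption)."""
    segments = ["maven", provider, group_id, artifact_id, version]
    # Percent-encode each segment so unusual characters cannot break the path.
    return DEFINITIONS_BASE + "/" + "/".join(quote(s, safe="") for s in segments)


if __name__ == "__main__":
    gav = "org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:1.0.1"
    print(definition_url(*parse_gav(gav)))
```

This only covers the case where the full GAV is already known; going from a bare JAR name to a GAV (the "fuzzy" part) is the harder problem and is not addressed here.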

So in this specific case, reducing the walk across datasets is valuable. Having this data as a facet could also allow maintenance-type data to be collocated with it, but that requires a lot of work to keep fresh. There is a clear overlap with CHAOSS/Augur here, which visualizes maintenance health for a work, but Augur takes the VCS as input. In most cases teams only know the registry location of a work, so when those connections lack the VCS data it becomes manual reconciliation. So I am trying to find a way to connect those dots automatically.
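One piece of "connecting the dots" for Maven is reading the scm element out of a POM when it is present. The sketch below uses only the Python standard library and assumes the POM declares the standard `http://maven.apache.org/POM/4.0.0` namespace; POMs without that namespace, or with the scm block inherited from a parent POM, would return nothing here. The function name `extract_scm` and the sample POM are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Standard namespace used by Maven POM 4.0.0 documents.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Child elements of <scm> defined by the POM model.
SCM_FIELDS = ("url", "connection", "developerConnection", "tag")


def extract_scm(pom_xml: str) -> dict[str, str]:
    """Return the non-empty <scm> fields of a POM, or {} if there is no <scm>."""
    root = ET.fromstring(pom_xml)
    scm = root.find("m:scm", POM_NS)
    if scm is None:
        return {}
    found = {}
    for field in SCM_FIELDS:
        value = scm.findtext(f"m:{field}", default=None, namespaces=POM_NS)
        if value and value.strip():
            found[field] = value.strip()
    return found


if __name__ == "__main__":
    # Hypothetical POM fragment, not taken from any real artifact.
    sample = (
        '<project xmlns="http://maven.apache.org/POM/4.0.0">'
        "<scm>"
        "<url>https://example.org/repo</url>"
        "<connection>scm:git:https://example.org/repo.git</connection>"
        "</scm>"
        "</project>"
    )
    print(extract_scm(sample))
```

The resulting URL could then be handed to a tool that takes the VCS as input, though whether the declared location is still where the work is maintained would need separate checking.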

You can also see the SCA vendors starting to pick up on the need to characterize EOL or abandoned artifacts by pulling in this type of data (e.g., https://snyk.io/advisor/). For more mature ecosystems, though, I find that making the connection from artifact to maintained location is complicated because the landscape has changed so much. Never mind specific cases like glassfish that have changed "owners" and platforms so many times.

Does that help to better explain?

jeffmcaffer commented 3 years ago

Makes sense. In the ClearlyDefined context we have focused (to date) on compliance, and that drives folks mostly to want the source commit (vs. just the repo or wherever the project is maintained, etc.). In the case of Maven, as you say, the source repo is sometimes elusive but the source jar is often available in maven central.

This came up in another issue related to Maven. AFAIK there's not a lot of Maven expertise on the team so anyone who has suggestions for how we can systematically improve the source connection is encouraged to speak up.

Beyond the source location, personally I'd be keen to enable integration with CHAOSS and ensure that ClearlyDefined definitions have enough info (in the proper form) to enable a user to go from ClearlyDefined to CHAOSS and get the more "community" related info.

byjrack commented 3 years ago

Yup, this item was about connecting an artifact back to its community to allow for further analysis. That could be metrics like we see in CHAOSS, or researching the lineage of the work, especially through changes of ownership and successful forks. That data can help with making recommendations to developers if/when a community goes stagnant. That was also the driver for the comment around "ClearlyMaintained" vs. just "Defined"; agreed, it is a bit of scope creep, as it reflects our bias toward what "compliance" requires.

Some ecosystems and their registries, like npm, have made it the norm to link back to version control, but many don't have that, especially older ecosystems. And often the metadata is not packaged with the distributable of the work (JAR, EAR, TAR, etc.), so harvesting that data is more complicated. This becomes even more important for multi-distributable works, where N JARs might come from a single development workspace, because you don't have that context when looking at just the artifacts themselves.
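As a rough illustration of the multi-distributable point, once each artifact has some source location attached, artifacts that share a workspace can be grouped by a normalized form of that location. This is a sketch only: the names `normalize_scm` and `group_by_source` are hypothetical, the normalization rules are a simplification, and the coordinates and URLs in the demo are invented.

```python
from collections import defaultdict

# Prefixes used by Maven-style SCM connection strings.
SCM_PREFIXES = ("scm:git:", "scm:svn:", "scm:hg:")


def normalize_scm(location: str) -> str:
    """Reduce an SCM string to a comparable key (simplified, assumption-laden)."""
    key = location.strip()
    for prefix in SCM_PREFIXES:
        if key.startswith(prefix):
            key = key[len(prefix):]
            break
    key = key.rstrip("/")
    if key.endswith(".git"):
        key = key[: -len(".git")]
    return key


def group_by_source(artifacts: dict[str, str]) -> dict[str, list[str]]:
    """Map each normalized source location to the sorted artifacts that cite it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for coordinates, location in artifacts.items():
        groups[normalize_scm(location)].append(coordinates)
    return {source: sorted(names) for source, names in groups.items()}


if __name__ == "__main__":
    # Invented coordinates and locations, for illustration only.
    demo = {
        "org.example:core:1.0": "scm:git:https://example.org/acme/platform.git",
        "org.example:api:1.0": "https://example.org/acme/platform/",
        "org.other:util:2.1": "https://example.org/other/util",
    }
    for source, names in group_by_source(demo).items():
        print(source, names)
```

Real data would need more careful matching (redirects, renamed or moved repositories, subdirectory paths), which is where the manual reconciliation mentioned earlier in the thread tends to come in.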