DependencyTrack / hyades

Incubating project for decoupling responsibilities from Dependency-Track's monolithic API server into separate, scalable services.
https://dependencytrack.github.io/hyades/latest
Apache License 2.0

Support component integrity verification #699

Closed mehab closed 10 months ago

mehab commented 1 year ago

This issue expands on an issue in upstream Dependency-Track. An initial POC for this has been completed and demoed using hyades-apiserver and hyades. The features below need to be addressed as part of the actual implementation:

### Tasks
- [ ] Allow integrity checks for Maven components
- [ ] The integrity check would be enabled/disabled universally with a single feature flag/config in the apiserver
- [ ] The meta information, along with the published date and hashes, would be sent back from hyades to the apiserver in one message on the same topic as the meta analysis result
- [ ] When the published date for a component is already present in the DB, we do not need to make the API call for it
- [ ] The component hashes and published date would be stored in a separate table
- [ ] This table would be used to perform the hash check on components when the info is present; if not, an API call would be requested
- [ ] Regardless of whether the integrity check is enabled or disabled, we would fetch the component information and store it in the DB. If the integrity check is enabled, the analysis would also be performed and stored; if not enabled, the analysis would be skipped
- [ ] Implement this functionality for npm and PyPI as well. The initial understanding is that the source of truth used to validate integrity information uses a HEAD request to the dependency files to reveal their checksum information. This will be expanded on in the future
- [ ] In addition to BOM upload, this check should also happen on re-analysis and on manual addition of components
- [ ] Project deletion should still complete successfully
- [ ] Unit tests for the entire functionality
- [ ] Future enhancement: validate whether a given repo URL supports integrity checks
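The hash-check step described in the tasks above can be sketched as follows. This is an illustrative helper, not the actual Hyades implementation: the function name, the status strings, and the shape of the metadata row are all assumptions. It compares the hashes known for a component against the hashes stored from the repository, and signals when no stored metadata exists so the caller knows an API call is needed.

```python
# Hypothetical sketch of the integrity-check decision; names and
# status values are illustrative, not the actual Hyades code.
def check_integrity(component_hashes, stored_meta):
    """component_hashes: hashes known for the component, e.g. {"sha256": "..."}.
    stored_meta: row from the metadata table, or None if not yet fetched."""
    if stored_meta is None:
        # No stored hashes yet: the caller should request the API call.
        return "METADATA_MISSING"
    matched = False
    for algo in ("md5", "sha1", "sha256"):
        ours = component_hashes.get(algo)
        theirs = stored_meta.get(algo)
        if ours and theirs:
            if ours.lower() != theirs.lower():
                return "HASH_MISMATCH"
            matched = True
    return "HASH_MATCH" if matched else "NO_COMMON_HASH"
```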
mehab commented 11 months ago

Based on the PR review for https://github.com/DependencyTrack/hyades/pull/727, we wanted to also store the published-date information from the same endpoint that is used to fetch integrity information for packages. However, per the current design, the integrity check is an optional check that the user can enable/disable for a repo of their choice; the integrity check is then performed for components with the configured repo as the source of truth. This is viable for integrity checks alone. But the published date is a field we want for all components, all the time, and it is not optional. Also, the repository from which the component is actually fetched should be used to get its published date. Thus we cannot use the integrity-check external call as-is to support the published-date feature. There are a few options for how we could get the published date:

  1. Use separate repos to get published-date information for packages. For example, we could use Maven Central to get the published date for Maven packages. In this case, the user can configure only one source from which the published date would be fetched; if the user does not select any source, we default to Maven Central. Here, the integrity-check info fetch is separate from the published-date config. Artifactory could also be used in this sense.
  2. Use deps.dev as the central source of truth for both the integrity check and the published date. In this case, the user configures the integrity check as currently available, but the published date is always fetched for a component, if new, from deps.dev; this is not configurable by the user. The downside is that the published-date fetch is an additional call to a similar endpoint if it is not the same configuration as the one used for the integrity check. The date was found to be part of the HEAD-request response, as was done for the integrity check. The header is x-modified.. to be looked for in this call.
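Both options rely on inspecting the response headers of a HEAD request. The parser below is a sketch only: the checksum header names (`X-Checksum-*`) are an assumption, since different repositories expose different headers, and the thread itself leaves the published-date header name open ("x-modified.."); `Last-Modified` is used here as a stand-in.

```python
# Illustrative parser for HEAD-response headers. The X-Checksum-* and
# Last-Modified header names are assumptions, not confirmed by the thread.
from email.utils import parsedate_to_datetime

def parse_meta_headers(headers):
    h = {k.lower(): v for k, v in headers.items()}  # headers are case-insensitive
    meta = {}
    for algo in ("md5", "sha1", "sha256"):
        value = h.get(f"x-checksum-{algo}")
        if value:
            meta[algo] = value
    if "last-modified" in h:
        meta["last_modified"] = parsedate_to_datetime(h["last-modified"])
    return meta
```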
mehab commented 11 months ago

Meeting notes from discussion on September 18 2023.

Use Cases

Integrity Verification

Multiple Repositories for integrity verification

pkg:maven/com.citi/citi-lib → Artifactory (Internal)
pkg:maven/org.springframework/spring-core → Maven Central

Published Date

dtrack.repo-meta-analysis.component

VithikaS commented 11 months ago
```sql
CREATE TABLE IF NOT EXISTS public."COMPONENT_METADATA"
(
    "ID" bigint NOT NULL,
    "PURL" character varying(1024) NOT NULL,
    "MD5" character varying(1024),
    "SHA1_HASH" character varying(1024),
    "SHA256_HASH" character varying(1024),
    "PUBLISHED_AT" timestamp with time zone,
    "LAST_FETCH" timestamp with time zone,
    "STATUS" character varying(255),
    CONSTRAINT "COMPONENT_METADATA_PK" PRIMARY KEY ("ID")
);
```

STATUS: possible values are PROCESSED and TIMED_OUT; additional values may be added.

Insert into the COMPONENT_METADATA table with a SELECT DISTINCT of purls and internal from the COMPONENT table, in one transaction.

TODO: check whether Postgres's COPY command, or some other better approach, can be used. This assumes the copying of PURLs from the COMPONENT table to the COMPONENT_METADATA table happens in one transaction, so either all required PURLs are copied or none are. Any count higher than 0 would then mean the PURLs were already copied over from the COMPONENT table.

SELECT INTO is much faster than INSERT .. SELECT; a hint could be provided to use a table-level lock to improve the performance of INSERT .. SELECT. However, INSERT .. SELECT applies a table-level lock on the table rows are selected from, so executing it in parallel with a BOM upload can have a significant performance impact.

With a small load of ~6k distinct PURLs, the INSERT INTO .. SELECT execution took ~409 ms.
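The one-transaction copy-with-count-guard described above can be sketched with SQLite standing in for Postgres. This is a sketch only: the real schema, COPY behaviour, and locking semantics are Postgres-specific, and the table columns here are trimmed to the two being copied.

```python
# Sketch of the guarded, single-transaction PURL copy. SQLite is used
# here only so the example is runnable; the real target is Postgres.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE COMPONENT ("PURL" TEXT, "INTERNAL" INTEGER)')
conn.execute('CREATE TABLE COMPONENT_METADATA '
             '("ID" INTEGER PRIMARY KEY, "PURL" TEXT, "INTERNAL" INTEGER)')
conn.executemany('INSERT INTO COMPONENT VALUES (?, ?)',
                 [("pkg:maven/a/b@1.0", 0),
                  ("pkg:maven/a/b@1.0", 0),   # duplicate, deduped by DISTINCT
                  ("pkg:npm/c@2.0", 0)])

# One transaction: either all distinct PURLs land, or none do.
with conn:
    count = conn.execute('SELECT COUNT(*) FROM COMPONENT_METADATA').fetchone()[0]
    if count == 0:  # count > 0 would mean the copy already ran
        conn.execute('INSERT INTO COMPONENT_METADATA ("PURL", "INTERNAL") '
                     'SELECT DISTINCT "PURL", "INTERNAL" FROM COMPONENT')

rows = conn.execute('SELECT COUNT(*) FROM COMPONENT_METADATA').fetchone()[0]
```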

COPY is the most optimal way of copying bulk data in Postgres. Per the Postgres documentation:

> Note that loading a large number of rows using COPY is almost always faster than using INSERT, even if PREPARE is used and multiple insertions are batched into a single transaction.

> COPY is fastest when used within the same transaction as an earlier CREATE TABLE or TRUNCATE command. In such cases no WAL needs to be written, because in case of an error, the files containing the newly loaded data will be removed anyway.

However, COPY is used as COPY FROM a file or COPY TO a file, so it may not be the ideal choice here.

It would be better to use a changelog topic and a state store, so that in case of a restart the in-memory state store is reconstructed from the changelog topic.
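The changelog-backed recovery mentioned above is how a Kafka Streams state store works; the sketch below models the replay with plain Python, treating the "topic" as an ordered list of (key, value) records and a `None` value as a tombstone. All names are illustrative.

```python
# Sketch of rebuilding an in-memory state store from a changelog topic
# after a restart. A None value acts as a tombstone (deletion marker).
def rebuild_state_store(changelog):
    store = {}
    for key, value in changelog:  # replay records in offset order
        if value is None:
            store.pop(key, None)
        else:
            store[key] = value    # later records overwrite earlier ones
    return store
```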

sahibamittal commented 11 months ago

Task Highlights

sahibamittal commented 11 months ago

The AnalysisResult proto should include fields for component metadata:

```proto
message AnalysisResult {
  // The component this result is for.
  Component component = 1;

  // Identifier of the repository where the result was found.
  optional string repository = 2;

  // Latest version of the component.
  optional string latest_version = 3;

  // When the latest version was published.
  optional google.protobuf.Timestamp published = 4;

  // Integrity metadata of the component.
  optional IntegrityMeta integrity_meta = 5;
}
```

Logic for handling the result at the apiserver:

- integrityMeta not set → integrity metadata was not fetched.
- integrityMeta set && (its hashes or date not set) → integrity metadata was fetched but was not available, or an error occurred.

```proto
message IntegrityMeta {
  optional string md5 = 1;
  optional string sha1 = 2;
  optional string sha256 = 3;
  optional string sha512 = 4;
  // When the component's current version was last modified.
  optional google.protobuf.Timestamp current_version_last_modified = 5;
  // Complete URL to fetch integrity metadata of the component.
  optional string meta_source_url = 6;
}
```
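The apiserver-side interpretation above can be sketched as follows. The proto message is modeled as a plain dict keyed by the IntegrityMeta field names; the status strings are illustrative, not actual Hyades identifiers.

```python
# Sketch of the result-handling logic: not fetched, fetched-but-incomplete,
# or fully fetched. Status strings are illustrative.
def interpret_result(integrity_meta):
    if integrity_meta is None:
        return "NOT_FETCHED"
    has_hash = any(integrity_meta.get(a) for a in ("md5", "sha1", "sha256", "sha512"))
    has_date = integrity_meta.get("current_version_last_modified") is not None
    if not has_hash or not has_date:
        # Metadata was fetched but is not usable (missing hashes or date).
        return "FETCHED_BUT_INCOMPLETE"
    return "FETCHED"
```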

The AnalysisCommand proto:

```proto
message AnalysisCommand {
  // The component that shall be analyzed.
  Component component = 1;
  bool fetch_integrity_data = 2;
  bool fetch_latest_version = 3;
}
```

AnalysisCommand notes:

fetch_latest_version flag → maps latest_version and latest_version_published.
fetch_integrity_data flag → maps current_version_published and the hashes info.

  1. initializer → [fetch_integrity_data=true, fetch_latest_version=false]
  2. bom-upload → if integrity data doesn't exist, then [fetch_integrity_data=true, fetch_latest_version=true]; else [fetch_integrity_data=false, fetch_latest_version=true]
  3. scheduled analysis → [fetch_integrity_data=false, fetch_latest_version=true]
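The trigger-to-flags mapping listed above can be sketched as a small function. The trigger names are shorthand for the three scenarios, not actual Hyades identifiers.

```python
# Sketch of the AnalysisCommand flag mapping for the three triggers.
def analysis_command_flags(trigger, integrity_data_exists=False):
    if trigger == "initializer":
        return {"fetch_integrity_data": True, "fetch_latest_version": False}
    if trigger == "bom-upload":
        # Skip the integrity fetch when the data is already stored.
        return {"fetch_integrity_data": not integrity_data_exists,
                "fetch_latest_version": True}
    if trigger == "scheduled-analysis":
        return {"fetch_integrity_data": False, "fetch_latest_version": True}
    raise ValueError(f"unknown trigger: {trigger}")
```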
mehab commented 11 months ago

A point to note is the case where a user changes the repository after having already supplied one for a package type. If this happens, the projects/components for which integrity information has already been fetched will currently not be refreshed with the new information. To support this, keeping in mind that we only support one repository at a time for a given package type and do not act as a mirror for multiple repositories, we could refactor the initializer code to be triggered whenever the user changes the repository URL, refreshing the information for all existing components. Newer projects and components would then be filled in by the existing functionality.
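One way to realize the refresh-on-repository-change idea above is to mark all stored metadata rows of the affected package type as stale, so the initializer re-fetches them on its next run. Everything in this sketch is an assumption: the row shape, the STALE status, and the function name are illustrative only.

```python
# Hypothetical sketch: mark metadata rows of one package type as stale
# when the configured repository URL for that type changes.
def mark_stale_on_repo_change(metadata_rows, package_type, new_repo_url):
    refreshed = []
    for row in metadata_rows:
        if row["purl"].startswith(f"pkg:{package_type}/"):
            # Stale rows get re-fetched from the new repository.
            row = {**row, "status": "STALE", "repo_url": new_repo_url}
        refreshed.append(row)
    return refreshed
```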

mehab commented 10 months ago

The changes have now been completed, hence closing the issue.