DependencyTrack / hyades

Incubating project for decoupling responsibilities from Dependency-Track's monolithic API server into separate, scalable services.
https://dependencytrack.github.io/hyades/latest
Apache License 2.0

Support component integrity verification #699

Closed mehab closed 10 months ago

mehab commented 1 year ago

This issue expands on an issue in upstream Dependency-Track. An initial POC for this has been completed and demoed using hyades-apiserver and hyades. The features below need to be addressed as part of the actual implementation:

### Tasks
- [ ] Allow integrity checks for Maven components
- [ ] The integrity check would be enabled/disabled universally with a single feature flag/config in the apiserver
- [ ] The meta information, along with the published date and hashes, would be sent back from hyades to the apiserver in one message on the same topic as the meta analysis result
- [ ] When the published date for a component is already present in the DB, we do not need to make the API call for it
- [ ] The component hashes and published date would be stored in a separate table
- [ ] This table would be used to perform the hash check on components when the info is present; if not, an API call would be requested
- [ ] Regardless of whether the integrity check is enabled or disabled, we would fetch the component information and store it in the DB. If the integrity check is enabled, the analysis would also be performed and stored; if not enabled, the analysis would be skipped
- [ ] Implement this functionality for npm and PyPI as well. The initial understanding is that the source of truth used to validate integrity information uses a HEAD request to the dependency files to reveal their checksum information. This will be expanded on in the future
- [ ] In addition to BOM upload, this check should also happen on re-analysis and on manual addition of components
- [ ] Project deletion should still complete successfully
- [ ] Unit tests for the entire functionality
- [ ] Future enhancement: validate whether a given repo URL supports integrity checks
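The hash-check step described in the tasks above can be sketched as follows. This is an illustrative helper, not the actual Hyades implementation: the function name, the status strings, and the shape of the metadata row are all assumptions. It compares the hashes known for a component against the hashes stored from the repository, and signals when no stored metadata exists so the caller knows an API call is needed.

```python
# Hypothetical sketch of the integrity-check decision; names and
# status values are illustrative, not the actual Hyades code.
def check_integrity(component_hashes, stored_meta):
    """component_hashes: hashes known for the component, e.g. {"sha256": "..."}.
    stored_meta: row from the metadata table, or None if not yet fetched."""
    if stored_meta is None:
        # No stored hashes yet: the caller should request the API call.
        return "METADATA_MISSING"
    matched = False
    for algo in ("md5", "sha1", "sha256"):
        ours = component_hashes.get(algo)
        theirs = stored_meta.get(algo)
        if ours and theirs:
            if ours.lower() != theirs.lower():
                return "HASH_MISMATCH"
            matched = True
    return "HASH_MATCH" if matched else "NO_COMMON_HASH"
```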
mehab commented 11 months ago

Based on the PR review for https://github.com/DependencyTrack/hyades/pull/727, we wanted to also store the published-date information from the same endpoint that is used to fetch integrity information for packages. However, per the current design, the integrity check is an optional check that the user can enable/disable for a repo of their choice; the integrity check is then performed for components with the configured repo as the source of truth. This is viable for integrity checks alone. But the published date is a field we want for all components, all the time, and it is not optional. Also, the repository from which the component is actually fetched should be used to get its published date. Thus we cannot use the integrity-check external call as-is to support the published-date feature. There are a few options for how we could get the published date:

  1. Use separate repos to get published-date information for packages. For example, we could use Maven Central to get the published date for Maven packages. In this case, the user can configure only one source from which the published date would be fetched; if the user does not select any source, we default to Maven Central. Here, the integrity-check info fetch is separate from the published-date config. Artifactory could also be used in this sense.
  2. Use deps.dev as the central source of truth for both the integrity check and the published date. In this case, the user configures the integrity check as currently available, but the published date is always fetched for a component, if new, from deps.dev; this is not configurable by the user. The downside is that the published-date fetch is an additional call to a similar endpoint if it is not the same configuration as the one used for the integrity check. The date was found to be part of the HEAD-request response, as was done for the integrity check. The header is x-modified.. to be looked for in this call.
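Both options rely on inspecting the response headers of a HEAD request. The parser below is a sketch only: the checksum header names (`X-Checksum-*`) are an assumption, since different repositories expose different headers, and the thread itself leaves the published-date header name open ("x-modified.."); `Last-Modified` is used here as a stand-in.

```python
# Illustrative parser for HEAD-response headers. The X-Checksum-* and
# Last-Modified header names are assumptions, not confirmed by the thread.
from email.utils import parsedate_to_datetime

def parse_meta_headers(headers):
    h = {k.lower(): v for k, v in headers.items()}  # headers are case-insensitive
    meta = {}
    for algo in ("md5", "sha1", "sha256"):
        value = h.get(f"x-checksum-{algo}")
        if value:
            meta[algo] = value
    if "last-modified" in h:
        meta["last_modified"] = parsedate_to_datetime(h["last-modified"])
    return meta
```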
mehab commented 11 months ago

Meeting notes from discussion on September 18 2023.

Use Cases

Integrity Verification

Multiple Repositories for integrity verification

pkg:maven/com.citi/citi-lib → Artifactory (Internal)
pkg:maven/org.springframework/spring-core → Maven Central

Published Date

dtrack.repo-meta-analysis.component

VithikaS commented 11 months ago
```sql
CREATE TABLE IF NOT EXISTS public."COMPONENT_METADATA"
(
    "ID" bigint NOT NULL,
    "PURL" character varying(1024) NOT NULL,
    "MD5" character varying(1024),
    "SHA1_HASH" character varying(1024),
    "SHA256_HASH" character varying(1024),
    "PUBLISHED_AT" timestamp with time zone,
    "LAST_FETCH" timestamp with time zone,
    "STATUS" character varying(255),
    CONSTRAINT "COMPONENT_METADATA_PK" PRIMARY KEY ("ID")
);
```

STATUS: possible values are PROCESSED and TIMED_OUT; additional values may be added.

Insert into the COMPONENT_METADATA table with a SELECT DISTINCT of purls and internal from the COMPONENT table, in one transaction.

TODO: check whether Postgres's COPY command, or some other better approach, can be used. This assumes the copying of PURLs from the COMPONENT table to the COMPONENT_METADATA table happens in one transaction, so either all required PURLs are copied or none are. Any count higher than 0 would then mean the PURLs were already copied over from the COMPONENT table.

SELECT INTO is much faster than INSERT .. SELECT; a hint could be provided to use a table-level lock to improve the performance of INSERT .. SELECT. However, INSERT .. SELECT applies a table-level lock on the table rows are selected from, so executing it in parallel with a BOM upload can have a significant performance impact.

With a small load of ~6k distinct PURLs, the INSERT INTO .. SELECT execution took ~409 ms.
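The one-transaction copy-with-count-guard described above can be sketched with SQLite standing in for Postgres. This is a sketch only: the real schema, COPY behaviour, and locking semantics are Postgres-specific, and the table columns here are trimmed to the two being copied.

```python
# Sketch of the guarded, single-transaction PURL copy. SQLite is used
# here only so the example is runnable; the real target is Postgres.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE COMPONENT ("PURL" TEXT, "INTERNAL" INTEGER)')
conn.execute('CREATE TABLE COMPONENT_METADATA '
             '("ID" INTEGER PRIMARY KEY, "PURL" TEXT, "INTERNAL" INTEGER)')
conn.executemany('INSERT INTO COMPONENT VALUES (?, ?)',
                 [("pkg:maven/a/b@1.0", 0),
                  ("pkg:maven/a/b@1.0", 0),   # duplicate, deduped by DISTINCT
                  ("pkg:npm/c@2.0", 0)])

# One transaction: either all distinct PURLs land, or none do.
with conn:
    count = conn.execute('SELECT COUNT(*) FROM COMPONENT_METADATA').fetchone()[0]
    if count == 0:  # count > 0 would mean the copy already ran
        conn.execute('INSERT INTO COMPONENT_METADATA ("PURL", "INTERNAL") '
                     'SELECT DISTINCT "PURL", "INTERNAL" FROM COMPONENT')

rows = conn.execute('SELECT COUNT(*) FROM COMPONENT_METADATA').fetchone()[0]
```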

COPY is the most optimal way of copying bulk data in Postgres. Per the Postgres documentation:

> Note that loading a large number of rows using COPY is almost always faster than using INSERT, even if PREPARE is used and multiple insertions are batched into a single transaction.

> COPY is fastest when used within the same transaction as an earlier CREATE TABLE or TRUNCATE command. In such cases no WAL needs to be written, because in case of an error, the files containing the newly loaded data will be removed anyway.

However, COPY is used as COPY FROM a file or COPY TO a file, so it may not be the ideal choice here.

It would be better to use a changelog topic and a state store, so that in case of a restart the in-memory state store is reconstructed from the changelog topic.
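The changelog-backed recovery mentioned above is how a Kafka Streams state store works; the sketch below models the replay with plain Python, treating the "topic" as an ordered list of (key, value) records and a `None` value as a tombstone. All names are illustrative.

```python
# Sketch of rebuilding an in-memory state store from a changelog topic
# after a restart. A None value acts as a tombstone (deletion marker).
def rebuild_state_store(changelog):
    store = {}
    for key, value in changelog:  # replay records in offset order
        if value is None:
            store.pop(key, None)
        else:
            store[key] = value    # later records overwrite earlier ones
    return store
```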

sahibamittal commented 11 months ago

Task Highlights

sahibamittal commented 11 months ago

The AnalysisResult proto should include fields for component metadata:

```proto
message AnalysisResult {
  // The component this result is for.
  Component component = 1;

  // Identifier of the repository where the result was found.
  optional string repository = 2;

  // Latest version of the component.
  optional string latest_version = 3;

  // When the latest version was published.
  optional google.protobuf.Timestamp published = 4;

  // Integrity metadata of the component.
  optional IntegrityMeta integrity_meta = 5;
}
```

Logic for handling the result at the apiserver:

- integrityMeta not set → integrity metadata was not fetched.
- integrityMeta set && (its hashes or date not set) → integrity metadata was fetched but was not available, or an error occurred.

```proto
message IntegrityMeta {
  optional string md5 = 1;
  optional string sha1 = 2;
  optional string sha256 = 3;
  optional string sha512 = 4;
  // When the component's current version was last modified.
  optional google.protobuf.Timestamp current_version_last_modified = 5;
  // Complete URL to fetch integrity metadata of the component.
  optional string meta_source_url = 6;
}
```
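The apiserver-side interpretation above can be sketched as follows. The proto message is modeled as a plain dict keyed by the IntegrityMeta field names; the status strings are illustrative, not actual Hyades identifiers.

```python
# Sketch of the result-handling logic: not fetched, fetched-but-incomplete,
# or fully fetched. Status strings are illustrative.
def interpret_result(integrity_meta):
    if integrity_meta is None:
        return "NOT_FETCHED"
    has_hash = any(integrity_meta.get(a) for a in ("md5", "sha1", "sha256", "sha512"))
    has_date = integrity_meta.get("current_version_last_modified") is not None
    if not has_hash or not has_date:
        # Metadata was fetched but is not usable (missing hashes or date).
        return "FETCHED_BUT_INCOMPLETE"
    return "FETCHED"
```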

The AnalysisCommand proto:

```proto
message AnalysisCommand {
  // The component that shall be analyzed.
  Component component = 1;
  bool fetch_integrity_data = 2;
  bool fetch_latest_version = 3;
}
```

AnalysisCommand notes:

fetch_latest_version flag → maps latest_version and latest_version_published.
fetch_integrity_data flag → maps current_version_published and the hashes info.

  1. initializer → [fetch_integrity_data=true, fetch_latest_version=false]
  2. bom-upload → if integrity data doesn't exist, then [fetch_integrity_data=true, fetch_latest_version=true]; else [fetch_integrity_data=false, fetch_latest_version=true]
  3. scheduled analysis → [fetch_integrity_data=false, fetch_latest_version=true]
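The trigger-to-flags mapping listed above can be sketched as a small function. The trigger names are shorthand for the three scenarios, not actual Hyades identifiers.

```python
# Sketch of the AnalysisCommand flag mapping for the three triggers.
def analysis_command_flags(trigger, integrity_data_exists=False):
    if trigger == "initializer":
        return {"fetch_integrity_data": True, "fetch_latest_version": False}
    if trigger == "bom-upload":
        # Skip the integrity fetch when the data is already stored.
        return {"fetch_integrity_data": not integrity_data_exists,
                "fetch_latest_version": True}
    if trigger == "scheduled-analysis":
        return {"fetch_integrity_data": False, "fetch_latest_version": True}
    raise ValueError(f"unknown trigger: {trigger}")
```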
mehab commented 11 months ago

A point to note is the case where a user changes the repository after having already supplied one for a package type. If this happens, the projects/components for which integrity information has already been fetched will currently not be refreshed with the new information. To support this, keeping in mind that we only support one repository at a time for a given package type and do not act as a mirror for multiple repositories, we could refactor the initializer code to be triggered whenever the user changes the repository URL, refreshing the information for all existing components. Newer projects and components would then be filled in by the existing functionality.
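One way to realize the refresh-on-repository-change idea above is to mark all stored metadata rows of the affected package type as stale, so the initializer re-fetches them on its next run. Everything in this sketch is an assumption: the row shape, the STALE status, and the function name are illustrative only.

```python
# Hypothetical sketch: mark metadata rows of one package type as stale
# when the configured repository URL for that type changes.
def mark_stale_on_repo_change(metadata_rows, package_type, new_repo_url):
    refreshed = []
    for row in metadata_rows:
        if row["purl"].startswith(f"pkg:{package_type}/"):
            # Stale rows get re-fetched from the new repository.
            row = {**row, "status": "STALE", "repo_url": new_repo_url}
        refreshed.append(row)
    return refreshed
```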

mehab commented 10 months ago

The changes have now been completed, hence closing the issue.