Based on the PR review for https://github.com/DependencyTrack/hyades/pull/727, we also wanted to store the published date from the same endpoint that is used to fetch integrity information for packages. However, per the current design, the integrity check is optional: the user can enable or disable it for a repository of their choice, and the check is then performed for components using the configured repository as the source of truth. This is viable for integrity checks alone. But the published date is a field we want for all components, all the time, and is not optional. Also, the repository from which a component is actually fetched should be the one used to get its published date. Thus we cannot use the integrity-check external call as-is to support the published-date feature. There are a few options for how we could get the published date too:
Meeting notes from discussion on September 18 2023.
Use Cases
Integrity Verification
Multiple Repositories for integrity verification
pkg:maven/com.citi/citi-lib → Artifactory (Internal)
pkg:maven/org.springframework/spring-core → Maven Central
Published Date
dtrack.repo-meta-analysis.component
```sql
CREATE TABLE IF NOT EXISTS public."COMPONENT_METADATA"
(
    "ID" bigint NOT NULL,
    "PURL" character varying(1024) NOT NULL,
    "MD5" character varying(1024),
    "SHA1_HASH" character varying(1024),
    "SHA256_HASH" character varying(1024),
    "PUBLISHED_AT" timestamp with time zone,
    "LAST_FETCH" timestamp with time zone,
    "STATUS" character varying(255),
    CONSTRAINT "COMPONENT_METADATA_PK" PRIMARY KEY ("ID")
)
```
STATUS: possible values are PROCESSED and TIMED_OUT; additional values may be added.
Create indexes on PURL, LAST_FETCH, and PUBLISHED_AT.
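A minimal sketch of those indexes, assuming the table definition above (the index names are hypothetical):

```sql
-- Hypothetical index names; columns per the COMPONENT_METADATA definition above.
CREATE INDEX "COMPONENT_METADATA_PURL_IDX" ON public."COMPONENT_METADATA" ("PURL");
CREATE INDEX "COMPONENT_METADATA_LAST_FETCH_IDX" ON public."COMPONENT_METADATA" ("LAST_FETCH");
CREATE INDEX "COMPONENT_METADATA_PUBLISHED_AT_IDX" ON public."COMPONENT_METADATA" ("PUBLISHED_AT");
```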
Data model and queries for new table in api-server
[x] 2. @sahibamittal Create an Initialiser that will kick off the update of the new table with hash information and the publishedAt date on application startup
Initialiser
Select count(ID) from the COMPONENT table. If the count is 0, exit the Initialiser flow. This will only happen when DT is deployed on a fresh database with no data.
Select count(ID) from the COMPONENT_METADATA table.
Insert into the COMPONENT_METADATA table by selecting the DISTINCT purls (and the internal flag) from the COMPONENT table, in one transaction.
TODO: check whether the Postgres COPY command, or some other approach, would be a better way of doing this. The assumption is that copying purls from the COMPONENT table to the COMPONENT_METADATA table happens in one transaction, so either all required purls are copied or none are. Any count higher than 0 would then mean the purls were already copied over from the COMPONENT table.
SELECT INTO is much faster than INSERT .. SELECT. A hint could be provided to take a table-level lock to improve the performance of INSERT .. SELECT.
INSERT .. SELECT applies a table-level lock on the table the rows are selected from, so executing it in parallel with a BOM upload can have a significant performance impact.
With a small load of ~6k distinct purls, the INSERT INTO .. SELECT execution took ~409 ms.
COPY is the most optimal option for copying bulk data in Postgres. As per the Postgres documentation:
> Note that loading a large number of rows using COPY is almost always faster than using INSERT, even if PREPARE is used and multiple insertions are batched into a single transaction.
> COPY is fastest when used within the same transaction as an earlier CREATE TABLE or TRUNCATE command. In such cases no WAL needs to be written, because in case of an error, the files containing the newly loaded data will be removed anyway.
However, COPY is designed to COPY FROM a file or COPY TO a file, so it may not be the ideal choice here.
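Under the INSERT .. SELECT approach above, the single-transaction copy step could be sketched as follows (the column list and the sequence used to populate "ID" are assumptions; they would depend on the final schema):

```sql
-- Hypothetical sketch of the initialiser's copy step. Running it in one
-- transaction means either all distinct purls are copied or none are.
BEGIN;
INSERT INTO public."COMPONENT_METADATA" ("ID", "PURL")
SELECT nextval('"COMPONENT_METADATA_ID_SEQ"'), "PURL"  -- hypothetical sequence
FROM (SELECT DISTINCT "PURL"
      FROM public."COMPONENT"
      WHERE "PURL" IS NOT NULL) AS distinct_purls;
COMMIT;
```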
Fetch components from the table in batches of 5000. We already fetch pages of components elsewhere, e.g. when performing portfolio repo meta analysis, so this should be similar or the same.
"Where" clause to fetch components should check that LAST_FETCH time is either null or an hour before current time, purl hashes, published date should be null as well. Alternatively, this could be checked with STATUS field. LAST_FETCH is to prevent same purl is not selected twice in short duration to fetch metadata. STATUS field in table will be updated even when we get no data from any of the configured repositories. This is to prevent resending those purls once all configured repositories have been queried to get metadata.
[x] 3. @sahibamittal Repo Meta Analyser - read commands from the dtrack.repo-meta-analysis.component topic to get component metadata. This does not include the latest version
[x] 4. Api-server - synchronize Repository Meta Component should update COMPONENT_METADATA table with results. If integrity check is globally enabled, it will be evaluated
When updating the database, batching could be considered to keep it more efficient.
We could potentially lose records if we batch and the api-server restarts before the records are committed to the database. Those purls would be picked up again to fetch data, so we would recover eventually, but this results in extra work and should be avoided.
It would be better to use a changelog topic and state store, so that in case of a restart the in-memory state store is reconstructed from the changelog topic.
If we go on to batch records for other database update operations in the api-server, we could consider writing our own Kafka consumer. That would let us commit offsets only after the database changes have been committed. This was agreed to be considered separately.
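The batching idea discussed above can be sketched as a small in-memory buffer that flushes to the database in one transaction per batch. This is a hypothetical sketch, not the api-server's actual code; the flush callback stands in for the real database transaction, and flushing before committing Kafka offsets is what avoids losing buffered records on restart.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical batcher: collects analysis results and hands them to a flush
// action (e.g. one DB transaction per batch) once the batch size is reached.
class MetadataBatcher<T> {
    private final int batchSize;
    private final Consumer<List<T>> flushAction;
    private final List<T> buffer = new ArrayList<>();

    MetadataBatcher(int batchSize, Consumer<List<T>> flushAction) {
        this.batchSize = batchSize;
        this.flushAction = flushAction;
    }

    void add(T record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Flush any buffered records. Calling this before committing Kafka offsets
    // ensures a restart cannot lose records that were only held in memory.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        flushAction.accept(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```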
[ ] 5. @mehab On BOM upload, the api-server should send an event on the Kafka topic only if metadata is not present in the database. Otherwise it should send an event to fetch just the latest version
[ ] Api-server to perform the integrity check if it is globally enabled
The AnalysisResult Proto should include fields for component metadata:
```proto
message AnalysisResult {
  // The component this result is for.
  Component component = 1;
  // Identifier of the repository where the result was found.
  optional string repository = 2;
  // Latest version of the component.
  optional string latest_version = 3;
  // When the latest version was published.
  optional google.protobuf.Timestamp published = 4;
  // Integrity metadata of the component.
  optional IntegrityMeta integrity_meta = 5;
}
```
Logic for interpreting the result at the api-server:
- integrityMeta not set → integrity metadata was not fetched.
- integrityMeta set, but its hashes or published date not set → integrity metadata was fetched but was not available, or the fetch encountered an error.
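That interpretation can be sketched as a small helper; the names and the status enum here are hypothetical, and the real api-server logic may differ:

```java
// Hypothetical interpreter for an AnalysisResult's integrity metadata:
// missing integrityMeta means it was never fetched; present but missing
// hashes or published date means it was fetched but unavailable/errored.
class IntegrityResultInterpreter {
    enum Status { NOT_FETCHED, FETCHED_UNAVAILABLE, FETCHED_OK }

    static Status interpret(boolean integrityMetaSet, boolean hashesSet, boolean dateSet) {
        if (!integrityMetaSet) {
            return Status.NOT_FETCHED;
        }
        if (!hashesSet || !dateSet) {
            return Status.FETCHED_UNAVAILABLE;
        }
        return Status.FETCHED_OK;
    }
}
```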
```proto
message IntegrityMeta {
  optional string md5 = 1;
  optional string sha1 = 2;
  optional string sha256 = 3;
  optional string sha512 = 4;
  // When the component's current version was last modified.
  optional google.protobuf.Timestamp current_version_last_modified = 5;
  // Complete URL to fetch integrity metadata of the component.
  optional string meta_source_url = 6;
}
```
AnalysisCommand Proto:

```proto
message AnalysisCommand {
  // The component that shall be analyzed.
  Component component = 1;
  bool fetch_integrity_data = 2;
  bool fetch_latest_version = 3;
}
```
Analysis Command notes:
- fetch_latest_version flag → map latest_version and latest_version_published.
- fetch_integrity_data flag → map current_version_published and hashes info.
A point to note is a repository change by the user after they have already supplied a repository for a package type. Currently, if this happens, the projects/components for which integrity information has already been fetched will not be refreshed with the new information. To support this feature, keeping in mind that we only support one repository at a time for a given package type and do not act as a mirror for multiple repositories, we could refactor the initialiser code to be triggered whenever the user changes the repository URL and refresh the information for all existing components. Newer projects and components would then be filled in by the existing functionality.
The changes have now been completed, hence closing the issue.
This issue expands on an issue in upstream Dependency-Track. An initial POC for this has been completed and demoed using hyades-apiserver and hyades. The features in the checklist need to be addressed as part of the actual implementation.