LF-Engineering / insights-datasource-shared

Shared library used by insights-datasources
MIT License

implement new caching method #99

Closed fayazg closed 1 year ago

fayazg commented 1 year ago

Change the caching mechanism so that we keep one file per repository instead of a copy of every record.

Each connector will need to keep one file per event type. Each file will be a csv.gz file with the following general format:

timestamp,entity_id,source_entity_id,file_location,hash,orphaned 
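As a rough illustration of the format above, here is a minimal Python sketch of writing and reading such a gzip-compressed CSV. The function names `dump_cache` and `load_cache` are hypothetical, and the sketch assumes the column line is written as a header row and that all values are stored as plain strings; neither is confirmed by this issue.

```python
import csv
import gzip
import io

# Column order taken from the format line in this issue.
CACHE_COLUMNS = [
    "timestamp", "entity_id", "source_entity_id",
    "file_location", "hash", "orphaned",
]


def dump_cache(rows):
    """Serialize a list of row dicts to gzip-compressed CSV bytes.

    Assumption: the column names are written as a header row.
    """
    text = io.StringIO(newline="")
    writer = csv.DictWriter(text, fieldnames=CACHE_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return gzip.compress(text.getvalue().encode("utf-8"))


def load_cache(blob):
    """Parse gzip-compressed CSV bytes back into a list of row dicts."""
    text = gzip.decompress(blob).decode("utf-8")
    return list(csv.DictReader(io.StringIO(text, newline="")))
```

Working on bytes rather than file paths keeps the sketch independent of how the file is uploaded to or downloaded from S3.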

The cache will still be split between endpoints in the same tree structure as it is right now. The 0000-last-sync file can stay as it is and keep doing the tracking as it does currently.

s3://insights-v2-cache-{ENV}/cache/{CONNECTOR}/{ENDPOINT}/0000-last-sync
s3://insights-v2-cache-{ENV}/cache/{CONNECTOR}/{ENDPOINT}/{EVENT}-cache.csv
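The two path templates above can be filled in with a small helper. This is only a sketch: `cache_keys` is a hypothetical name, and the argument values in the usage line are made up for illustration.

```python
def cache_keys(env, connector, endpoint, event):
    """Build the two S3 locations from the path templates in this issue."""
    base = f"s3://insights-v2-cache-{env}/cache/{connector}/{endpoint}"
    return {
        "last_sync": f"{base}/0000-last-sync",
        "event_cache": f"{base}/{event}-cache.csv",
    }


# Hypothetical values, for illustration only.
keys = cache_keys("dev", "git", "example-endpoint", "commit")
```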

The biggest file here will be for the kernel projects, which have over 1 million commits; that will translate into a cache.csv file with over a million lines. That will end up being less than 100 MB, so the connector can hold the entire cache in memory if needed (during a full sync).
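Holding the whole cache in memory could look something like the sketch below. It assumes (this issue does not say so) that rows are looked up by `entity_id` and that the `hash` column is compared against a hash of the incoming record to decide whether it changed. `build_index` and `classify` are hypothetical names.

```python
def build_index(rows):
    """Map entity_id -> row, keeping the last row seen for each id."""
    return {row["entity_id"]: row for row in rows}


def classify(index, entity_id, content_hash):
    """Return 'new', 'changed', or 'unchanged' for an incoming record.

    Assumption: 'hash' in the cache is a hash of the record content.
    """
    cached = index.get(entity_id)
    if cached is None:
        return "new"
    if cached["hash"] != content_hash:
        return "changed"
    return "unchanged"
```

A plain dict gives constant-time lookups per record, which matters when a full sync walks over a million entries.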

In default run mode, the connector will run through the following steps:

In full sync mode, the connector will run through the following steps:

With these changes implemented, we will not have to worry about deleting millions of files on a full sync. All of the cache will be in a single compact file per endpoint, which is how the connectors run.

khalifapro commented 1 year ago