Change caching mechanism so that we keep one file per repository instead of a copy of every record

Each connector will need to keep one file per event type, stored as a csv.gz file with the following general format:

timestamp will be the sync timestamp (unix epoch)
entity_id will be the generated ID for the entity type; this can be a commit_id, pullrequest_id, issue_comment_id, or any other event type's entity ID generated by our generate functions
source_entity_id is the ID of the entity as pulled from the source
file_location is the filename/location in the DL s3 input bucket (datalake-input-{ENV})
hash will be the hash of the raw response before it is transformed into the lfx-events-schema schemas
orphaned will be a boolean (true|false) value that initially defaults to false; on subsequent runs it can be set to true when a full sync runs and the specific event can't be found
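The row layout above can be sketched as a small record type with csv.gz round-trip helpers. This is a minimal sketch in Python; the field names mirror the list above, but the helper functions are illustrative, not the connector's real API:

```python
import csv
import gzip
from dataclasses import dataclass


@dataclass
class CacheEntry:
    """One line of the per-endpoint cache.csv.gz file."""
    timestamp: int         # sync timestamp (unix epoch)
    entity_id: str         # generated ID for the entity type
    source_entity_id: str  # ID of the entity as pulled from the source
    file_location: str     # object key in the DL s3 input bucket
    hash: str              # hash of the raw response, pre-transformation
    orphaned: bool         # starts false; a full sync may flip it to true


def write_cache(path: str, entries: list[CacheEntry]) -> None:
    # gzip keeps even a million-line cache file compact on disk.
    with gzip.open(path, "wt", newline="") as f:
        writer = csv.writer(f)
        for e in entries:
            writer.writerow([e.timestamp, e.entity_id, e.source_entity_id,
                             e.file_location, e.hash, str(e.orphaned).lower()])


def read_cache(path: str) -> list[CacheEntry]:
    with gzip.open(path, "rt", newline="") as f:
        return [CacheEntry(int(r[0]), r[1], r[2], r[3], r[4], r[5] == "true")
                for r in csv.reader(f)]
```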
The cache will still be split between endpoints in the tree structure, as it is right now. 0000-last-sync can stay as it is and keep tracking as it does currently.
The biggest file here will be for the kernel project, which has over 1 million commits and will therefore translate into a cache.csv file with over a million lines. Even so, that file will be less than 100 MB, so the connector can hold the entire cache in memory if needed (during a full sync).
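The sub-100 MB claim can be sanity-checked by compressing a sample of synthetic cache lines and extrapolating. The field values and key layout below are hypothetical stand-ins sized like real IDs and hashes:

```python
import gzip
import hashlib


def estimated_cache_size_mb(n_lines: int, sample: int = 100_000) -> float:
    """Extrapolate the gzipped size of an n_lines cache file from a sample."""
    rows = "".join(
        f"17000{i:05d},commit-{i:07d},{i:040x},"
        f"datalake-input-dev/git/commits/{i}.json,"
        f"{hashlib.sha256(str(i).encode()).hexdigest()},false\n"
        for i in range(sample)
    )
    compressed = len(gzip.compress(rows.encode()))
    return compressed * n_lines / sample / 1e6
```

With incompressible sha256 hex digests dominating each line, a million-line file still lands comfortably under 100 MB once gzipped.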
In default run mode, the connector will run through the following steps:
check for the 0000-last-sync file of the endpoint and find out where to start pulling
get the cache.csv file(s) for the endpoint
start pulling data from source
for each entity, confirm data is not in the cache (compare hash of the raw data)
add to the list of objects that will be pushed to DL s3 input bucket
add to the cache
push DL s3 input file first
if successful, push the updated cache file and the updated 0000-last-sync file
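The default-run steps above can be sketched as follows. The storage and source helpers (`read_last_sync`, `pull_since`, `push_input_file`, and so on) are hypothetical stand-ins for the connector's real plumbing, and only the cache fields the sketch touches are shown:

```python
import hashlib
import json
import time


def default_run(endpoint, storage, source):
    """Sketch of default run mode: pull since last sync, skip cached hashes."""
    # Check the 0000-last-sync file to find out where to start pulling.
    last_sync = storage.read_last_sync(endpoint)
    # Get the cache.csv file(s) for the endpoint; index them by raw-data hash.
    cache = storage.read_cache(endpoint)
    known_hashes = {row["hash"] for row in cache}

    to_push = []
    for raw in source.pull_since(last_sync):
        # Confirm the data is not in the cache (compare hash of the raw data).
        h = hashlib.sha256(json.dumps(raw, sort_keys=True).encode()).hexdigest()
        if h in known_hashes:
            continue
        # Add to the list of objects that will be pushed, and to the cache.
        to_push.append(raw)
        cache.append({"timestamp": int(time.time()), "hash": h,
                      "orphaned": "false"})
        known_hashes.add(h)

    # Push the DL s3 input file first...
    storage.push_input_file(endpoint, to_push)
    # ...and only if that succeeds, the updated cache and last-sync marker.
    storage.push_cache(endpoint, cache)
    storage.push_last_sync(endpoint, int(time.time()))
```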
In full sync mode, the connector will run through the following steps:
ignore the 0000-last-sync file of the endpoint
get the cache.csv file(s) for the endpoint, load them into memory, and set orphaned to true for every line
start pulling data from the source starting from the beginning
for each entity, confirm data is in the cache (compare hash of the raw data)
add to the list of objects that will be pushed to DL s3 input bucket
update the cache entry's orphaned field to false, or add the NEW entry if it is not already in the cache
push DL s3 input file first
if successful, push the updated cache file and the updated 0000-last-sync file
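The full-sync steps can be sketched the same way: mark every cached row orphaned up front, then un-mark whatever the source still returns. The helper names are hypothetical again, and this sketch pushes only entities that are new to the cache (one reading of the steps above; pushing every entity would be a one-line change):

```python
import hashlib
import json
import time


def full_sync(endpoint, storage, source):
    """Sketch of full sync mode: orphan everything, un-orphan what's found."""
    # Ignore 0000-last-sync; load the cache and mark every line orphaned.
    cache = storage.read_cache(endpoint)
    for row in cache:
        row["orphaned"] = "true"
    by_hash = {row["hash"]: row for row in cache}

    to_push = []
    # Pull from the source starting from the beginning.
    for raw in source.pull_all():
        # Confirm the data is in the cache (compare hash of the raw data).
        h = hashlib.sha256(json.dumps(raw, sort_keys=True).encode()).hexdigest()
        if h in by_hash:
            by_hash[h]["orphaned"] = "false"
        else:
            # New entity: push it and add a fresh cache entry.
            to_push.append(raw)
            row = {"timestamp": int(time.time()), "hash": h,
                   "orphaned": "false"}
            cache.append(row)
            by_hash[h] = row

    # Push the DL s3 input file first, then the updated cache and marker.
    storage.push_input_file(endpoint, to_push)
    storage.push_cache(endpoint, cache)
    storage.push_last_sync(endpoint, int(time.time()))
```

Rows still marked `orphaned == "true"` after the loop are exactly the events the full sync could no longer find at the source.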
With these changes implemented, we will not have to worry about deleting millions of files on a full sync. The entire cache will live in a single compact file per endpoint, which matches how the connectors run.