Change caching mechanism so that we keep one file per repository instead of a copy of every record

Each connector will need to keep one file per event type, stored as a csv.gz file with the following general format:

timestamp will be the sync timestamp (unix epoch)
entity_id will be the generated ID for the entity type; this can be a commit_id, pullrequest_id, issue_comment_id, or any other event type's entity ID generated by our generate functions
source_entity_id is the ID of the entity as pulled from the source
file_location is the filename/location in the DL s3 input bucket (datalake-input-{ENV})
hash will be the hash of the raw response before it is transformed into the lfx-events-schema schemas
orphaned will be a boolean (true|false) value that initially defaults to false; on subsequent runs it can be set to true when a full sync runs and the specific event can't be found
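The row layout above can be sketched as a small record type with csv.gz round-trip helpers. This is a minimal sketch in Python; the field names mirror the list above, but the helper functions are illustrative, not the connector's real API:

```python
import csv
import gzip
from dataclasses import dataclass


@dataclass
class CacheEntry:
    """One line of the per-endpoint cache.csv.gz file."""
    timestamp: int         # sync timestamp (unix epoch)
    entity_id: str         # generated ID for the entity type
    source_entity_id: str  # ID of the entity as pulled from the source
    file_location: str     # object key in the DL s3 input bucket
    hash: str              # hash of the raw response, pre-transformation
    orphaned: bool         # starts false; a full sync may flip it to true


def write_cache(path: str, entries: list[CacheEntry]) -> None:
    # gzip keeps even a million-line cache file compact on disk.
    with gzip.open(path, "wt", newline="") as f:
        writer = csv.writer(f)
        for e in entries:
            writer.writerow([e.timestamp, e.entity_id, e.source_entity_id,
                             e.file_location, e.hash, str(e.orphaned).lower()])


def read_cache(path: str) -> list[CacheEntry]:
    with gzip.open(path, "rt", newline="") as f:
        return [CacheEntry(int(r[0]), r[1], r[2], r[3], r[4], r[5] == "true")
                for r in csv.reader(f)]
```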
The cache will still be split between endpoints in the tree structure, as it is right now. 0000-last-sync can stay as it is and keep tracking as it does currently.
The biggest file here will be for the kernel project, which has over 1 million commits and will therefore translate into a cache.csv file with over a million lines. Even so, that file will be less than 100 MB, so the connector can hold the entire cache in memory if needed (during a full sync).
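The sub-100 MB claim can be sanity-checked by compressing a sample of synthetic cache lines and extrapolating. The field values and key layout below are hypothetical stand-ins sized like real IDs and hashes:

```python
import gzip
import hashlib


def estimated_cache_size_mb(n_lines: int, sample: int = 100_000) -> float:
    """Extrapolate the gzipped size of an n_lines cache file from a sample."""
    rows = "".join(
        f"17000{i:05d},commit-{i:07d},{i:040x},"
        f"datalake-input-dev/git/commits/{i}.json,"
        f"{hashlib.sha256(str(i).encode()).hexdigest()},false\n"
        for i in range(sample)
    )
    compressed = len(gzip.compress(rows.encode()))
    return compressed * n_lines / sample / 1e6
```

With incompressible sha256 hex digests dominating each line, a million-line file still lands comfortably under 100 MB once gzipped.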
In default run mode, the connector will run through the following steps:
check for the 0000-last-sync file of the endpoint and find out where to start pulling
get the cache.csv file(s) for the endpoint
start pulling data from source
for each entity, confirm data is not in the cache (compare hash of the raw data)
add to the list of objects that will be pushed to DL s3 input bucket
add to the cache
push DL s3 input file first
if successful, push the updated cache file and the updated 0000-last-sync file
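The default-run steps above can be sketched as follows. The storage and source helpers (`read_last_sync`, `pull_since`, `push_input_file`, and so on) are hypothetical stand-ins for the connector's real plumbing, and only the cache fields the sketch touches are shown:

```python
import hashlib
import json
import time


def default_run(endpoint, storage, source):
    """Sketch of default run mode: pull since last sync, skip cached hashes."""
    # Check the 0000-last-sync file to find out where to start pulling.
    last_sync = storage.read_last_sync(endpoint)
    # Get the cache.csv file(s) for the endpoint; index them by raw-data hash.
    cache = storage.read_cache(endpoint)
    known_hashes = {row["hash"] for row in cache}

    to_push = []
    for raw in source.pull_since(last_sync):
        # Confirm the data is not in the cache (compare hash of the raw data).
        h = hashlib.sha256(json.dumps(raw, sort_keys=True).encode()).hexdigest()
        if h in known_hashes:
            continue
        # Add to the list of objects that will be pushed, and to the cache.
        to_push.append(raw)
        cache.append({"timestamp": int(time.time()), "hash": h,
                      "orphaned": "false"})
        known_hashes.add(h)

    # Push the DL s3 input file first...
    storage.push_input_file(endpoint, to_push)
    # ...and only if that succeeds, the updated cache and last-sync marker.
    storage.push_cache(endpoint, cache)
    storage.push_last_sync(endpoint, int(time.time()))
```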
In full sync mode, the connector will run through the following steps:
ignore the 0000-last-sync file of the endpoint
get the cache.csv file(s) for the endpoint, load them into memory, and set orphaned to true for every line
start pulling data from the source starting from the beginning
for each entity, confirm data is in the cache (compare hash of the raw data)
add to the list of objects that will be pushed to DL s3 input bucket
update the cache entry's orphaned field to false, or add the NEW entry if it is not already in the cache
push DL s3 input file first
if successful, push the updated cache file and the updated 0000-last-sync file
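The full-sync steps can be sketched the same way: mark every cached row orphaned up front, then un-mark whatever the source still returns. The helper names are hypothetical again, and this sketch pushes only entities that are new to the cache (one reading of the steps above; pushing every entity would be a one-line change):

```python
import hashlib
import json
import time


def full_sync(endpoint, storage, source):
    """Sketch of full sync mode: orphan everything, un-orphan what's found."""
    # Ignore 0000-last-sync; load the cache and mark every line orphaned.
    cache = storage.read_cache(endpoint)
    for row in cache:
        row["orphaned"] = "true"
    by_hash = {row["hash"]: row for row in cache}

    to_push = []
    # Pull from the source starting from the beginning.
    for raw in source.pull_all():
        # Confirm the data is in the cache (compare hash of the raw data).
        h = hashlib.sha256(json.dumps(raw, sort_keys=True).encode()).hexdigest()
        if h in by_hash:
            by_hash[h]["orphaned"] = "false"
        else:
            # New entity: push it and add a fresh cache entry.
            to_push.append(raw)
            row = {"timestamp": int(time.time()), "hash": h,
                   "orphaned": "false"}
            cache.append(row)
            by_hash[h] = row

    # Push the DL s3 input file first, then the updated cache and marker.
    storage.push_input_file(endpoint, to_push)
    storage.push_cache(endpoint, cache)
    storage.push_last_sync(endpoint, int(time.time()))
```

Rows still marked `orphaned == "true"` after the loop are exactly the events the full sync could no longer find at the source.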
With these changes implemented, we will not have to worry about deleting millions of files on a full sync. The entire cache will live in a single compact file per endpoint, which matches how the connectors run.