Problem Description
Right now, the "incremental sync" of Google Drive falls back to the default naive incremental sync implementation, where we have to fetch at least all document metadata and can only skip downloading files that already exist.
Google Drive incremental syncs do not use a "delta API" that would allow fetching only the documents that changed since the last sync. For example, filtering documents at the source with q=modifiedTime > syncCursor would leave much less file metadata to fetch and process during incremental syncs and would likely make them much shorter.
Once we have a "smart" implementation of incremental syncs, we can expect a big speedup for incremental syncs of massive datasets.
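A minimal sketch of how the source-side filter could be built, assuming the sync cursor is available as a datetime. The helper name build_modified_since_query is hypothetical and not part of the connector; the query syntax follows the Drive search format, which compares modifiedTime against an RFC 3339 timestamp (UTC by default).

```python
from datetime import datetime, timezone


def build_modified_since_query(sync_cursor: datetime) -> str:
    """Build a Drive `q` filter for files modified after `sync_cursor`.

    Hypothetical helper for illustration only.
    """
    # Assumption: a naive datetime is already expressed in UTC.
    if sync_cursor.tzinfo is None:
        sync_cursor = sync_cursor.replace(tzinfo=timezone.utc)
    # Normalize to UTC and format as an RFC 3339 timestamp without offset.
    timestamp = sync_cursor.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
    return f"modifiedTime > '{timestamp}'"
```

The resulting string would be passed as the q parameter of the file listing call, possibly combined with any existing filter using "and".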
Open questions
How to track deletes?
Probably we can use trashed=true together with some time property in the query to detect recently deleted docs - more investigation needed
According to the docs, trashedTime is populated only for files in a shared drive
For a personal drive, perhaps modifiedTime is sufficient (check whether modifiedTime can also be used for shared drives)
Make sure we can mark a doc's operation as deleted in the get_docs_incrementally function to signal to ES that the doc should be deleted - see the get_docs_incrementally doc
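One possible shape for the delete handling, as a sketch only. It assumes each listed file carries the id and trashed fields of the Drive file resource, and it uses hypothetical OP_INDEX and OP_DELETE constants in place of whatever operation values the framework actually defines. Whether trashing a file makes it show up in a modifiedTime-filtered listing is exactly the open question above, so this only covers the mapping step.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical operation markers; the real framework constants may differ.
OP_INDEX = "index"
OP_DELETE = "delete"


def operation_for(file_metadata: dict) -> str:
    """Choose the operation for one Drive file based on its `trashed` flag."""
    return OP_DELETE if file_metadata.get("trashed", False) else OP_INDEX


def to_doc_operations(files: Iterable[dict]) -> Iterator[Tuple[dict, str]]:
    """Yield (doc, operation) pairs for a batch of listed files."""
    for file_metadata in files:
        # Only the id is needed to delete a doc; indexing would use the full doc.
        yield {"_id": file_metadata["id"]}, operation_for(file_metadata)
```

For example, a listing that returns one live file and one trashed file would yield one index operation and one delete operation.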
Additional Context
It would be great to benchmark the new implementation to get a rough estimate of the speedup
You can ping @jedrazb once the feature branch is ready; I have a dataset of 80k files we can run the benchmark against
Proposed Solution
Implement the get_docs_incrementally function, using get_docs as a reference
Pass a q filter to the list_files and list_files_from_my_drive functions to fetch only docs that were modified recently (the last sync timestamp will be stored in sync_cursor) - see the list API docs
Keep sync_cursor up to date with the last sync timestamp
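The steps above can be sketched roughly as follows. This is not the connector's code: IncrementalSyncSketch, its constructor arguments, and run_sync are hypothetical names, and list_files is assumed to be an async generator that accepts a q keyword. The cursor is captured before listing starts so that files modified while the sync is running are picked up by the next one.

```python
import asyncio
from datetime import datetime, timezone


def utc_now() -> str:
    """Current UTC time as an RFC 3339 timestamp without offset."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")


class IncrementalSyncSketch:
    """Illustrative stand-in for a data source with incremental sync."""

    def __init__(self, list_files, clock=utc_now):
        self._list_files = list_files  # assumed: async generator taking q=...
        self._clock = clock
        self.sync_cursor = None

    async def get_docs_incrementally(self, sync_cursor: str):
        # Record the start time first; it becomes the next cursor.
        started_at = self._clock()
        # Filter at the source so only recently modified files are listed.
        q = f"modifiedTime > '{sync_cursor}'"
        async for file_metadata in self._list_files(q=q):
            yield file_metadata
        # Advance the cursor only after the listing completed.
        self.sync_cursor = started_at


def run_sync(sketch: IncrementalSyncSketch, sync_cursor: str) -> list:
    """Drive the async generator to completion and collect its docs."""

    async def _collect():
        return [doc async for doc in sketch.get_docs_incrementally(sync_cursor)]

    return asyncio.run(_collect())
```

Updating the cursor only at the end means a failed sync leaves the old cursor in place, so the next run retries the same window.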
Acceptance criteria