algolia / firestore-algolia-search

Migration of large dataset returns Error: 9 FAILED_PRECONDITION: The requested snapshot version is too old #101

Closed: rayanoncyber closed this issue 2 years ago

rayanoncyber commented 2 years ago

Hey!

We're trying to migrate our Firestore data to Algolia (~5 million records) and we keep hitting this error. It seems to be related to Firestore failing to fetch that much data in one go. Is there a known or upcoming fix to fetch in batches?

Thanks a lot!

Haroenv commented 2 years ago

Hi @rayanoncyber, where are you getting this error? Do you have a stack trace or similar? I don't recognise this error from this code, so I wonder if it's maybe in Firebase itself?

rayanoncyber commented 2 years ago

This is what I get about 5 minutes after running the command:

[Screenshot: console output showing "Error: 9 FAILED_PRECONDITION: The requested snapshot version is too old"]

Thanks @Haroenv

Haroenv commented 2 years ago

Not too sure what can cause this, but it implies the individual connection is too "old" and that the data should be processed in batches (e.g. https://stackoverflow.com/questions/64712587/google-cloud-pub-sub-function-gives-the-requested-snapshot-version-is-too-old, https://github.com/firebase/functions-samples/issues/890, https://github.com/firebase/firebase-js-sdk/blob/5ad7ff2ae955c297556223e6cb3ad9d4b897f664/packages/firestore/src/remote/rpc_error.ts#L76)

rayanoncyber commented 2 years ago

The error came from the firestore-algolia-search module trying to fetch all documents at once (with big datasets this fails).

I rewrote retrieveDataFromFirestore to process the collection in chunks through a recursive function. Hope it can help people hitting the same problem:

// Fetch one chunk of documents, index it, and return a cursor for the next chunk.
const retrieveChunk = async (lastVisible, maxLength) => {
    const collectionPathParts = config_1.default.collectionPath.split('/');
    const collectionPath = collectionPathParts[collectionPathParts.length - 1];
    let querySnapshot;

    if (lastVisible) {
        // Resume after the last document of the previous chunk.
        querySnapshot = await database.collection(collectionPath).limit(maxLength).startAfter(lastVisible).get();
    } else {
        querySnapshot = await database.collection(collectionPath).limit(maxLength).get();
    }

    // Wait for this chunk to be indexed before fetching the next one.
    await processQuery(querySnapshot).catch(console.error);

    // The last document of the chunk becomes the startAfter cursor for the next call.
    return [querySnapshot.docs[querySnapshot.docs.length - 1], querySnapshot.docs.length];
};

const retrieveDataFromFirestore = async (lastVisible = null) => {
    const maxLength = 100;
    const [lastDoc, length] = await retrieveChunk(lastVisible, maxLength);
    console.log('LENGTH', length);
    // A full chunk means there may be more documents; a short one means we're done.
    if (length === maxLength) await retrieveDataFromFirestore(lastDoc);
};
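
For completeness, a minimal sketch of how this could be kicked off (the logging and exit handling here are my own assumptions, not part of the original extension code):

// Hypothetical entry point: start the chunked export and fail loudly on error.
retrieveDataFromFirestore()
    .then(() => console.log('Finished exporting all documents to Algolia'))
    .catch((err) => {
        console.error('Export aborted:', err);
        process.exit(1);
    });
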
Haroenv commented 2 years ago

This is very useful, thanks! If you are confident in this code, feel free to make a pull request :)

smomin commented 2 years ago

@rayanoncyber did you want to send a PR?

adamunchained commented 2 years ago

@rayanoncyber did you want to send a PR?

It would have been more practical to implement this instead of closing the ticket, as it leaves the migration tool useless for bigger datasets.

aagarcia commented 2 years ago

Has anyone made this change to support large collections?

ddimitrioglo commented 1 year ago

if (lastVisible) {
    querySnapshot = await database.collection(collectionPath).limit(maxLength).startAfter(lastVisible).get();
} else {
    querySnapshot = await database.collection(collectionPath).limit(maxLength).get();
}

Curious whether this approach (a plain .get() with no explicit orderBy) can guarantee a consistent document order across chunks... Usually, chunked pagination goes together with sorting.
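
For what it's worth, a Firestore query without an orderBy() is ordered by document ID by default, so the startAfter cursor above still pages deterministically. A minimal sketch that makes the ordering explicit (assuming the firebase-admin SDK, with database, collectionPath, lastVisible, and maxLength as in the snippet above):

import { FieldPath } from 'firebase-admin/firestore';

// Explicitly order by document ID so the cursor position is unambiguous.
const querySnapshot = await database
    .collection(collectionPath)
    .orderBy(FieldPath.documentId())
    .startAfter(lastVisible)
    .limit(maxLength)
    .get();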

ddimitrioglo commented 1 year ago

Based on the answers above, I ended up with a small wrapper function that helped me go through all Firestore records:

// Types are from 'firebase-admin/firestore' (or '@google-cloud/firestore').
// Defined as a class method, hence the `this.iterateAll` call below.
async *iterateAll<T extends DocumentData>(
    collection: CollectionReference<T>,
    orderBy: Extract<keyof T, string>,
    direction: OrderByDirection = 'desc',
    batchSize: number = 2000,
): AsyncGenerator<QueryDocumentSnapshot<T>> {
    let offset = 0;
    let shouldContinue = true;
    do {
        // Re-run the ordered query, skipping the documents already yielded.
        const query: Query<T> = collection.orderBy(orderBy, direction);
        const querySnapshot: QuerySnapshot<T> = await query.limit(batchSize).offset(offset).get();

        for (const doc of querySnapshot.docs) {
            offset++;
            yield doc;
        }

        // A short page means the collection is exhausted.
        if (querySnapshot.docs.length < batchSize) {
            shouldContinue = false;
        }
    } while (shouldContinue);
}

and used it the following way:

for await (const doc of this.iterateAll<IOrder>(firestore.collection('orders'), 'createdAt')) {
    if (!doc.exists) {
        continue;
    }

    const order = doc.data();
    // handle every order here
}

Hope that helps someone 😉

P.S. instead of createdAt you can use any field that exists in all records and gives a stable ordering.
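
One caveat: offset-based pagination only stays correct if the query order is deterministic between runs. As a hedged sketch (my own addition, reusing the firestore handle from the usage example above), you can add the document ID as an explicit tiebreaker in case several records share the same createdAt value:

import { FieldPath } from 'firebase-admin/firestore';

// Order by createdAt, then by document ID, so ties break deterministically
// and no document is skipped or repeated between pages.
const page = await firestore
    .collection('orders')
    .orderBy('createdAt', 'desc')
    .orderBy(FieldPath.documentId())
    .limit(2000)
    .get();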