bahnzumberg / zuugle-api

Public repository for backend components of www.zuugle.at
https://www.zuugle.at
GNU General Public License v3.0

Parallelization of syncGPXdata #22

Closed. martinheppner closed this issue 8 months ago.

martinheppner commented 1 year ago

In sync.js there is the function syncGPXdata. The purpose of the function is to fetch data chunks from the remote view "vw_gpx_to_search" and store them in the local table "gpx".

The amount of data is huge, so a delta load mechanism is worth thinking about. As a first step, the while loop should run with limited parallelism; 5 to 10 instances can read and write in parallel without a problem.

export async function syncGPXdata(){
    // Empty the local target table before the full load.
    try {
        await knex.raw(`TRUNCATE gpx;`);
    } catch(err){
        console.log('error: ', err)
    }

    let limit = 5000;
    const query_count = await knexTourenDb('vw_gpx_to_search').count('* as anzahl');
    let count_gpx = query_count[0]["anzahl"];
    // Split the rows into chunks of roughly "limit" rows each.
    let count_chunks = Math.round(count_gpx / limit);
    let counter = 0;

    console.log('GPX data count_chunks:', count_chunks)
    console.log('Info: Handling ', count_gpx.toLocaleString("de-de"), ' rows with gpx datapoints.');

    /* The following loop has to be parallelised */
    while(counter < count_chunks){
        // Partition the remote view with a modulo hash over lat/lon so each
        // iteration reads a disjoint chunk of rows.
        const result_query = knexTourenDb('vw_gpx_to_search').select('provider', 'hashed_url', 'typ', 'waypoint', 'lat', 'lon', 'ele').whereRaw(`ROUND(lat*lon*10000) % ${count_chunks} = ${counter}`);
        const result = await result_query;

        try {
            await knex('gpx').insert([...result]);
        } catch(err){
            console.log('error: ', err)
        }
        counter++;
    }
}
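
For reference, a minimal sketch of how the loop could run with bounded concurrency, grouping the chunk numbers into waves and awaiting each wave with Promise.all. It assumes the same knex (local) and knexTourenDb (remote) instances used in sync.js; syncGPXdataParallel, copyChunk and CHUNK_CONCURRENCY are illustrative names, not part of the repository.

// Hypothetical tuning constant: 5 to 10 per the issue description.
const CHUNK_CONCURRENCY = 5;

async function syncGPXdataParallel(){
    // Empty the local target table before the full load.
    try {
        await knex.raw(`TRUNCATE gpx;`);
    } catch(err){
        console.log('error: ', err)
    }

    const limit = 5000;
    const query_count = await knexTourenDb('vw_gpx_to_search').count('* as anzahl');
    const count_gpx = query_count[0]["anzahl"];
    const count_chunks = Math.round(count_gpx / limit);

    // Copies one chunk, selected via the same modulo partitioning as above.
    const copyChunk = async (chunk) => {
        const result = await knexTourenDb('vw_gpx_to_search')
            .select('provider', 'hashed_url', 'typ', 'waypoint', 'lat', 'lon', 'ele')
            .whereRaw(`ROUND(lat*lon*10000) % ${count_chunks} = ${chunk}`);
        try {
            await knex('gpx').insert([...result]);
        } catch(err){
            console.log('error: ', err)
        }
    };

    // Process the chunks in waves of at most CHUNK_CONCURRENCY parallel reads/writes.
    for(let start = 0; start < count_chunks; start += CHUNK_CONCURRENCY){
        const wave = [];
        for(let chunk = start; chunk < Math.min(start + CHUNK_CONCURRENCY, count_chunks); chunk++){
            wave.push(copyChunk(chunk));
        }
        await Promise.all(wave);
    }
}

Awaiting fixed-size waves keeps at most CHUNK_CONCURRENCY queries in flight without pulling in an extra dependency such as a concurrency-limiting library.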
martinheppner commented 9 months ago

This has been implemented as a delta load. Not sure whether the parallelisation is still necessary; reducing the priority.

martinheppner commented 8 months ago

Not necessary anymore; the data is loaded externally on prod, and the full load on UAT is fast enough.