SEED-platform / seed

Standard Energy Efficiency Data (SEED) Platform™ is a web-based application that helps organizations easily manage data on the energy performance of large groups of buildings.

Add parallelization to import tasks #4690

Closed perryr16 closed 2 weeks ago

perryr16 commented 2 weeks ago

Any background context you want to provide?

Importing a file of 500 records with matching fields but different "notes" took 90 seconds. After the improvements, the same import took 16 seconds.

What's this PR do?

Moves a DB query outside of a loop, which saved 25 seconds.
Chunks the incoming data and runs the task match_and_link_incoming_properties_and_taxlots_by_cycle on the chunks in parallel, using the number of Celery workers to determine the number of parallel tasks. Results are aggregated once all tasks have finished. With 5 tasks, all tasks complete within 16 seconds.
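
For reviewers who want a picture of the chunk, fan out, aggregate flow described above, here is a minimal sketch. It is not the code in this PR: it uses a thread pool from the standard library as a stand-in for Celery workers so that it runs on its own, and `chunk`, `match_chunk`, `aggregate`, and `run_import` are hypothetical names. In Celery, a chord (a group of tasks plus a callback) is one way to express the same shape.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, n_chunks):
    """Split items into at most n_chunks contiguous, near-equal slices."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def match_chunk(records):
    """Hypothetical stand-in for the per-chunk matching task.

    The real task matches and links records; this one only counts them.
    """
    return {"processed": len(records)}


def aggregate(results):
    """Sum the per-chunk count dictionaries into one summary."""
    total = {}
    for result in results:
        for key, value in result.items():
            total[key] = total.get(key, 0) + value
    return total


def run_import(records, n_workers):
    """Fan the chunks out to n_workers, then aggregate when all are done."""
    chunks = chunk(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(match_chunk, chunks))
    return aggregate(results)
```

With 500 records and 5 workers, `chunk` produces five slices of 100 records each, which mirrors the "5 tasks" case mentioned above.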

How should this be manually tested?

Upload files and monitor the tasks in Flower.

What are the relevant tickets?

Screenshots (if appropriate)

[Screenshot: 2024-06-12 at 10:52:38 AM]

perryr16 commented 2 weeks ago

Parallelizing the entire match_and_merge task breaks when duplicate properties exist in an import file. We need to be more precise about which steps run in parallel if we are to use parallel tasks.
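
To illustrate the failure mode, here is a toy sketch, not the actual match_and_merge logic. It assumes each record carries a hypothetical `key` field that identifies duplicates, and `dedupe` is a hypothetical helper. When every chunk is deduplicated independently, duplicates that land in different chunks are never compared with each other.

```python
def dedupe(records):
    """Keep the first record seen for each key (toy stand-in for matching)."""
    seen = set()
    kept = []
    for record in records:
        if record["key"] in seen:
            continue
        seen.add(record["key"])
        kept.append(record)
    return kept


# Hypothetical import file: property "A" appears twice.
records = [{"key": "A"}, {"key": "B"}, {"key": "A"}, {"key": "C"}]

# Serial run: the whole file is seen at once, so the duplicate "A" is caught.
serial_kept = dedupe(records)

# Parallel-style run: two chunks, each deduplicated with no shared state.
chunks = [records[:2], records[2:]]
parallel_kept = [record for part in chunks for record in dedupe(part)]

# The two "A" records sit in different chunks, so both survive.
duplicate_keys = [r["key"] for r in parallel_kept].count("A")
```

Here the serial run keeps 3 records while the chunked run keeps 4, because the second "A" is only a duplicate relative to a record in the other chunk.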