Improve the Import Process

SEED-platform / seed

Standard Energy Efficiency Data (SEED) Platform™ is a web-based application that helps organizations easily manage data on the energy performance of large groups of buildings.

Improve the Import Process #4549

Closed kflemin closed 4 weeks ago

kflemin commented 7 months ago

Per peer review feedback, upgrade the data mapping and matching processes to make the workflow more intuitive, including further integration of data types and dual unit support. We also need to improve the performance of the import process and look at better scaling of worker nodes.

More details:

Profile the import process (steps 1 through 6 in the UI): some steps seem to get skipped while others take a long time to run.
- also profile each Celery task and see which ones take the longest to run
- a specific example that is taking a long time (1 second for each imported record): https://github.com/SEED-platform/seed/blob/develop/seed/data_importer/match.py#L724
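
One way to start the profiling described above is to wrap each step or task body in a small timer and rank the totals. This is a minimal sketch using only the standard library; `timed`, `slowest`, and the labels are hypothetical names for illustration, not part of SEED or Celery.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds and call counts, keyed by a step label.
timings = defaultdict(lambda: {"calls": 0, "seconds": 0.0})


@contextmanager
def timed(label):
    """Record how long the enclosed block takes under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        entry = timings[label]
        entry["calls"] += 1
        entry["seconds"] += time.perf_counter() - start


def slowest(n=5):
    """Return (label, calls, total_seconds) tuples, slowest first."""
    ranked = sorted(timings.items(), key=lambda kv: kv[1]["seconds"], reverse=True)
    return [(label, stats["calls"], stats["seconds"]) for label, stats in ranked[:n]]


# Hypothetical usage: the sleeps stand in for real work inside a task.
with timed("geocode"):
    time.sleep(0.001)
for _ in range(3):
    with timed("match_record"):
        time.sleep(0.02)

for label, calls, seconds in slowest():
    print(f"{label}: {calls} call(s), {seconds:.3f}s total")
```

Counting calls alongside total time helps separate a single slow step from a cheap operation repeated once per imported record.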
Investigate whether any steps can be skipped completely. For example:
- Geocoding: if the org doesn't have a MapQuest key or lat/lng/UBID are not provided, skip the geocoding process
- Linking: if there's only one cycle, can we skip the whole linking process completely?
- Pairing: if there are no taxlots in the org and no taxlots in the import, can we skip the pairing process completely?
- Matching: are there improvements we can make here (matching within and without cycles?)
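
The skip conditions above could be expressed as simple predicates evaluated before the tasks are queued. The sketch below follows the wording of the bullets literally; `ImportContext` and its fields are hypothetical and do not correspond to SEED's actual models.

```python
from dataclasses import dataclass


@dataclass
class ImportContext:
    """Hypothetical summary of an organization and an incoming import file."""
    has_mapquest_key: bool
    import_has_location_fields: bool  # lat/lng or UBID present in the file
    cycle_count: int
    org_taxlot_count: int
    import_taxlot_count: int


def should_geocode(ctx):
    # Skip when there is no MapQuest key or no lat/lng/UBID in the import.
    return ctx.has_mapquest_key and ctx.import_has_location_fields


def should_link(ctx):
    # Skip when the organization has only one cycle.
    return ctx.cycle_count > 1


def should_pair(ctx):
    # Skip when neither the organization nor the import has any tax lots.
    return ctx.org_taxlot_count > 0 or ctx.import_taxlot_count > 0


def steps_to_run(ctx):
    """Return the optional steps that still need to run, in order."""
    checks = [
        ("geocoding", should_geocode),
        ("linking", should_link),
        ("pairing", should_pair),
    ]
    return [name for name, check in checks if check(ctx)]


ctx = ImportContext(
    has_mapquest_key=False,
    import_has_location_fields=True,
    cycle_count=1,
    org_taxlot_count=0,
    import_taxlot_count=0,
)
print(steps_to_run(ctx))
```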
Can we make the Progress API endpoint more fault tolerant?
- frontend: keep retrying on error?
- backend: reset the TTL of each key when it updates the progress values?
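
Both ideas can be sketched together: a progress store whose writes push the key's expiry out again, and a poller that retries on error instead of giving up. This is only an illustration. `ProgressStore` is an in-memory stand-in for whatever cache backs the real endpoint, `poll_with_retry` is a hypothetical helper, and the actual frontend would implement the retry in its own language.

```python
import time


class ProgressStore:
    """In-memory stand-in for a key/value cache with per-key expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}  # key -> (expires_at, value)

    def update(self, key, value):
        # Every write resets the expiry, so an active import never times out.
        self._data[key] = (self.clock() + self.ttl, value)

    def get(self, key):
        item = self._data.get(key)
        if item is None or item[0] <= self.clock():
            self._data.pop(key, None)  # drop expired entries
            raise KeyError(key)
        return item[1]


def poll_with_retry(fetch, attempts=3, delay=0.0):
    """Call fetch() up to `attempts` times (at least once), retrying on any exception."""
    last_error = None
    for _ in range(max(1, attempts)):
        try:
            return fetch()
        except Exception as exc:
            last_error = exc
            time.sleep(delay)
    raise last_error


# Hypothetical usage with a fake clock so the expiry behaviour is visible.
now = [0.0]
store = ProgressStore(ttl_seconds=10, clock=lambda: now[0])
store.update("import-1", 10)
now[0] = 8.0
store.update("import-1", 50)   # expiry moves from t=10 to t=18
now[0] = 15.0
print(store.get("import-1"))   # still readable, because the TTL was reset
```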

@axelstudios

axelstudios commented 5 months ago

The mapping step has room for improvement, but by far the biggest bottleneck comes after you hit "Confirm mappings & start matching".

Anecdotally, when uploading a file with 145,921 rows to an existing organization with 145,921 matching rows, these are the Celery tasks and their timings:

- 1x seed.data_importer.tasks._geocode_properties_or_tax_lots (35s)
- 1,460x seed.data_importer.tasks._map_additional_models (1m 8s)
- 1x seed.data_importer.match.match_and_link_incoming_properties_and_taxlots (2d 13h 55m 9s)
  - It took approximately 1h 8m to get to step 3 / 6 (Matching Data (3/6): Merging Unmatched States)
  - Step 3 / 6 took 15h 20m 36s to complete
  - Step 6 / 6 took 1d 21h 23m 30s to complete, most of which was spent inside this loop that takes 1.1 seconds for each record in the import file
- 1x seed.data_importer.tasks.finish_matching (0.3s)

Total matching time: 61.94 hours (2 days 13 hours 56 minutes 33 seconds)
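
As a rough check on the step 6 figure, the per-record cost can be multiplied out against the row count. The sketch below uses only the numbers quoted in this comment and assumes the loop runs exactly once per imported row.

```python
rows = 145_921
seconds_per_record = 1.1

# Time spent in the per-record loop if it runs once per row.
loop_seconds = rows * seconds_per_record

# Reported duration of step 6 / 6: 1d 21h 23m 30s.
step6_seconds = (1 * 24 + 21) * 3600 + 23 * 60 + 30

loop_hours = loop_seconds / 3600
share_of_step6 = loop_seconds / step6_seconds

print(f"loop: {loop_hours:.1f} h, {share_of_step6:.0%} of step 6")
# prints "loop: 44.6 h, 98% of step 6"
```

Under that assumption the loop alone accounts for nearly all of step 6, so reducing the per-record cost matters far more than tuning the other tasks.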

RDmitchell commented 2 months ago

@axelstudios / @kflemin -- I am going to test importing on staging the next few days.

Any files in particular that you would like me to test?

RDmitchell commented 2 months ago

Instance: seeddemostaging SHA: 17f3257f4

See this doc https://docs.google.com/document/d/1hztWNLqtraq_ORPpLHsmyokiO8Cu_Ehnl3FeRTFv1fQ/edit?usp=sharing

The import was decently fast, a few minutes for each step, until the very last matching / merging step, 6/6, where it seems to be "stuck". I am letting it run, but so far it's almost up to 10 minutes. The doc above has more details.

axelstudios commented 4 weeks ago

I think these issues have all been resolved