Closed kflemin closed 4 weeks ago
The mapping step has room for improvement, but by far the biggest bottleneck is after you hit "Confirm mappings & start matching":
Anecdotally, when uploading a file with 145,921 rows to an existing organization with 145,921 matching rows, these are the Celery tasks and their timings:
seed.data_importer.tasks._geocode_properties_or_tax_lots (35s)
seed.data_importer.tasks._map_additional_models (1m 8s)
seed.data_importer.match.match_and_link_incoming_properties_and_taxlots (2d 13h 55m 9s)
  3 / 6 (Matching Data (3/6): Merging Unmatched States) took 15h 20m 36s to complete
  6 / 6 took 1d 21h 23m 30s to complete, most of which was spent inside this loop that takes 1.1 seconds for each record in the import file
seed.data_importer.tasks.finish_matching (0.3s)
Total matching time: 61.94 hours (2 days 13 hours 56 minutes 33 seconds)
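As a rough sanity check on those numbers, the per-record cost can be multiplied out against the row count. This is only back-of-the-envelope arithmetic, assuming the reported 1.1 seconds applies uniformly to every one of the 145,921 records; the variable names are illustrative and not taken from the SEED codebase.

```python
from datetime import timedelta

ROWS = 145_921            # rows in the import file (from the report above)
SECONDS_PER_RECORD = 1.1  # reported per-record cost of the loop

# Estimated time spent in the per-record loop, assuming a uniform cost.
loop_seconds = ROWS * SECONDS_PER_RECORD
loop_time = timedelta(seconds=round(loop_seconds))

# Reported durations for step 6 / 6 and for matching overall.
step_6_time = timedelta(days=1, hours=21, minutes=23, seconds=30)
total_time = timedelta(days=2, hours=13, minutes=56, seconds=33)

share_of_step_6 = loop_seconds / step_6_time.total_seconds()
total_hours = total_time.total_seconds() / 3600

print(f"estimated loop time: {loop_time}")            # 1 day, 20:35:13
print(f"share of step 6 / 6: {share_of_step_6:.0%}")  # 98%
print(f"total matching time: {total_hours:.2f} hours")  # 61.94 hours
```

Under that assumption the loop alone accounts for roughly 1d 20h 35m, which lines up with "most of" the 1d 21h 23m 30s reported for step 6 / 6.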
@axelstudios / @kflemin -- I am going to test importing on staging over the next few days.
Any files in particular that you would like me to test?
Instance: seeddemostaging SHA: 17f3257f4
See this doc https://docs.google.com/document/d/1hztWNLqtraq_ORPpLHsmyokiO8Cu_Ehnl3FeRTFv1fQ/edit?usp=sharing
The import was decently fast, a few minutes for each step, until the very last matching / merging step, 6/6, where it seems to be "stuck". I am letting it run, but so far it's almost up to 10 minutes. The doc above has more details.
I think these issues have all been resolved
Per Peer Review feedback, upgrade data mapping and matching processes to make the workflow more intuitive. Include further integration of data types and dual unit support. Need to improve the performance of the importing process. Look at better scaling of worker nodes.
More details:
Profile the import process (steps 1 through 6 on the UI) - some steps seem to get skipped while others take a long time to run
Investigate whether there are things that can be skipped completely, for example:
Can we improve the Progress API endpoint to be more fault tolerant:
@axelstudios
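For the profiling item above, one lightweight option might be to wrap each import step in a timer and report the slowest steps first. The sketch below is only an illustration: `timed`, `report`, and the step labels are hypothetical names, and the `time.sleep` calls stand in for real import work rather than any actual SEED task.

```python
import time
from contextlib import contextmanager

timings = {}  # step name -> accumulated elapsed seconds


@contextmanager
def timed(step):
    """Record wall-clock time spent inside the `with` block under `step`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[step] = timings.get(step, 0.0) + elapsed


def report(recorded):
    """Return (step, seconds) pairs sorted slowest first."""
    return sorted(recorded.items(), key=lambda item: item[1], reverse=True)


# Hypothetical usage: the sleeps are placeholders for real import steps.
with timed("step 3/6"):
    time.sleep(0.02)
with timed("step 6/6"):
    time.sleep(0.05)

for step, seconds in report(timings):
    print(f"{step}: {seconds:.3f}s")
```

Because the timer accumulates per step name, wrapping a block that runs once per record would also show how much of a step's total is spent in that loop.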