dolthub / dolthub-etl-jobs

ETL jobs that DoltHub maintained that load public data into DoltHub.
Apache License 2.0
18 stars 9 forks source link

airflow_dags/corona-virus/import-data.pl: Use only inserts for obs.sql, instead of inserts and updates. #222

Closed reltuk closed 4 years ago

reltuk commented 4 years ago

Inserts are optimized for dolt import, but UPDATEs can be very slow.

reltuk commented 4 years ago

Example obs.sql output for the new code:

insert into cases (place_id, observation_time, death_count, confirmed_count, recovered_count) values (173, '2020-09-19 00:00:00', 15940, 659656, 589434);
insert into cases (place_id, observation_time, confirmed_count) values (173, '2020-03-16 00:00:00', 62);
insert into cases (place_id, observation_time, recovered_count, confirmed_count, death_count) values (173, '2020-05-22 00:00:00', 10104, 20125, 397);
insert into cases (place_id, observation_time, recovered_count, death_count, confirmed_count) values (173, '2020-06-25 00:00:00', 59974, 2292, 118375);
insert into cases (place_id, observation_time, death_count, confirmed_count) values (173, '2020-04-26 00:00:00', 87, 4546);
insert into cases (place_id, observation_time, recovered_count, confirmed_count, death_count) values (173, '2020-08-02 00:00:00', 347227, 511485, 8366);
insert into cases (place_id, observation_time, confirmed_count, death_count) values (173, '2020-05-08 00:00:00', 8895, 178);
insert into cases (place_id, observation_time, death_count, confirmed_count) values (173, '2020-05-07 00:00:00', 161, 8232);
insert into cases (place_id, observation_time, death_count, confirmed_count, recovered_count) values (173, '2020-07-08 00:00:00', 3602, 224665, 106842);
insert into cases (place_id, observation_time, confirmed_count, death_count, recovered_count) values (173, '2020-05-04 00:00:00', 7220, 138, 2746);
insert into cases (place_id, observation_time, death_count, confirmed_count) values (173, '2020-04-27 00:00:00', 90, 4793);
insert into cases (place_id, observation_time, confirmed_count) values (173, '2020-03-23 00:00:00', 402);
insert into cases (place_id, observation_time, recovered_count, death_count, confirmed_count) values (173, '2020-08-10 00:00:00', 417200, 10621, 563598);
insert into cases (place_id, observation_time, confirmed_count) values (173, '2020-04-09 00:00:00', 1934);
insert into cases (place_id, observation_time, recovered_count, confirmed_count, death_count) values (173, '2020-06-14 00:00:00', 38531, 70038, 1480);
insert into cases (place_id, observation_time, death_count, confirmed_count, recovered_count) values (173, '2020-07-11 00:00:00', 3971, 264184, 127715);
insert into cases (place_id, observation_time, confirmed_count) values (173, '2020-03-15 00:00:00', 51);
insert into cases (place_id, observation_time, recovered_count, death_count, confirmed_count) values (173, '2020-09-08 00:00:00', 567729, 15086, 640441);
insert into cases (place_id, observation_time, death_count, confirmed_count, recovered_count) values (173, '2020-06-11 00:00:00', 1284, 58568, 33252);
insert into cases (place_id, observation_time, recovered_count, death_count, confirmed_count) values (173, '2020-09-06 00:00:00', 563891, 14889, 638517);
insert into cases (place_id, observation_time, recovered_count, confirmed_count, death_count) values (173, '2020-09-05 00:00:00', 561204, 636884, 14779);
reltuk commented 4 years ago

Old code corresponding insert and update statements for that last row:

insert into cases (place_id, observation_time) values (173, '2020-09-05 00:00:00');
update cases set death_count = 14779 where place_id=173 and observation_time='2020-09-05 00:00:00';
update cases set confirmed_count = 636884 where place_id=173 and observation_time='2020-09-05 00:00:00';
update cases set recovered_count = 561204 where place_id=173 and observation_time='2020-09-05 00:00:00';
reltuk commented 4 years ago

The UPDATE version on my laptop takes many minutes to run...I canceled it after five, and it had inserted 7,000 of the 37,000 rows at that point. The INSERT version takes < 5 seconds:

aaronson@Aarons-MacBook-Pro import-data % time dolt sql < obs.sql
Query OK, 37927 rows affected
Rows inserted: 1000 Rows updated: 0 Rows deleted: 0
Rows inserted: 2000 Rows updated: 0 Rows deleted: 0
Rows inserted: 3000 Rows updated: 0 Rows deleted: 0
Rows inserted: 4000 Rows updated: 0 Rows deleted: 0
Rows inserted: 5000 Rows updated: 0 Rows deleted: 0
Rows inserted: 6000 Rows updated: 0 Rows deleted: 0
Rows inserted: 7000 Rows updated: 0 Rows deleted: 0
Rows inserted: 8000 Rows updated: 0 Rows deleted: 0
Rows inserted: 9000 Rows updated: 0 Rows deleted: 0
Rows inserted: 10000 Rows updated: 0 Rows deleted: 0
Rows inserted: 11000 Rows updated: 0 Rows deleted: 0
Rows inserted: 12000 Rows updated: 0 Rows deleted: 0
Rows inserted: 13000 Rows updated: 0 Rows deleted: 0
Rows inserted: 14000 Rows updated: 0 Rows deleted: 0
Rows inserted: 15000 Rows updated: 0 Rows deleted: 0
Rows inserted: 16000 Rows updated: 0 Rows deleted: 0
Rows inserted: 17000 Rows updated: 0 Rows deleted: 0
Rows inserted: 18000 Rows updated: 0 Rows deleted: 0
Rows inserted: 19000 Rows updated: 0 Rows deleted: 0
Rows inserted: 20000 Rows updated: 0 Rows deleted: 0
Rows inserted: 21000 Rows updated: 0 Rows deleted: 0
Rows inserted: 22000 Rows updated: 0 Rows deleted: 0
Rows inserted: 23000 Rows updated: 0 Rows deleted: 0
Rows inserted: 24000 Rows updated: 0 Rows deleted: 0
Rows inserted: 25000 Rows updated: 0 Rows deleted: 0
Rows inserted: 26000 Rows updated: 0 Rows deleted: 0
Rows inserted: 27000 Rows updated: 0 Rows deleted: 0
Rows inserted: 28000 Rows updated: 0 Rows deleted: 0
Rows inserted: 29000 Rows updated: 0 Rows deleted: 0
Rows inserted: 30000 Rows updated: 0 Rows deleted: 0
Rows inserted: 31000 Rows updated: 0 Rows deleted: 0
Rows inserted: 32000 Rows updated: 0 Rows deleted: 0
Rows inserted: 33000 Rows updated: 0 Rows deleted: 0
Rows inserted: 34000 Rows updated: 0 Rows deleted: 0
Rows inserted: 35000 Rows updated: 0 Rows deleted: 0
Rows inserted: 36000 Rows updated: 0 Rows deleted: 0
Rows inserted: 37000 Rows updated: 0 Rows deleted: 0
Rows inserted: 37927 Rows updated: 0 Rows deleted: 0
dolt sql < obs.sql  7.17s user 0.23s system 165% cpu 4.462 total