catapult-project / catapult

Deprecated Catapult GitHub. Please use the http://crbug.com "Speed>Benchmarks" component for bugs and https://chromium.googlesource.com/catapult for downloading and editing source code.
BSD 3-Clause "New" or "Revised" License

[🔊] Decouple table creation from row insertion #4442

perezju closed this issue 6 years ago

perezju commented 6 years ago

The current pandas_sqlite.InsertOrReplaceRecords (as well as the underlying pandas.io.sql methods) takes a data frame as input and, in a single call, creates the table if it does not exist yet and then inserts the rows.

This works well, except when we try to involve multiprocessing. There is a race condition in which several child processes may detect that the table does not yet exist, and all of them try to create it at once.

To solve this we should:

  1. On the parent process, if it doesn't exist already, create a new (empty) table.
  2. All of the child processes only add rows to an existing table.

To implement 1, however, we also need to explicitly tell pandas the expected type of each column (to build the needed "empty" frame), instead of letting it infer the types from the data returned by dashboard API calls. That is probably a good thing anyway.
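A minimal sketch of the two steps, using only the standard-library sqlite3 module rather than pandas or the actual pandas_sqlite code. The schema (COLUMN_TYPES, PRIMARY_KEY) and both function names are hypothetical and only illustrate the split between creating the table from explicit types and inserting rows:

```python
import sqlite3

# Hypothetical schema: these column names and SQL types are illustrative,
# not the ones used by the real tables.
COLUMN_TYPES = {
    'test_path': 'TEXT',
    'revision': 'INTEGER',
    'value': 'REAL',
}
PRIMARY_KEY = ('test_path', 'revision')


def create_table_if_needed(con, table):
    """Step 1 (parent process): create the empty table from explicit types."""
    columns = ', '.join(
        '{} {}'.format(name, sql_type)
        for name, sql_type in COLUMN_TYPES.items())
    con.execute(
        'CREATE TABLE IF NOT EXISTS {} ({}, PRIMARY KEY ({}))'.format(
            table, columns, ', '.join(PRIMARY_KEY)))
    con.commit()


def insert_or_replace_rows(con, table, rows):
    """Step 2 (child processes): only add rows; never create the table.

    Raises sqlite3.OperationalError if the table does not exist, so a
    missing call to step 1 fails loudly instead of racing to create it.
    """
    placeholders = ', '.join('?' for _ in COLUMN_TYPES)
    con.executemany(
        'INSERT OR REPLACE INTO {} VALUES ({})'.format(table, placeholders),
        rows)
    con.commit()
```

In this sketch the parent would call create_table_if_needed once before starting any workers, and each child would open its own connection to the database file and call only insert_or_replace_rows.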

@zeptonaut