eldigo / ha-sqllite-2-influxdb

This script transfers data from a SQLite database to InfluxDB.
MIT License

Optimizing / expectations for large databases? #1

Open blackfeather9 opened 6 days ago

blackfeather9 commented 6 days ago

First of all, thank you so much for this tool! The existing tools I found did not work as desired even after hours of troubleshooting Python dependencies, etc. Your script is successfully migrating my historical data to InfluxDB; it's been running for 24 hours at this point.

My SQLite database has over 5 million rows (about 1.3GB) covering a year of data from about 10 sensors, plus the rest of the Recorder database noise. I'm running it against a backup of my HA database on a system with a fast, 16-thread CPU. It's connecting to the InfluxDB server over a gigabit connection, and that server has 4 cores and fast SSD storage. Neither machine is being taxed at all, compute-wise.

I am running it with DEBUG=TRUE and have uncommented the logging.info(f"Processed {rows_fetched} rows so far.") line, and I can see it's averaging about 3,400 rows/minute using ~1% CPU on my workstation and ~20% on the InfluxDB server. Batch size is set at 10,000 so I get frequent progress updates.
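
For reference, this is roughly the loop I think produces that log line (a sketch only; the query and everything besides rows_fetched are my guesses, using the standard sqlite3 module):

  # Sketch of the batched fetch loop implied by the progress log line;
  # the actual script may be structured differently.
  import logging
  import sqlite3

  logging.basicConfig(level=logging.INFO)
  BATCH_SIZE = 10_000  # smaller batches mean more frequent progress updates

  conn = sqlite3.connect("home-assistant_v2.db")  # backup copy of the Recorder DB
  cur = conn.cursor()
  cur.execute("SELECT * FROM states")  # illustrative query, not the script's own

  rows_fetched = 0
  while True:
      batch = cur.fetchmany(BATCH_SIZE)
      if not batch:
          break
      rows_fetched += len(batch)
      logging.info(f"Processed {rows_fetched} rows so far.")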

What was your experience performance-wise when migrating your data? Is there any way we could potentially speed up the process? I may need to run this several times. I'm a novice with Python, but an LLM suggested alternatives using the InfluxDB API or threaded processing. Maybe I'm just impatient, but I'm happy to test any options or benchmark against my own dataset. Cheers!

eldigo commented 5 days ago

Hi, I was running all instances locally and I think I got about 1,000 rows per second. But you need to turn off debugging: with debug mode true, all records are inserted individually; with debug false, the whole batch (or what's left of it after some validation checks) is inserted at once.
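
Roughly, the difference looks like this (a minimal sketch of the pattern, not the script's exact code; it assumes the influxdb_client package for InfluxDB 2.x, and the connection details and bucket name are placeholders):

  from influxdb_client import InfluxDBClient
  from influxdb_client.client.write_api import SYNCHRONOUS

  DEBUG = False  # True: one write per record; False: one write per batch

  client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
  write_api = client.write_api(write_options=SYNCHRONOUS)

  def flush(points):
      # points: a list of influxdb_client Point objects built from SQLite rows
      if DEBUG:
          # One network round trip per record: easy to trace, but very slow.
          for p in points:
              write_api.write(bucket="home_assistant", record=p)
      else:
          # One round trip for the whole batch: orders of magnitude faster.
          write_api.write(bucket="home_assistant", record=points)

That one-round-trip-per-record overhead is what holds a debug run down to a few thousand rows per minute.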

blackfeather9 commented 3 days ago

Hi there, appreciate the tip! I am running a new import and it's screaming fast. Once I confirmed it was copying correctly, I turned off debug mode, and it has now migrated almost 2 million rows in a matter of minutes. Guess I didn't look at the code closely enough.

It seems to be merging flawlessly with my active bucket now that I've added the following to the influxdb config in Home Assistant:

  tags:
    source: HA
  tags_attributes:
    - friendly_name
  default_measurement: units
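
For anyone else landing here, those options go under the influxdb: entry in configuration.yaml; the connection values below are placeholders for your own InfluxDB 2.x setup:

  influxdb:
    api_version: 2
    host: influxdb.local
    port: 8086
    token: !secret influxdb_token
    organization: my-org
    bucket: home_assistant
    tags:
      source: HA
    tags_attributes:
      - friendly_name
    default_measurement: units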

This is amazing, thanks for solving a challenging problem!