bmoscon / cryptofeed

Cryptocurrency Exchange Websocket Data Feed Handler

InfluxDB speed issue #228

Closed by rbdm-qnt 4 years ago

rbdm-qnt commented 4 years ago

I know that this problem is NOT a cryptofeed/cryptostore problem, and I can remove the post if it's considered too off-topic! I've asked about this in all the related GitHub repos and forums and received no answers, so I'm trying here, since I expect a lot of people who use cryptofeed to use Influx too, and for the exact same application as me.

I'm querying BitMEX tick data to process and aggregate it. These were my query times (number of rows on the left, query time in h:mm:ss on the right), where every query covers one month of data:

9480222, 2:07:46
12839124, 3:06:02
17256737, 4:19:54
13716707, 3:28:37
12671435, 2:35:27
11112483, 2:15:53
17055181, 3:34:21
21232810, 6:29:42
16935780, 4:47:56

Those numbers seem a bit off. The average is around 60-70k rows per minute, roughly 1k rows per second. Since this is my first experience with time-series databases and with Influx, would you consider this performance normal? Do you guys have roughly the same query times? I'm running InfluxDB 1.7.9, the Influx Python client 5.2.3, and Python 3.7; this was run from PyCharm on a MacBook Pro with 16 GB of RAM. I don't use any add-on like Telegraf or Kapacitor.

bmoscon commented 4 years ago

That seems slow, but you're not running on dedicated hardware, so I'm not too surprised. Most databases will not run well without dedicated hardware, ideally in a cluster.

rbdm-qnt commented 4 years ago

> That seems slow, but you're not running on dedicated hardware, so I'm not too surprised. Most databases will not run well without dedicated hardware, ideally in a cluster.

What would you suggest? I'm considering setting up a dedicated desktop computer just for this task.

rbdm-qnt commented 4 years ago

Unfortunately, with Influx I think clustering is only available in the paid Enterprise version.

bmoscon commented 4 years ago

https://docs.influxdata.com/influxdb/v1.7/guides/hardware_sizing/#single-node-or-cluster

https://www.influxdata.com/blog/optimizing-influxdb-performance-for-high-velocity-data/

rbdm-qnt commented 4 years ago

Thank you. Mind leaving this open for a while for other people to chip in too?

cheekybastard commented 4 years ago

2 hours for a query suggests you might have run out of memory and the system is using your hard drive for swap.

For big batch queries on a 16 GB box, you might be better off with Parquet files for storage and then using Dask/pandas for the aggregations. Dask can do aggregations on datasets larger than system memory.

With a laptop's 5400 RPM HDD, if you use Parquet files with Snappy compression, ~10 million rows load in under 5 seconds (~500 MB of disk space). Then, with pandas, most summary-stat aggregations take under 5 seconds; something like resampling to 1-minute OHLC might take a bit longer.
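
To make that concrete, here is a rough sketch under some assumptions: trade files live under ./trades/ and have 'timestamp' (epoch seconds), 'price' and 'amount' columns. Adjust the paths and names to your actual schema.

```python
# Sketch only: out-of-core stats with Dask plus an in-memory 1-minute OHLCV
# resample with pandas. Paths and column names are assumptions.
import dask.dataframe as dd
import pandas as pd

# Dask treats every Parquet file in the directory as a single dataset and reads
# it lazily, so the full history never has to fit in 16 GB of RAM at once.
trades = dd.read_parquet("./trades/")

# Summary statistics run out-of-core; nothing materializes until .compute().
print(trades["price"].describe().compute())

# One file (~100-200 MB on disk) fits comfortably in memory, so plain pandas
# can handle the 1-minute OHLCV resample.
chunk = pd.read_parquet("./trades/2019-01.parquet")
chunk.index = pd.to_datetime(chunk["timestamp"], unit="s")
ohlcv = chunk["price"].resample("1min").ohlc()
ohlcv["volume"] = chunk["amount"].resample("1min").sum()
print(ohlcv.head())
```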

rbdm-qnt commented 4 years ago

> 2 hours for a query suggests you might have run out of memory and the system is using your hard drive for swap.
>
> For big batch queries on a 16 GB box, you might be better off with Parquet files for storage and then using Dask/pandas for the aggregations. Dask can do aggregations on datasets larger than system memory.
>
> With a laptop's 5400 RPM HDD, if you use Parquet files with Snappy compression, ~10 million rows load in under 5 seconds (~500 MB of disk space). Then, with pandas, most summary-stat aggregations take under 5 seconds; something like resampling to 1-minute OHLC might take a bit longer.

Incredibly useful reply, thank you! Upon closer inspection, it turns out that yes, my Mac is indeed using swap memory. I'm looking into Dask too. Regarding Parquet: should I convert my Influx database to Parquet by running one last query and converting and resaving all the historical data, or is there a more automatic/smarter way? And is there a way to use Parquet files to save live data?

cheekybastard commented 4 years ago

The InfluxDB Python package has a DataFrame client for pandas, and pandas DataFrames have a to_parquet method for saving Parquet files. Parquet treats all the Parquet files in one folder as a dataset, so you don't have to save all the historical data in one query (one file). For out-of-core aggregations with Dask you ideally want 100-200 MB per file: reads and queries parallelize better and it doesn't eat all your spare memory.
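
If it helps, here is a rough sketch of that conversion, one month per file so each stays roughly in the 100-200 MB range. Nothing here is prescriptive: the database name, measurement ('trades') and paths are placeholders for whatever your schema actually uses.

```python
# Sketch only: dump an InfluxDB measurement to monthly Parquet files.
# The database name, measurement ('trades') and paths are placeholders.
import os

from influxdb import DataFrameClient

client = DataFrameClient(host="localhost", port=8086, database="bitmex")
os.makedirs("./trades", exist_ok=True)

# Month boundaries; extend the list to cover your full history. If one month
# doesn't fit in RAM, split it into smaller date ranges.
months = ["2019-01-01T00:00:00Z", "2019-02-01T00:00:00Z", "2019-03-01T00:00:00Z"]

for start, end in zip(months[:-1], months[1:]):
    # DataFrameClient.query returns a dict of DataFrames keyed by measurement.
    result = client.query(
        f"SELECT * FROM trades WHERE time >= '{start}' AND time < '{end}'"
    )
    df = result["trades"]
    # One file per month; Snappy compression is the pandas default for Parquet.
    df.to_parquet(f"./trades/{start[:7]}.parquet")
```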

For saving live data straight to parquet: https://github.com/bmoscon/cryptostore

rbdm-qnt commented 4 years ago

Thank you so, so much. I didn't realise cryptostore had native support for Parquet! So I imagine I can also run a backfill into Parquet files, correct? And what do you mean by out-of-core?

Thank you so much again, you are saving me months of time! Very informative for any new users, I'm sure.

cheekybastard commented 4 years ago

I don't use cryptostore; I have my own Parquet catalogue/store. I tried it for a week last year and I might be wrong, but I think it backfills initially, not after any interruptions later on. Using it to save Parquet files is a lot more resource-efficient than Influx bloatware; bmoscon's asyncio code is pretty good.

out-of-core == https://en.wikipedia.org/wiki/External_memory_algorithm

rbdm-qnt commented 4 years ago

Yes, that is correct: cryptostore backfills only from a specified date up to the first date you stored in real time. In another issue that I can't find anymore, bmoscon said it will take a major rewrite, but that he is going to fix this.

I have one last silly question: after I save all the historical data into Parquet files, let's say I only want to access tick data from 1/1/2019 to 31/12/2019. Is there a way to do this like you would in a database query with two dates?

For @bmoscon: are there ways to tell cryptostore the specifics for this? For example, store Parquet tick data here and order book data there, or under different names if in the same folder, so it's possible to tell which files are which, and also save them at a specific size (100-200 MB) like cheekybastard was advising? In the config I can only see some options for S3 and GCS, but nothing this specific.

bonald commented 4 years ago

[screenshot of the query result]

Here's a test I did on my end: 1,111,111 rows, execution time 15.7 seconds.

Dedicated Xeon E-1275v6, 32 GB RAM.

rbdm-qnt commented 4 years ago

> [screenshot of the query result]
>
> Here's a test I did on my end: 1,111,111 rows, execution time 15.7 seconds.
>
> Dedicated Xeon E-1275v6, 32 GB RAM.

Interesting! Does this scale linearly? Like 2 minutes and 30 seconds for 11 million rows, etc.? Mine doesn't: small queries run relatively fast (though I don't think as fast as yours), and big queries get slower and slower.

bonald commented 4 years ago

Do you need every tick? Can you aggregate OHLC with GROUP BY time(xx)? That's much faster.
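
For example, something along these lines pushes the bucketing into Influx instead of pulling every tick; the measurement ('trades') and field names ('price', 'amount') are only placeholders.

```python
# Sketch only: 1-minute OHLCV aggregated server-side with GROUP BY time(1m).
# Measurement ('trades') and field names ('price', 'amount') are placeholders.
from influxdb import DataFrameClient

client = DataFrameClient(host="localhost", port=8086, database="bitmex")

query = """
SELECT first(price) AS open,
       max(price)   AS high,
       min(price)   AS low,
       last(price)  AS close,
       sum(amount)  AS volume
FROM trades
WHERE time >= '2019-01-01T00:00:00Z' AND time < '2019-02-01T00:00:00Z'
GROUP BY time(1m)
"""

# Returns a dict of DataFrames keyed by measurement; far fewer rows come back
# than with a raw tick query.
ohlcv = client.query(query)["trades"]
print(ohlcv.head())
```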

rbdm-qnt commented 4 years ago

Unfortunately, I'm doing additional calculations on the data, not just OHLCV, so I do need every tick.

rbdm-qnt commented 4 years ago

Today I tested the Parquet file saving option from cryptostore, and it's perfect! I only have one question left for @bmoscon: is the L2 save time interval the right way of controlling the size of the Parquet files? I was thinking of doing this: with Influx I was using a 30 s save interval in the cryptostore config; so if 30 s worth of L2 data in a Parquet file weighs, let's say, 1 MB, and I want my Parquet files between 100-200 MB like @cheekybastard said, I could do 30 s * 100 and put that as the save interval. My trades data files would weigh less, but at least no file would exceed 200 MB by a substantial margin.

Is this the best way to do it? Did I interpret the config file correctly?

cheekybastard commented 4 years ago

> I want my Parquet files between 100-200 MB like @cheekybastard said, I could do 30 s * 100 and put that as the save interval.

I suggested that size for your data already archived in Influx. The smaller Parquet files that come from cryptostore saving regularly will not add much of a performance hit; Parquet clients multi-thread when reading a dataset (directory).
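
As a rough sketch of that kind of dataset read, which also covers the earlier two-dates question in a WHERE-clause-like way (directory layout and column names are assumptions):

```python
# Sketch only: "query" a directory of Parquet files by date range with Dask.
# Directory layout and column names are assumptions.
import dask.dataframe as dd
import pandas as pd

trades = dd.read_parquet("./trades/")            # lazy, multi-threaded read
trades["ts"] = dd.to_datetime(trades["timestamp"], unit="s")

start, end = pd.Timestamp("2019-01-01"), pd.Timestamp("2020-01-01")
year_2019 = trades[(trades["ts"] >= start) & (trades["ts"] < end)]

df = year_2019.compute()                         # data is only read and filtered here
print(len(df))
```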

rbdm-qnt commented 4 years ago

> > I want my Parquet files between 100-200 MB like @cheekybastard said, I could do 30 s * 100 and put that as the save interval.
>
> I suggested that size for your data already archived in Influx. The smaller Parquet files that come from cryptostore saving regularly will not add much of a performance hit; Parquet clients multi-thread when reading a dataset (directory).

I'm gonna erase my Influx database and re-backfill from scratch anyway, since I had some holes in my March data, so I guess I'll try something in between, like 600 s for example. Thank you so much for your input; I wish I could have read this thread 3 months ago :)

xiandong79 commented 4 years ago

I'm not sure whether the storage_interval can be customized for each "save_type" or not.

For InfluxDB, 30-60 seconds is good, but 30-60 seconds is too short for Parquet and would result in many, many small files. 30 minutes or 1 hour is, in my opinion, a good solution; a very long interval like 12 hours may exhaust memory, since the data sits in Redis until it is written.

rbdm-qnt commented 4 years ago

> I'm not sure whether the storage_interval can be customized for each "save_type" or not.
>
> For InfluxDB, 30-60 seconds is good, but 30-60 seconds is too short for Parquet and would result in many, many small files. 30 minutes or 1 hour is, in my opinion, a good solution; a very long interval like 12 hours may exhaust memory, since the data sits in Redis until it is written.

I don't think so either. I'll just ditch Influx. Thank you for your input!

bmoscon commented 4 years ago

@rbdm-qnt there is no way to control the size of the files, cryptostore works with time intervals, not size intervals

rbdm-qnt commented 4 years ago

Thanks moscon, I saw that once I finished my backfill. It's working perfectly now!

xiandong79 commented 4 years ago

@bmoscon can the "storage interval" be customized for each storage type?

mentioned here: https://github.com/bmoscon/cryptostore/issues/19

xiandong79 commented 4 years ago

@rbdm-qnt do you think this suggestion would be useful, if you want to use Parquet files?

https://github.com/bmoscon/cryptostore/pull/39 by @ghost

rbdm-qnt commented 4 years ago

> @rbdm-qnt do you think this suggestion would be useful, if you want to use Parquet files?
>
> bmoscon/cryptostore#39 by @ghost

As far as I know, the storage interval can't be customized for each storage type. The suggestion in #39 is not ideal for me, because my functions already process the order book with the previous syntax and I'd rather not change it, as I'd have to rewrite a lot of code.

bmoscon commented 4 years ago

This is old and seems resolved; at least, there is nothing to fix. Closing.