That seems slow, but you're not running on dedicated hardware, so I'm not too surprised. Most databases will not run well without dedicated hardware, ideally in a cluster.
What would you suggest? I'm considering setting up a dedicated desktop computer just for this task
Unfortunately, with Influx I think clustering is only available in the paid Enterprise version.
Thank you. Mind leaving this open for a while for other people to chip in too?
2hrs for a query suggests you might have run out of memory & the system is using your hard drive for swap memory.
For big batch queries on a 16GB box, you might be better off with parquet files for storage and then using dask/pandas for aggregations etc. Dask can do aggregations on datasets larger than system memory.
With laptop 5400RPM HDDs, if you use parquet files with snappy compression, ~10 million rows would load in < 5 sec (~500 MB disk space). Then, using pandas, most summary-stat aggregations would be < 5 sec. Something like resampling to 1-min OHLC might take a bit longer.
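For anyone following along, here is a minimal sketch of that workflow; the column names ("timestamp", "price", "size", "side") and folder paths are assumptions for illustration, not taken from this thread:

```python
# Sketch only: adjust column names and paths to however the ticks were saved.
import dask.dataframe as dd
import pandas as pd

# Out-of-core with dask: every parquet file in the folder is treated as one
# dataset and read partition by partition, so it can exceed system memory.
ddf = dd.read_parquet("ticks/")
volume_per_side = ddf.groupby("side")["size"].sum().compute()

# For a slice that fits in memory, plain pandas is enough: load, index by
# time, then resample the ticks to 1-minute OHLC bars.
df = pd.read_parquet("ticks/2019-01/", engine="pyarrow")
df = df.set_index(pd.to_datetime(df["timestamp"], unit="s")).sort_index()
bars = df["price"].resample("1min").ohlc()
bars["volume"] = df["size"].resample("1min").sum()
```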
Incredibly useful reply, thank you! Upon closer inspection, it turns out that, yes, my Mac is indeed using swap memory. I'm looking into Dask too. Regarding parquet, should I convert my Influx database to parquet by running one last query and converting and re-saving all the historicals, or is there a more automatic/smarter way? Is there a way to use parquet files to save live data?
The InfluxDB Python client has a DataFrame client for pandas, and a pandas DataFrame has a to_parquet method for saving parquet files. Parquet treats all parquet files in one folder as a dataset, so you don't have to save all the historical data in one query (one file). For out-of-core aggregations with dask you ideally want 100-200 MB per file: reads/queries parallelize faster and it doesn't use all your spare memory.
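A rough sketch of that export path; the database/measurement names and the daily chunking are assumptions, so pick a chunk size that lands each file around 100-200 MB for your own data:

```python
# Sketch only: database "crypto" and measurement "trades" are assumptions.
import os
from datetime import datetime, timedelta

from influxdb import DataFrameClient

client = DataFrameClient("localhost", 8086, database="crypto")
os.makedirs("ticks", exist_ok=True)

start, end = datetime(2019, 1, 1), datetime(2019, 2, 1)
day = start
while day < end:
    nxt = day + timedelta(days=1)
    q = (f"SELECT * FROM trades WHERE time >= '{day.isoformat()}Z' "
         f"AND time < '{nxt.isoformat()}Z'")
    result = client.query(q)          # dict of DataFrames keyed by measurement
    if "trades" in result:
        # One file per day; all files in the folder form a single dataset.
        result["trades"].to_parquet(
            f"ticks/trades-{day:%Y-%m-%d}.parquet", compression="snappy"
        )
    day = nxt
```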
For saving live data straight to parquet: https://github.com/bmoscon/cryptostore
Thank you so so much. Didn't realise cryptostore had native support for parquet! So I imagine I can also run a backfill on parquet files, correct? And what do you mean by out-of-core?
Thank you so much again, you are saving months of my time!! Very informative for any new users I'm sure
I don't use cryptostore; I have my own parquet catalogue/store. I tried it for a week last year, and I might be wrong, but I think it backfills initially, not for any interruptions after that. Using it to save parquet files is a lot more resource-efficient than Influx bloatware; bmoscon's asyncio code is pretty good.
out-of-core == https://en.wikipedia.org/wiki/External_memory_algorithm
Yes, that is correct: cryptostore backfills only from the specified date to the first date you stored in real time. In another issue that I can't find anymore, bmoscon said it's going to be a major rewrite, but that he's going to fix this.
I have one last silly question: after I save all the historical data into parquet files, let's say I only want to access tick data from 1/1/2019 to 31/12/2019. Is there a way to do this like you would in a database query with 2 dates?
For @bmoscon: are there ways to tell cryptostore the specifics for this? For example, store parquet tick data here and OB data there (or with different names if in the same folder, to be able to tell which files are which), and also save them with a specific size (100-200 MB) like cheekybastard was advising? In the config I can only see some options for S3 and GCS, but nothing this specific.
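On the two-dates question above: a parquet folder isn't a database, but the readers can do the equivalent of a range query. A minimal sketch, assuming the ticks carry a datetime "timestamp" column (that column name/type is not confirmed anywhere in this thread):

```python
# Sketch only: the "timestamp" column name/type is an assumption about how
# the data was written; adjust if it's stored as epoch seconds instead.
import dask.dataframe as dd
import pandas as pd

start = pd.Timestamp("2019-01-01")
end = pd.Timestamp("2019-12-31 23:59:59.999999")

ddf = dd.read_parquet("ticks/")                        # whole folder = one dataset
ticks_2019 = ddf[(ddf["timestamp"] >= start) & (ddf["timestamp"] <= end)]
df = ticks_2019.compute()                              # materialise when needed

# dd.read_parquet also accepts
#   filters=[("timestamp", ">=", start), ("timestamp", "<=", end)]
# which lets it skip files/row groups entirely instead of filtering after loading.
```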
Here's a test I did on my end: 1,111,111 rows, execution time 15.7 sec.
Dedicated Xeon E-1275v6, 32 GB RAM.
Interesting! Does this scale linearly? Like 2 minutes and 30 seconds for 11 million rows, etc.? Mine doesn't: small queries go relatively fast (I don't think as fast as yours though), but big queries go slower and slower.
Do you need every tick? Can you aggregate OHLC with GROUP BY time(xx)? That's much faster.
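For context, that suggestion would look roughly like this with the Python client; the "trades"/"price"/"size" names are assumptions, not taken from this thread:

```python
# Sketch only: measurement and field names are assumptions.
from influxdb import DataFrameClient

client = DataFrameClient("localhost", 8086, database="crypto")

# InfluxQL aggregates the ticks server-side into 1-minute OHLCV buckets,
# so far less data comes back than a raw SELECT of every tick.
q = """
SELECT first(price) AS open, max(price) AS high,
       min(price)  AS low,  last(price) AS close,
       sum(size)   AS volume
FROM trades
WHERE time >= '2019-01-01T00:00:00Z' AND time < '2019-02-01T00:00:00Z'
GROUP BY time(1m) fill(none)
"""
ohlcv = client.query(q)["trades"]   # pandas DataFrame indexed by time
```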
Unfortunately, I'm doing additional calculations on the data, not just OHLCV, so I do need every tick.
Today I tested the parquet file saving option from cryptostore, and it's perfect! I only have one question left for @bmoscon: is the L2 save time interval the right way of controlling the size of the parquet files? I was thinking of doing this: with Influx I was using a 30s save interval in the cryptostore config; so if 30s worth of L2 data in a parquet file weighs, let's say, 1 MB, and I want my parquet files between 100-200 MB like @cheekybastard said, I could do 30s * 100 and put that as the save interval. My trades data files would weigh less, but at least no file would exceed 200 MB by a substantial margin.
Is this the best way to do it? Did I interpret the config file correctly?
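Spelling out the arithmetic in that question (the 1 MB per 30 s figure is the hypothetical from the comment above, not a measured number, and the replies below qualify whether this sizing matters for live saving at all):

```python
# Illustrative only: measure the real MB-per-interval on your own feed first.
mb_per_30s = 1.0                 # hypothetical size of 30s of L2 data on disk
target_mb = 150                  # middle of the suggested 100-200 MB range

storage_interval = 30 * (target_mb / mb_per_30s)       # seconds per file
print(f"storage_interval ≈ {storage_interval:.0f} s "
      f"(~{storage_interval / 60:.0f} min)")
# -> storage_interval ≈ 4500 s (~75 min)
```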
I want my parquet files between 100-200 MB like @cheekybastard said, I could do 30s * 100 and put that as the save interval.
I suggested that size for your data already archived in Influx. Smaller parquet files resulting from cryptostore saving regularly will not add much of a performance hit; parquet clients multi-thread when reading a dataset (directory).
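On the "parquet clients multi-thread when reading a dataset (directory)" point, a minimal illustration with pyarrow; the folder path is a placeholder for wherever cryptostore actually writes:

```python
# Sketch only: "ticks/" stands in for the actual cryptostore output folder.
import pyarrow.parquet as pq

# pyarrow treats the whole directory as one dataset and, with use_threads=True
# (the default), reads its many files/row groups in parallel.
table = pq.read_table("ticks/", use_threads=True)
df = table.to_pandas()
print(df.shape)
```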
I'm gonna erase my Influx database and re-backfill from scratch anyway, since I had some holes in my March data. So I guess I'll try something in between, like 600s for example. Thank you so much for your input, I wish I could have read this thread 3 months ago :)
Not sure whether the storage_interval can be customized for each "save_type" or not. For influxdb, 30-60 seconds is good, but 30-60 seconds is too short for parquet, which would result in many, many small files. 1 hour or 30 mins is, imo, a good solution; a large duration like 12 hours may exhaust memory via the Redis buffer size.
I don't think so either. I'll just ditch Influx. Thank you for your input!
@rbdm-qnt there is no way to control the size of the files, cryptostore works with time intervals, not size intervals
Thanks moscon, I saw that once I finished my backfill. It's working perfectly now!
@bmoscon can the "storage interval" be customized for each storage type?
mentioned here: https://github.com/bmoscon/cryptostore/issues/19
@rbdm-qnt do you think this suggestion is useful, if you want to use parquet files?
bmoscon/cryptostore#39 by @ghost
As far as I know the storage interval can't be customized for each storage type. The suggestion in #39 is not ideal for me because my functions already process the OB with the previous syntax; I'd rather not change it, as I'd have to rewrite a lot of code.
This is old and seems resolved - at least, there is nothing to fix. Closing.
I know that this problem is NOT a cryptofeed/cryptostore problem, and I can remove the post if it's considered too off-topic! I tried asking about this in all the related GitHubs and forums and have received no answers, so I'm trying here, since I expect a lot of people who use cryptofeed to use Influx too, and for the exact same application as me.
I'm querying BitMEX tick data to process and aggregate it. These are my query times (number of rows on the left, query time in hours, minutes, and seconds on the right), where every query is 1 month of data:
9480222, 2:07:46
12839124, 3:06:02
17256737, 4:19:54
13716707, 3:28:37
12671435, 2:35:27
11112483, 2:15:53
17055181, 3:34:21
21232810, 6:29:42
16935780, 4:47:56
Those numbers seem a bit off. The average is around 60-70k rows per minute, i.e. roughly 1k rows per second. Since this is my first experience with time-series databases and with Influx, would you consider this performance normal? Do you guys have roughly the same query times? I'm running InfluxDB 1.7.9, Influx Python client 5.2.3, and Python 3.7; this was run from PyCharm on a MacBook Pro with 16 GB RAM. I don't use any add-on like Telegraf or Kapacitor.
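For reference, a quick script that turns the timing table above into a rows-per-second figure (pure arithmetic on the numbers already posted; it confirms the roughly 1k rows/sec estimate):

```python
# Throughput implied by the posted query times.
rows = [9480222, 12839124, 17256737, 13716707, 12671435,
        11112483, 17055181, 21232810, 16935780]
times = ["2:07:46", "3:06:02", "4:19:54", "3:28:37", "2:35:27",
         "2:15:53", "3:34:21", "6:29:42", "4:47:56"]

def seconds(hms: str) -> int:
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

for r, t in zip(rows, times):
    print(f"{r:>10,} rows  {t:>8}  ->  {r / seconds(t):5.0f} rows/sec")
```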