TileDB-Inc / TileDB-R

R interface to TileDB: The Modern Database
https://tiledb-inc.github.io/TileDB-R

Store OHLCV financial data #463

Closed · MislavSag closed this issue 2 years ago

MislavSag commented 2 years ago

Hello,

I discovered TileDB while browsing for good options for storing financial datasets.

I have a simple goal to start with: save 1-minute OHLCV financial data on AWS using TileDB (and compare performance with an Azure Blob / R combination).

Here is my attempt after reading the documentation for a few hours:

# load required packages
library(data.table)  # fread() and := for column assignment
library(lubridate)   # force_tz()
library(tiledb)      # tiledb_config(), tiledb_ctx(), fromDataFrame(), tiledb_array()

# import data
blob_saas <- "https://contentiobatch.blob.core.windows.net/equity-usa-minute-trades-fmplcoud-adjusted/aapl/20220612T201404Z-1e18e/aapl.csv?sp=r&st=2022-09-03T22:01:10Z&se=2022-09-11T06:01:10Z&sv=2021-06-08&sr=b&sig=kyuULXsWMUM%2Bq8gDHnvY9RKA29hYgOKaYz9ICjVZdYE%3D"
s <- Sys.time()
aapl <- data.table::fread(blob_saas)
e <- Sys.time()
e - s
aapl[, symbol := "AAPL"]
aapl$datetime <- force_tz(aapl$datetime, "EST") # how TileDB handles timezones ???

# configure s3
config <- tiledb_config()
config["vfs.s3.aws_access_key_id"] <- "xxx"
config["vfs.s3.aws_secret_access_key"] <- "xxx"
config["vfs.s3.region"] <- "us-east-1"
context_with_config <- tiledb_ctx(config)

# create sparse (or better dense ???) array from df
url <- "s3://equity-us-tick/aapl"
fromDataFrame(
  obj = aapl,
  uri = url,
  col_index = c("symbol", "datetime"),
  sparse = TRUE,
  allows_dups = FALSE # doesn't allow duplicates of AAPL, datetime pair ?
)

# read data
arr <- tiledb_array(url, as.data.frame=TRUE)
s <- Sys.time()
newdf <- arr[]
e <- Sys.time()
e - s

When I execute the code, it works. But I would kindly ask for feedback: is this a good approach to storing financial data?

The code above takes 1.08 minutes to retrieve the CSV from Azure Blob and 24 seconds to retrieve the array from S3.

eddelbuettel commented 2 years ago

Hi @MislavSag -- that is a very reasonable first approach, and in effect close to what we did in last year's tutorial at the annual useR! conference. My slides are still at e.g. this link (and my talks page reminds me there is also a video). We show an example of one-minute bars from a public data set on pages 70 to 75 that is very similar to what you do.

One thing to consider is to set an explicit range from 'as far back as you may need' to 'as far forward as you may expect' so that you can expand by appending future datetime values. (Character values such as the ticker have no such limit.) That way you can grow the array and index by datetime and/or ticker -- all in one array without having to worry about partitions.
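
As a quick, untested sketch of what indexing a read by ticker could look like -- assuming the selected_ranges() setter and the column names from your example above:

# sketch: restrict a read to one ticker by setting a range on the 'symbol' dimension
arr <- tiledb_array(url, as.data.frame = TRUE)
selected_ranges(arr) <- list(symbol = cbind("AAPL", "AAPL"))  # lower / upper bound
aapl_only <- arr[]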

There are other possible refinements you can make, and it would get more complicated with truly tick-by-tick data. We have enhancements in the works, so we can follow up here or on Slack. Other possible enhancements are different tile and query layouts, as well as filters -- you can experiment. Lastly, one thing I would change now is allows_dups=TRUE as it will be faster, even if you likely have no duplicates over <datetime, ticker> as your indices.
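
To make that concrete, here is a rough sketch of the write call with an explicit datetime domain and allows_dups=TRUE; the end date is only a placeholder to pick as you see fit, and the layout and filter arguments shown (here just the defaults) are the knobs you could experiment with:

# sketch: explicit datetime domain so future bars can be appended later
fromDataFrame(
  obj = aapl,
  uri = url,
  col_index = c("symbol", "datetime"),
  sparse = TRUE,
  allows_dups = TRUE,                     # faster writes; duplicates unlikely anyway
  tile_domain = list(datetime = c(as.POSIXct("1970-01-01 00:00:00"),
                                  as.POSIXct("2029-12-31 23:59:59"))),
  cell_order = "COL_MAJOR",               # tile and query layout knobs to experiment with
  tile_order = "COL_MAJOR",
  filter = "ZSTD"                         # compression filter, also worth experimenting with
)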

MislavSag commented 2 years ago

Thanks for the fast response. I have checked the slides. I see there is an additional argument for the range:

tile_domain = list(Datetime = c(as.POSIXct("1970-01-01 00:00:00"), Sys.time()))

I suppose Sys.time() should be some date far in the future (like as.POSIXct("2500-01-01 00:00:00")). I have just started with TileDB, so I will experiment with layouts and filters.

Yes, tick data will be more complicated. It is on my TODO list, but for later this year. I plan to download data from DukasCopy and Finam and use TileDB if I am happy with it.

eddelbuettel commented 2 years ago

Yes, that is the simple variant: 'back in the past' until 'right now' by using the system time. If you wanted to be more forward-looking, the second argument could be, say, as.POSIXct("2029-12-31 23:59:59") or some other date.
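
And, just as a sketch of the 'grow by appending' part -- assuming the mode = "append" option of fromDataFrame and a hypothetical new_bars data.table with the same columns as aapl:

# sketch: append a new batch of one-minute bars into the existing array
# ('new_bars' is hypothetical and must have the same columns as the original data)
fromDataFrame(
  obj = new_bars,
  uri = url,
  col_index = c("symbol", "datetime"),
  mode = "append"
)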

Feel free to also join us on the community Slack via this link (it is also at the bottom of the home page).

MislavSag commented 2 years ago

I have joined the Slack. I will continue with further questions there. Thanks!