billpwchan / futu_algo

Futu Algorithmic Trading Solution (Python): a quantitative trading program built on the Futu OpenAPI
Apache License 2.0

Use Parquet instead of CSV for better performance #38

Closed · billpwchan closed this 2 years ago

billpwchan commented 2 years ago
  1. Refactor the code so that all I/O-related operations are encapsulated in data_engine.
  2. Convert CSV to Parquet, with roughly a 2x saving in file size... amazing.
  3. Create a utility function so that people who have already downloaded historical data in CSV format can easily convert all of it to Parquet (a sketch follows below).
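A minimal sketch of such a conversion utility, assuming pandas with pyarrow (or fastparquet) installed. The directory layout and function name here are illustrative assumptions, not the repo's actual data_engine API:

```python
from pathlib import Path

import pandas as pd


def convert_csv_to_parquet(data_dir: str = "data") -> None:
    """Convert every CSV file under data_dir to Parquet, keeping the same stem."""
    # NOTE: "data" is a hypothetical directory name used for illustration.
    for csv_path in Path(data_dir).rglob("*.csv"):
        parquet_path = csv_path.with_suffix(".parquet")
        # Read the historical data and rewrite it in compressed Parquet form.
        df = pd.read_csv(csv_path)
        df.to_parquet(parquet_path, index=False)  # requires pyarrow or fastparquet
        print(f"Converted {csv_path} -> {parquet_path}")


if __name__ == "__main__":
    convert_csv_to_parquet()
```

Running it once over the existing download directory leaves the original CSV files in place, so the conversion can be verified before deleting them.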
billpwchan commented 2 years ago

Fixed in e8808cc0f8ae971042eeb44cb6dc7a80cad4bbde

billpwchan commented 2 years ago

A quick comparison between CSV and Parquet:

| CSV | Parquet |
| --- | --- |
| Row-based storage format. | A hybrid of row-based and column-based storage formats. |
| Consumes a lot of space, as no default compression is available; a 1 TB file occupies a full 1 TB on Amazon S3 or any other cloud store. | Compresses data on write, consuming far less space; the same 1 TB of data stored as Parquet takes up only about 130 GB. |
| Queries run slowly because of row-based scanning: every row must be retrieved to read a single column. | Queries run about 34 times faster thanks to column-based storage and embedded metadata. |
| More data has to be scanned per query. | About 99% less data is scanned per query, optimizing performance. |
| Most storage services charge by space used, so CSV means higher storage costs. | Lower storage cost, as data is stored in a compressed, encoded format. |
| File schema must be either inferred (error-prone) or supplied (tedious). | File schema is stored in the metadata. |
| Suitable only for simple data types. | Suitable even for complex types such as nested schemas, arrays, and dictionaries. |

Credit: https://geekflare.com/parquet-csv-data-storage/
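To see the size and column-pruning differences concretely, here is a rough, self-contained benchmark sketch using pandas. The sample frame is synthetic (not actual Futu K-line data), and exact sizes and timings will vary by machine and dataset:

```python
import os
import time

import numpy as np
import pandas as pd

# Synthetic frame standing in for downloaded historical K-line data.
n = 1_000_000
df = pd.DataFrame({
    "time_key": pd.date_range("2020-01-01", periods=n, freq="min"),
    "open": np.random.rand(n),
    "close": np.random.rand(n),
    "volume": np.random.randint(0, 10_000, n),
})

df.to_csv("sample.csv", index=False)
df.to_parquet("sample.parquet", index=False)  # requires pyarrow or fastparquet

print(f"CSV size:     {os.path.getsize('sample.csv') / 1e6:.1f} MB")
print(f"Parquet size: {os.path.getsize('sample.parquet') / 1e6:.1f} MB")

# Column pruning: CSV must scan every row to extract one column,
# while Parquet reads just the requested column chunk.
start = time.perf_counter()
pd.read_csv("sample.csv", usecols=["close"])
print(f"CSV read:     {time.perf_counter() - start:.2f} s")

start = time.perf_counter()
pd.read_parquet("sample.parquet", columns=["close"])
print(f"Parquet read: {time.perf_counter() - start:.2f} s")
```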