FX31337 / FX-BT-Scripts

:page_facing_up: Useful scripts for backtesting.

convert_csv_to_mt.py takes ages to convert the data [$60 awarded] #37

Closed: kenorb closed this issue 8 years ago

kenorb commented 8 years ago

Currently, converting just one year of CSV data (800 MB) into the tick data format takes around 1 hour to complete (for all timeframes).

This makes CI builds fail, because the 30-minute timeout is reached while converting the data from here. This one failed after the maximum time of 50 minutes.

Testing:

git clone --single-branch --branch EURUSD-2014 https://github.com/FX31337/FX-BT-Data-EURUSD-DS.git
cd FX*
curl -o Makefile https://raw.githubusercontent.com/FX31337/FX-BT-Data-Test/master/Makefile
time make

I didn't time it exactly, but it's ~3h on an MBP (OS X).

The goal is to reduce the conversion time for the data mentioned above so that it completes in a reasonable time (e.g. under half an hour if that's achievable, otherwise 50 minutes at most), so CI won't fail (the CI limit is 50 minutes).

File: convert_csv_to_mt.py

Test:

$ time make
...
Done.

real    187m37.007s
user    185m2.853s
sys 1m7.004s

which is ~3.1h in total

Result files:

-rw-r--r--  1 kenorb 4.7M Mar 29 11:35 EURUSD1.hst.gz
-rw-r--r--  1 kenorb 2.0K Mar 29 12:43 EURUSD10080.hst.gz
-rw-r--r--  1 kenorb  80M Mar 29 14:19 EURUSD10080_0.fxt.gz
-rw-r--r--  1 kenorb 9.8K Mar 29 12:34 EURUSD1440.hst.gz
-rw-r--r--  1 kenorb  80M Mar 29 14:09 EURUSD1440_0.fxt.gz
-rw-r--r--  1 kenorb 477K Mar 29 11:55 EURUSD15.hst.gz
-rw-r--r--  1 kenorb  83M Mar 29 13:25 EURUSD15_0.fxt.gz
-rw-r--r--  1 kenorb  90M Mar 29 13:04 EURUSD1_0.fxt.gz
-rw-r--r--  1 kenorb  40K Mar 29 12:24 EURUSD240.hst.gz
-rw-r--r--  1 kenorb  80M Mar 29 13:58 EURUSD240_0.fxt.gz
-rw-r--r--  1 kenorb 257K Mar 29 12:04 EURUSD30.hst.gz
-rw-r--r--  1 kenorb  82M Mar 29 13:36 EURUSD30_0.fxt.gz
-rw-r--r--  1 kenorb  597 Mar 29 12:53 EURUSD43200.hst.gz
-rw-r--r--  1 kenorb  80M Mar 29 14:30 EURUSD43200_0.fxt.gz
-rw-r--r--  1 kenorb 1.3M Mar 29 11:45 EURUSD5.hst.gz
-rw-r--r--  1 kenorb  86M Mar 29 13:15 EURUSD5_0.fxt.gz
-rw-r--r--  1 kenorb 139K Mar 29 12:14 EURUSD60.hst.gz
-rw-r--r--  1 kenorb  81M Mar 29 13:47 EURUSD60_0.fxt.gz

---

The **[$60 bounty](https://www.bountysource.com/issues/32272156-convert_csv_to_mt-py-takes-ages-to-convert-the-data?utm_campaign=plugin&utm_content=tracker%2F20487492&utm_medium=issues&utm_source=github)** on this issue has been claimed at [Bountysource](https://www.bountysource.com/?utm_campaign=plugin&utm_content=tracker%2F20487492&utm_medium=issues&utm_source=github).
Kostafun commented 8 years ago

@LemonBoy sorry, I did not see on Bountysource that you are already working on a solution. Mine solves it from a different angle, so we could combine both. After some research I see I should rewrite mine to use numpy.memmap instead of an in-memory list, as my current solution may raise out-of-memory errors on extra large input files or low-memory systems.

LemonBoy commented 8 years ago

One of the conversion routines collapses the whole set of CSV files into a single one; I doubt it's a wise idea to abuse the memory so much (also, Travis might not like that memory usage :). The profiler shows that the hottest points are the string splitting and the timestamp parsing; the latter has been optimized as much as I could, and the former is already quite optimized by itself. A two-fold speedup, when executed on a beefy machine, could be enough to stay under the 50-minute mark.
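
For reference, here is a minimal sketch of the kind of timestamp fast path being described, assuming the fixed `YYYY.MM.DD HH:MM:SS` layout of the CSV input; it is not the exact code from the changeset:

```python
# Rough illustration (not the PR code): replace datetime.strptime() with manual
# slicing of a fixed-format "YYYY.MM.DD HH:MM:SS" timestamp, avoiding the
# overhead of a general-purpose format parser on every tick.
import calendar

def parse_timestamp_fast(ts):
    """Parse 'YYYY.MM.DD HH:MM:SS' into a Unix timestamp without strptime."""
    return calendar.timegm((
        int(ts[0:4]),    # year
        int(ts[5:7]),    # month
        int(ts[8:10]),   # day
        int(ts[11:13]),  # hour
        int(ts[14:16]),  # minute
        int(ts[17:19]),  # second
        0, 0, 0,         # weekday/yearday/DST fields are ignored by timegm
    ))
```

On CPython, slicing plus `int()` like this is typically several times faster than `datetime.strptime()` for fixed-width timestamps.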

Kostafun commented 8 years ago

It would be best to combine our solutions: your optimization of the timestamp parsing, and mine of parsing each timestamp only once (not once per timeframe like before). I also wonder about memory usage; it may become a problem only with even larger input files. The best approach would be to parse the input file once and store the parsed results in a numpy.memmap temporary file, as sketched below. I will finish this solution tomorrow.
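
A minimal sketch of the numpy.memmap idea, assuming ticks are stored as (timestamp, bid, ask) float64 triples; the dtype, shape, and helper name are illustrative only, not taken from the actual script:

```python
# Illustrative only: store parsed ticks in a disk-backed numpy.memmap instead of
# an in-memory list, so very large inputs don't exhaust RAM. The (timestamp,
# bid, ask) float64 layout is an assumption for this sketch.
import tempfile
import numpy as np

def make_tick_store(n_ticks):
    """Create a disk-backed array with room for n_ticks parsed ticks."""
    tmp = tempfile.NamedTemporaryFile(suffix='.memmap', delete=False)
    return np.memmap(tmp.name, dtype=np.float64, mode='w+', shape=(n_ticks, 3))

# Parse the CSV once, write each tick into the memmap, then reuse the same
# array for every timeframe instead of re-parsing the CSV per timeframe.
ticks = make_tick_store(1_000_000)
ticks[0] = (1396087200.0, 1.3791, 1.3793)  # (Unix timestamp, bid, ask)
ticks.flush()                              # persist the data to the backing file
```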

kenorb commented 8 years ago

@LemonBoy I think something got broken, as the conversion took only 3 minutes and the data isn't right (the files are empty).

It looks like it's failing silently and producing empty files. It seems to work when reset to 6656d9ce6d0ac57acd1efc0107aec3cf3c0cd7c8, since the conversion takes a lot more time again.

So I've moved the performance fixes into a separate performance branch for further testing and improvements.

kenorb commented 8 years ago

Tested PR #51: the full conversion dropped from ~3h to 45 minutes. I'll do some further testing on CI after the merge.

CI test results based on the 1-month data conversion:

Before: Elapsed time 24 min 44 sec
After: Elapsed time 12 min 56 sec

kenorb commented 8 years ago

@Kostafun the CI build fails on conversion of the full year of data with:

/home/travis/build.sh: fork: Cannot allocate memory

I increased kernel.shmmax/kernel.shmall, but it didn't help.

Is there any way of loading only part of the file into memory? Or reading it in chunks of a certain size?

Or maybe compressing the data in memory would help?
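
Not sure if this matches what the script does internally, but as a rough sketch of the "read in chunks" idea, assuming a CSV with one tick per line and a hypothetical process_tick() callback:

```python
# Rough sketch only: read the CSV in bounded batches instead of loading it all
# at once, so peak memory stays proportional to chunk_rows. process_tick() and
# chunk_rows are hypothetical placeholders, not part of convert_csv_to_mt.py.
import csv

def convert_in_chunks(path, process_tick, chunk_rows=100_000):
    """Stream the CSV and hand rows to process_tick() one bounded batch at a time."""
    with open(path, newline='') as f:
        batch = []
        for row in csv.reader(f):
            batch.append(row)
            if len(batch) >= chunk_rows:
                for tick in batch:
                    process_tick(tick)
                batch.clear()      # keep only one chunk in memory at a time
        for tick in batch:         # flush the final partial chunk
            process_tick(tick)
```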

kenorb commented 8 years ago

PR #57 introduced a different method to prevent the out-of-memory issue on CI, with the same timing as when reading the whole file from memory.

I tested the process on OS X for both conversion methods as described in the ticket; the completion time was 56 minutes instead of over 3h, giving a ~3x speed-up.

Since 50 minutes still wasn't enough for both conversions (FXT & HST) together, I've split up the test to convert the formats separately: HST took 26 minutes and FXT 41 minutes (for the whole-year data), so both tests now pass.

Therefore I'll close the ticket, unless you have anything to add. The bounty would need to be split, if that's OK, plus an extra $10 due to some previous confusion. Thanks for your work.

LemonBoy commented 8 years ago

PR #57 reverts the changes introduced by @Kostafun (and it leaves out some print statements since I was lazy, as you can notice from the commit message) and is a simple rebase of the original changeset I proposed. It uses mmap to read the file into memory and leave some of the work to the OS, but that's just a marginal speedup; the real one comes from the optimized timestamp parser and the more compact pack. It might be possible to shave off some more time by shuffling some code around, though.
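
For context, a rough sketch of those two ideas (mmap-backed reading plus a pre-compiled struct for compact records), with an illustrative record layout rather than the real FXT/HST binary format:

```python
# Illustrative sketch, not the actual PR code: map the input file with mmap so
# the OS pages data in lazily, and pack records with a pre-compiled struct.
# The '<i2d' (timestamp, bid, ask) layout is an assumption, not the FXT/HST format.
import mmap
import struct

def iter_lines_mmap(path):
    """Yield raw lines from the file via mmap instead of buffered reads."""
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for line in iter(mm.readline, b''):
                yield line

# A pre-compiled Struct avoids re-parsing the format string on every call.
TICK = struct.Struct('<i2d')

def pack_tick(timestamp, bid, ask):
    return TICK.pack(timestamp, bid, ask)
```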