kenorb closed this issue 8 years ago
@LemonBoy sorry, I did not see on Bountysource that you are already working on a solution. Mine solves it from a different angle; we can combine both. After some research I see I should rewrite mine to use numpy.memmap instead of an in-memory list, as my current solution may raise out-of-memory errors on extra-large input files or low-memory systems.
One of the conversion routines collapses the whole set of CSV files into a single one; I doubt it's a wise idea to abuse the memory so much (also, Travis might not like that memory usage :). The profiler shows that the hottest points are where the string is split and where the timestamp is parsed; the latter has been optimized as much as I could, and the former is quite optimized by itself. A two-fold speedup when executed on a beefy machine could be enough to stay under the 50-minute mark.
It would be best to combine our solutions: your optimization of timestamp parsing, and mine of parsing each timestamp only once (not once per timeframe like before). I also wonder about memory usage; it may become a problem with even larger input files. Best would be to parse the input file once and store the parsed results in a numpy.memmap temporary file. I will finish this solution tomorrow.
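The numpy.memmap idea above could be sketched roughly as follows. This is a minimal sketch, not the actual patch; the record layout (`timestamp`, `bid`, `ask`, `volume`) and the helper name `parse_ticks_to_memmap` are assumptions for illustration:

```python
import os
import tempfile

import numpy as np

# Hypothetical record layout for one parsed tick.
tick_dtype = np.dtype([
    ("timestamp", "u8"),  # Unix timestamp in seconds
    ("bid", "f8"),
    ("ask", "f8"),
    ("volume", "f8"),
])

def parse_ticks_to_memmap(rows, path):
    """Store parsed rows in a disk-backed array so huge inputs don't
    exhaust RAM; the OS pages the data in and out as needed."""
    mm = np.memmap(path, dtype=tick_dtype, mode="w+", shape=(len(rows),))
    for i, (ts, bid, ask, vol) in enumerate(rows):
        mm[i] = (ts, bid, ask, vol)
    mm.flush()  # make sure the data hits the backing file
    return mm

# Tiny demonstration with two synthetic ticks.
rows = [(1136073600, 0.9463, 0.9464, 1.0),
        (1136073601, 0.9464, 0.9465, 2.0)]
path = os.path.join(tempfile.mkdtemp(), "ticks.dat")
mm = parse_ticks_to_memmap(rows, path)
print(float(mm["bid"][0]))  # -> 0.9463
```

Each timeframe conversion could then iterate over `mm` instead of re-parsing the CSV, which is the "parse once" part of the plan.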
@LemonBoy I think something got broken, as the conversion now finishes in 3 minutes and the data isn't right (the files are empty).
It looks like it's failing silently and producing empty files. It seems to work when reset to 6656d9ce6d0ac57acd1efc0107aec3cf3c0cd7c8, since that takes a lot more time.
So I've moved the performance fixes into a separate performance branch for further testing and improvements.
Tested PR #51; the full conversion time dropped from 3h to 45 minutes. I'll do some further testing on CI after the merge.
CI test results based on the 1-month data conversion:

- Before: Elapsed time 24 min 44 sec
- After: Elapsed time 12 min 56 sec
@Kostafun CI build fails on conversion of full year of data with:
/home/travis/build.sh: fork: Cannot allocate memory
Increased kernel.shmmax/kernel.shmall, but it didn't help.
Is there any way of loading only part of the file into memory, or reading it in chunks of a certain size?
Or maybe compressing the data in memory would help?
PR #57 introduced a different method to prevent the out-of-memory issue on CI, with the same timing as when reading the file from memory.
Tested the process on OS X for both conversion methods as described in the ticket; the completion time was 56 minutes instead of over 3h, a roughly 3x speedup.
Since that still wasn't enough to run both conversions (FXT & HST) under the 50-minute limit, I've split the test to convert the formats separately: HST took 26 minutes and FXT 41 minutes (for the whole year of data), so both tests pass.
Therefore I'll close the ticket, unless you've anything to add. The bounty would need to be split, if that's OK, plus an extra $10 due to some previous confusion. Thanks for your work.
PR #57 reverts the changes introduced by @Kostafun (and it leaves out some print statements since I was lazy, as you can notice from the commit message) and is a simple rebase of the original changeset I proposed. It uses mmap to read the file into memory, leaving part of the work to the OS, but that's just a marginal speedup; the real one comes from the optimized timestamp parser and the more compact pack. It might be possible to shave off some more time by shuffling some code around, though.
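The two techniques mentioned above (mmap-backed reading and a slice-based timestamp parser) can be sketched as follows. This is a hedged illustration, not the actual PR code: the `'YYYY.MM.DD HH:MM:SS'` layout and the helper names are assumptions:

```python
import calendar
import mmap

def fast_timestamp(s):
    """Parse a fixed-layout 'YYYY.MM.DD HH:MM:SS' string (the exact CSV
    layout here is an assumption) into a Unix timestamp by slicing,
    which is far cheaper than datetime.strptime."""
    return calendar.timegm((
        int(s[0:4]), int(s[5:7]), int(s[8:10]),       # year, month, day
        int(s[11:13]), int(s[14:16]), int(s[17:19]),  # hour, minute, second
        0, 0, 0,
    ))

def iter_csv_lines(path):
    """Iterate over the file through mmap, letting the OS handle paging
    instead of holding the whole file in a Python-level buffer."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for line in iter(mm.readline, b""):
                yield line.rstrip(b"\r\n")

print(fast_timestamp("2014.01.01 00:00:00"))  # -> 1388534400
```

The slicing parser works only because the timestamp column has a fixed width; a format change in the input CSV would silently break it, which is the usual trade-off for this kind of optimization.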
Currently, conversion of CSV data from only one year (800M) into tick data format takes around 1 hour to complete (for all timeframes).
This causes CI builds to fail, because the 30-minute timeout is reached while converting the data from here. This one failed after the maximum time of 50 minutes.
Testing:
I didn't count, but it's ~3h on MBP (OS X).
The goal is to reduce the conversion time of the data mentioned above, so it completes in reasonable time (e.g. less than half an hour if that's achievable, otherwise max. 50 min), so CI won't fail (max time is 50 minutes).
File:
convert_csv_to_mt.py
Test:
which makes ~3.11h
Result files: