akielaries / openGPMP

Hardware Accelerated General Purpose Mathematics Package
https://akielaries.github.io/openGPMP/
MIT License

DataTable file readers should utilize memory mapping #81

Closed akielaries closed 11 months ago

akielaries commented 1 year ago

Mapping files into memory tends to speed up IO considerably. A simple benchmark on the 1m.csv file (1 million lines) took ~2k ms with the current DataTable.csv_read(), compared to ~200 ms with Pandas. After experimenting with memory mapping, we got the read down to just under 200 ms. I will keep working with this technique so we can match Pandas' ability to infer the native datatype of each column.
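For reference, a minimal sketch of the memory-mapping idea on a POSIX system. This is not the actual DataTable code: `MappedFile` and `count_lines` are hypothetical names made up for illustration, and the sketch assumes `mmap` is available (Linux/macOS), so a Windows build would need a different path.

```cpp
#include <fcntl.h>    // open
#include <sys/mman.h> // mmap, munmap
#include <sys/stat.h> // fstat
#include <unistd.h>   // close

#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical helper (not part of openGPMP): read-only mapping of a file.
class MappedFile {
  public:
    explicit MappedFile(const std::string &path) {
        int fd = ::open(path.c_str(), O_RDONLY);
        if (fd == -1) {
            throw std::runtime_error("cannot open " + path);
        }
        struct stat st {};
        if (::fstat(fd, &st) == -1) {
            ::close(fd);
            throw std::runtime_error("cannot stat " + path);
        }
        size_ = static_cast<std::size_t>(st.st_size);
        if (size_ > 0) { // mmap rejects zero-length mappings
            void *p = ::mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p == MAP_FAILED) {
                ::close(fd);
                throw std::runtime_error("cannot mmap " + path);
            }
            data_ = static_cast<const char *>(p);
        }
        ::close(fd); // the mapping stays valid after the descriptor is closed
    }

    ~MappedFile() {
        if (data_ != nullptr) {
            ::munmap(const_cast<char *>(data_), size_);
        }
    }

    MappedFile(const MappedFile &) = delete;
    MappedFile &operator=(const MappedFile &) = delete;

    // Non-owning view of the file contents; valid while this object lives.
    std::string_view view() const { return std::string_view(data_, size_); }

  private:
    const char *data_ = nullptr;
    std::size_t size_ = 0;
};

// Count newline characters by scanning the mapped bytes directly,
// without copying the file into a std::string first.
std::size_t count_lines(std::string_view text) {
    return static_cast<std::size_t>(
        std::count(text.begin(), text.end(), '\n'));
}
```

A reader built on this would construct `MappedFile f("1m.csv")` and split `f.view()` on `'\n'` and `','` into `std::string_view` slices, so cells are only copied or converted once their type is known.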

Pandas seems to use try/except-style logic, attempting to convert values to int first. We currently do something similar, along with regular-expression checks to determine types. This likely isn't adequate and should be overhauled with a more robust algorithm for our case of "pattern matching".
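One possible direction, sketched below, is to replace the regex checks with a parse-and-fall-back chain: try integer, then floating point, otherwise treat the cell as a string, and widen the column type as cells are scanned. This is only an illustration of the idea, not how Pandas or DataTable actually does it; `ColType`, `infer_cell`, and `infer_column` are hypothetical names.

```cpp
#include <algorithm>
#include <charconv>
#include <cstdint>
#include <cstdlib>
#include <string>
#include <string_view>
#include <system_error>
#include <vector>

// Hypothetical column types, ordered from narrowest to widest so that
// std::max picks the type able to hold every cell seen so far.
enum class ColType { Int, Double, String };

// Classify one cell: integer first, then floating point, else string.
ColType infer_cell(std::string_view cell) {
    if (cell.empty()) {
        return ColType::String; // assumption: empty cells are not numeric
    }
    const char *first = cell.data();
    const char *last = first + cell.size();

    // The whole cell must be consumed for it to count as an integer.
    std::int64_t as_int = 0;
    auto [ptr, ec] = std::from_chars(first, last, as_int);
    if (ec == std::errc() && ptr == last) {
        return ColType::Int;
    }

    // strtod needs a null-terminated buffer, so copy the cell.
    // Note: strtod is locale-dependent and also accepts "inf", "nan",
    // hex floats, and a leading '+' or leading whitespace.
    std::string tmp(cell);
    char *end = nullptr;
    std::strtod(tmp.c_str(), &end);
    if (end == tmp.c_str() + tmp.size()) {
        return ColType::Double;
    }
    return ColType::String;
}

// Widen the column type cell by cell; stop early once it is String.
// An empty column is reported as Int in this sketch.
ColType infer_column(const std::vector<std::string_view> &cells) {
    ColType col = ColType::Int;
    for (std::string_view cell : cells) {
        col = std::max(col, infer_cell(cell));
        if (col == ColType::String) {
            break;
        }
    }
    return col;
}
```

Because the functions take `std::string_view`, they can run directly on slices of a memory-mapped buffer. A real implementation would still need decisions on missing values, booleans, dates, and sampling only the first N rows instead of the full column.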

akielaries commented 12 months ago

In addition to the built-in DataTable class, I will look into supporting the C++ DataFrame project: https://github.com/hosseinmoein/DataFrame/tree/master