Loading files into memory tends to speed up IO by quite a bit. A simple benchmark on the 1m.csv file with 1 million lines took ~2,000 ms with the current DataTable.csv_read(), compared to ~200 ms for Pandas. After experimenting with memory mapping we got down to just under 200 ms. I will keep refining this technique so we can match Pandas' ability to infer the native datatypes by column.
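
As a rough illustration of the memory-mapping idea (this is not the DataTable implementation; the read_csv_mmap helper is a hypothetical sketch), mapping the file once and splitting it in memory avoids per-line buffered reads:

```python
import mmap

def read_csv_mmap(path):
    """Read a CSV by memory-mapping the file instead of streaming it
    through buffered IO. Returns the raw lines as strings."""
    with open(path, "rb") as f:
        # Map the whole file into memory; length 0 means "the entire file".
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            data = mm.read()  # one bulk read from the mapping
    # Decode once and split into lines, avoiding per-line read() calls.
    return data.decode("utf-8").splitlines()

# Example usage (assumes 1m.csv is present):
# lines = read_csv_mmap("1m.csv")
# header = lines[0].split(",")
# rows = (line.split(",") for line in lines[1:])
```
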
Pandas appears to use try/except-style conversion logic, starting by attempting to convert values to int. Right now we do something similar, combined with regular-expression checks, to determine types. This likely isn't adequate and should be overhauled with a more robust algorithm for our case of "pattern matching".
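
A minimal sketch of the try/except-plus-regex inference described above (the infer_type helper and its float regex are assumptions for illustration, not the code we currently ship):

```python
import re

# Matches values that look like floats, e.g. "3.14" or "-2.5e3".
_FLOAT_RE = re.compile(r"^-?\d+\.\d*(e-?\d+)?$", re.IGNORECASE)

def infer_type(values):
    """Guess a column's native type: try int first, fall back to float
    via a regex check, and give up to str if neither fits."""
    inferred = int
    for value in values:
        try:
            int(value)                  # first attempt: int conversion
        except ValueError:
            if _FLOAT_RE.match(value):  # regex check for float-like text
                inferred = float
            else:
                return str              # anything else: treat column as str
    return inferred

# infer_type(["1", "2", "3"])    -> int
# infer_type(["1", "2.5", "3"])  -> float
# infer_type(["1", "x", "3"])    -> str
```
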