Closed: cs-cordero closed this pull request 7 years ago
Not sure of the implications of this so will defer to others, but nice work! 👏
Nice work. The bug appears to have been introduced by this enhancement in pandas: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#read-csv-will-progressively-enumerate-chunks
> **read_csv will progressively enumerate chunks**
>
> When `read_csv()` is called with `chunksize=n` and without specifying an index, each chunk used to have an independently generated index from 0 to n-1. They are now given instead a progressive index, starting from 0 for the first chunk, from n for the second, and so on, so that, when concatenated, they are identical to the result of calling `read_csv()` without the `chunksize=` argument (GH12185).
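A minimal sketch of the new behavior (the CSV contents here are made up for illustration):

```python
import io

import pandas as pd

# Four data rows, read in chunks of two.
buffer = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n")
chunks = list(pd.read_csv(buffer, chunksize=2))

print(chunks[0].index)  # RangeIndex(start=0, stop=2, step=1)
print(chunks[1].index)  # RangeIndex(start=2, stop=4, step=1) -- progressive, not 0..1
```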
Currently in `master`, there is a bug that occurs when you want to read a CSV in chunks rather than loading everything into memory all at once. There is an erroneous line of code that renumbers the row index based on the `chunksize`, but pandas already keeps track of row indexes correctly; a hypothetical sketch of the pattern appears below.

I also increased the number of test rows in the relevant `pytest`. If the function is meant to read large CSVs in chunks, it makes sense to use a little more data and a `chunksize` greater than 1.

Closes #98.
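For context, the removed logic followed roughly this pattern (a hypothetical reconstruction; the function and variable names are mine, not the project's actual code):

```python
import pandas as pd

def read_csv_in_chunks(path, chunksize):
    """Yield DataFrame chunks from a large CSV without loading it whole."""
    for n, chunk in enumerate(pd.read_csv(path, chunksize=chunksize)):
        # Erroneous pattern: shifting the index assumed each chunk started at
        # 0, as it did before pandas 0.18. With the progressive enumeration
        # quoted above, the shift gets applied on top of an already-correct
        # index, so rows end up misnumbered.
        # chunk.index += n * chunksize  # <- the kind of line this PR removes
        yield chunk  # pandas already numbers the rows correctly
```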
(Also, thanks @afeld for helping me out with understanding some of the `sqlalchemy` and setup shenanigans at the Hacker Hours event earlier this evening.)