18F / autoapi

A basic spreadsheet-to-API engine

Fix error in chunk table loader #99

Closed · cs-cordero closed 7 years ago

cs-cordero commented 7 years ago

Currently in master, there is a bug that occurs when you read a CSV in chunks rather than loading everything into memory at once. An erroneous line of code renumbers the row index based on the chunksize, but pandas already tracks row indexes correctly across chunks.
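A minimal sketch of the pattern in question (the function name is hypothetical, not the actual autoapi code), showing why the manual renumbering is unnecessary:

```python
import pandas as pd

def load_csv_in_chunks(path, chunksize=1000):
    """Yield DataFrame chunks; pandas assigns progressive row indexes itself."""
    for chunk in pd.read_csv(path, chunksize=chunksize):
        # No manual renumbering needed: each chunk's index continues from
        # where the previous chunk left off, e.g. chunk 1 -> 0..chunksize-1,
        # chunk 2 -> chunksize..2*chunksize-1.
        # A line like the following would double-offset the index:
        #   chunk.index += i * chunksize   # erroneous renumbering
        yield chunk
```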

I also increased the number of test rows in the relevant pytest. If the function is meant to read large CSVs in chunks, it makes sense to exercise it with a little more data and a chunksize greater than 1.
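A hypothetical test in that spirit (not the repo's actual pytest; names and data invented), using more rows than the chunksize so chunk boundaries are actually exercised:

```python
import io
import pandas as pd

def test_chunked_load_preserves_row_order():
    # 100 rows read with chunksize=7 -> 15 chunks, crossing many boundaries.
    csv = io.StringIO("value\n" + "\n".join(str(i) for i in range(100)))
    chunks = list(pd.read_csv(csv, chunksize=7))
    combined = pd.concat(chunks)
    # The concatenated index and values should match a single full read.
    assert combined.index.tolist() == list(range(100))
    assert combined["value"].tolist() == list(range(100))
```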

Closes #98.

(also thanks @afeld for helping me out with understanding some of the sqlalchemy and setup shenanigans at the Hacker Hours event earlier this evening)

afeld commented 7 years ago

Not sure of the implications of this so will defer to others, but nice work! 👏

vrajmohan commented 7 years ago

Nice work. The bug appears to have been introduced by this enhancement in pandas: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#read-csv-will-progressively-enumerate-chunks

    read_csv will progressively enumerate chunks

    When read_csv() is called with chunksize=n and without specifying an index, each chunk used to have an independently generated index from 0 to n-1. They are now given instead a progressive index, starting from 0 for the first chunk, from n for the second, and so on, so that, when concatenated, they are identical to the result of calling read_csv() without the chunksize= argument (GH12185).
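For illustration, a small self-contained snippet (not from the autoapi codebase) demonstrating the progressive enumeration described above:

```python
import io
import pandas as pd

csv = io.StringIO("a\n" + "\n".join(str(i) for i in range(6)))
chunks = list(pd.read_csv(csv, chunksize=2))
# With the enhancement, chunk indexes are progressive rather than
# each restarting at 0:
# chunks[0].index -> [0, 1]
# chunks[1].index -> [2, 3]
# chunks[2].index -> [4, 5]
print(pd.concat(chunks).index.tolist())  # [0, 1, 2, 3, 4, 5]
```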