ankane / ruby-polars

Blazingly fast DataFrames for Ruby
MIT License

Low memory batched csv reader issue #59

Open jvdp opened 5 months ago

jvdp commented 5 months ago

Hi,

I've been trying to get read_csv_batched to work, but I keep running into odd issues. While searching for a minimal reproduction, I've found the following so far:

def test_read_csv_batched_low_memory
  reader1 = Polars.read_csv_batched("test/support/data.csv", batch_size: 1)
  reader2 = Polars.read_csv_batched("test/support/data.csv", batch_size: 1, low_memory: true)
  assert_equal reader1.next_batches(10).sum(&:count), reader2.next_batches(10).sum(&:count)
end

Which results in:

  1) Failure:
CsvTest#test_read_csv_batched_low_memory [test/csv_test.rb:79]:
Expected: 3
  Actual: 2

(I'm also running into issues when providing dtypes, but I haven't narrowed that down yet.)

ankane commented 5 months ago

Hi @jvdp, thanks for reporting. It looks like the Python library has the same behavior.

import polars as pl

reader1 = pl.read_csv_batched('test/support/data.csv', batch_size=1)
print(reader1.next_batches(10))

reader2 = pl.read_csv_batched('test/support/data.csv', batch_size=1, low_memory=True)
print(reader2.next_batches(10))

We could filter out the empty batches, but it would still return fewer rows than expected, so I think it probably needs to be addressed upstream.
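For reference, the filtering approach mentioned above could be sketched on the caller's side as below. This is only an illustration: `StubReader` is a hypothetical stand-in built on plain arrays that mimics the observed low_memory behavior (emitting empty batches), not the real Polars reader, and as noted it cannot recover rows the reader never emits.

```ruby
# Hypothetical stand-in mimicking a batched reader that can emit
# empty batches (as observed with low_memory: true). NOT the real
# Polars reader.
class StubReader
  def initialize(batches)
    @batches = batches
  end

  # Returns up to n batches, or nil when exhausted
  # (mirroring next_batches semantics).
  def next_batches(n)
    return nil if @batches.empty?
    @batches.shift(n)
  end
end

# Drain the reader, dropping empty batches along the way.
def non_empty_batches(reader, chunk = 10)
  result = []
  while (batches = reader.next_batches(chunk))
    result.concat(batches.reject(&:empty?))
  end
  result
end

batches = non_empty_batches(StubReader.new([[1], [], [2]]))
# batches => [[1], [2]] -- the empty batch is filtered out, but if
# low_memory silently drops rows upstream, no amount of client-side
# filtering can bring them back.
```

This hides the empty batches from callers, but the total row count would still be short whenever the reader itself loses rows, which is why an upstream fix is needed.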

ankane commented 5 months ago

Looks like it's been reported here: https://github.com/pola-rs/polars/issues/9577

jvdp commented 5 months ago

Ah, interesting, thanks for digging that one up! Will keep an eye on it.