BelgianBiodiversityPlatform / python-dwca-reader

🐍 A Python package to read Darwin Core Archive (DwC-A) files.
BSD 3-Clause "New" or "Revised" License

Using chunksize gives `TypeError: 'TextFileReader' object does not support item assignment` #106

Open nigelcharman opened 5 months ago

nigelcharman commented 5 months ago

We've been using python-dwca-reader with no problems loading about 13,000 occurrences. We now need to scale it up to load about 3.25 million occurrences.

Changing the code from:

        core_df = dwca.pd_read('occurrence.txt', parse_dates=True)

to:

        for chunk in dwca.pd_read('occurrence.txt', parse_dates=True, chunksize=10):
            ...

causes the error:

        ...
        for chunk in dwca.pd_read('occurrence.txt', parse_dates=True, chunksize=10):
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/homebrew/opt/asdf/installs/python/3.11.7/lib/python3.11/site-packages/dwca/read.py", line 209, in pd_read
        df[shorten_term(field['term'])] = field_default_value
        ~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    TypeError: 'TextFileReader' object does not support item assignment

Looking at gbif-alert, I see that you're using `enumerate(dwca)` rather than reading the file in chunks, so I'll give that a try.

nigelcharman commented 4 months ago

We're now using `enumerate(dwca)`, so we're in no rush to have this corrected. I'll leave the issue open, though, in case other people come across it.

niconoe commented 4 months ago

Note to self: this only happens when `chunksize` (and probably also the `iterator` parameter) is combined with a DwC-A that uses default values. With those parameters, `pd_read` gets a `TextFileReader` rather than a regular data frame, so the step that assigns each default value as a new column fails.
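The behaviour can be reproduced with pandas alone. The snippet below is a minimal sketch, not code from python-dwca-reader: the inline data, the `country` column and its `"Belgium"` value are made up for illustration. It shows that `read_csv` returns a `TextFileReader` when `chunksize` is given, that item assignment on that object raises `TypeError`, and that each chunk it yields is an ordinary `DataFrame` that does accept the assignment.

```python
import io

import pandas as pd

# Tiny stand-in for occurrence.txt (tab-separated, made-up values).
data = "id\tscientificName\n1\tParus major\n2\tPica pica\n3\tApus apus\n"

# Without chunksize, read_csv returns a DataFrame, so adding a column works.
df = pd.read_csv(io.StringIO(data), sep="\t")
df["country"] = "Belgium"  # hypothetical default value

# With chunksize, read_csv returns a TextFileReader (an iterator of DataFrames).
reader = pd.read_csv(io.StringIO(data), sep="\t", chunksize=2)
reader_type = type(reader).__name__

# Assigning a column on the reader itself is what triggers the TypeError.
try:
    reader["country"] = "Belgium"
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Each chunk, however, is a regular DataFrame and accepts the assignment.
chunk_sizes = []
for chunk in reader:
    chunk["country"] = "Belgium"
    chunk_sizes.append(len(chunk))
```
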

niconoe commented 4 months ago

After careful inspection I can't see any sane way to deal with this specific combination (pd_read returning TextFileReader objects because of its parameters and the DwC-A using default values).

I therefore decided to document the incompatibility and add a human-readable exception for that situation. This is also tested.
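As a rough illustration of what such a check could look like, here is a sketch under stated assumptions. The exception class `ChunkedDefaultValueError`, the helper `apply_default_value`, and the message wording are all hypothetical and are not taken from python-dwca-reader; only the pandas calls are real.

```python
import io

import pandas as pd


class ChunkedDefaultValueError(TypeError):
    """Hypothetical exception name, not taken from python-dwca-reader."""


def apply_default_value(df, column, value):
    """Add a constant column, failing early with a readable message.

    Hypothetical helper: it only illustrates the kind of check described
    above and is not the library's actual code.
    """
    if not isinstance(df, pd.DataFrame):
        raise ChunkedDefaultValueError(
            f"Cannot set default value for '{column}': got {type(df).__name__} "
            "instead of a DataFrame. Options such as chunksize or iterator are "
            "not compatible with archives that declare default values."
        )
    df[column] = value
    return df


# Made-up, tab-separated sample data.
data = "id\tscientificName\n1\tParus major\n2\tPica pica\n"

# A plain DataFrame gets the default column as usual.
frame = apply_default_value(
    pd.read_csv(io.StringIO(data), sep="\t"), "country", "Belgium"
)

# A chunked reader now produces an explicit, descriptive error instead.
reader = pd.read_csv(io.StringIO(data), sep="\t", chunksize=1)
try:
    apply_default_value(reader, "country", "Belgium")
    message = None
except ChunkedDefaultValueError as exc:
    message = str(exc)
```

Subclassing `TypeError` in this sketch means any caller that already catches the original error would keep working, while the message now names the incompatible options.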

nigelcharman commented 4 months ago

Would it be worth adding a note to https://python-dwca-reader.readthedocs.io/en/latest/pandas_tutorial.html too? It was this documentation that led me to believe that this combination might be possible.