Issue: reading-in a Dino zip file returns a ValueError due to a duplicate time-index

ArtesiaWater / hydropandas

Module for loading observation data into custom DataFrames

https://hydropandas.readthedocs.io

MIT License


Issue: reading-in a Dino zip file returns a ValueError due to a duplicate time-index #61

Closed jvansijl closed 3 years ago

jvansijl commented 3 years ago

Issue: reading in a Dino zip file raises ValueError: Shape of passed values is (3329, 9), indices imply (3327, 9). This happens, for example, for tube B22D0155, filter 1.

The error occurs while reshaping in io_dino.py, line 297: measurements = pd.concat([measurements, s], axis=1)

This filter has a duplicate time index, which is the probable culprit.
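
A quick way to confirm that a time index contains duplicates is sketched below. This is only an illustration: the dates, values and the column name stand_cm_nap are made up for the example and are not taken from the B22D0155 file.

```python
import pandas as pd

# Hypothetical measurements: two rows share the same date.
index = pd.to_datetime(["2001-01-01", "2001-01-02", "2001-01-02", "2001-01-03"])
measurements = pd.DataFrame({"stand_cm_nap": [10.0, 11.0, 12.0, 13.0]}, index=index)

# True when at least one timestamp occurs more than once.
has_duplicates = not measurements.index.is_unique

# keep=False marks every occurrence of a duplicated timestamp,
# so all conflicting rows can be inspected side by side.
duplicated_rows = measurements[measurements.index.duplicated(keep=False)]
print(duplicated_rows)
```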

Since Dinoloket is in a frozen state (no more data will be added by TNO), perhaps we can change _read_dino_groundwater_measurements to accommodate?

Proposed change, starting at line 156 of io_dino.py:

        try:
            measurements = pd.read_csv(f, header=None, names=titel,
                                       parse_dates=['peildatum'],
                                       index_col='peildatum',
                                       dayfirst=True,
                                       usecols=usecols)
            # drop rows with a duplicated peildatum index, keeping the last occurrence
            measurements = measurements[~measurements.index.duplicated(keep='last')]
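
To see what the proposed line does, here is a small self-contained sketch that mimics the read_csv call on an in-memory CSV. The data and the shortened column list are invented for the example. The real Dino files have more columns, and titel and usecols come from the surrounding function.

```python
import io

import pandas as pd

# Hypothetical two-column extract; 28-02-2001 occurs twice.
csv_text = "27-02-2001,10\n28-02-2001,11\n28-02-2001,12\n"
titel = ["peildatum", "stand"]  # assumed column names for this sketch

measurements = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=titel,
    parse_dates=["peildatum"],
    index_col="peildatum",
    dayfirst=True,
)

# duplicated(keep='last') flags every occurrence except the last one,
# so negating the mask keeps only the final row for each timestamp.
deduplicated = measurements[~measurements.index.duplicated(keep="last")]
print(deduplicated)
```

With keep='first' the earlier row (value 11) would be retained instead. Which one is preferable depends on how the duplicates arose.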

OnnoEbbens commented 3 years ago

Thanks for pointing this out. I think your proposed change will work.

Could you send me the csv file of tube B22D0155, filter 1? Then I can check it and run some tests. You can upload the file here on GitHub.

jvansijl commented 3 years ago

Thanks, Onno.

B22D0155001_1.zip

dbrakenhoff commented 3 years ago

Back in the days of the DINO API, files could contain any number of duplicate measurements, because the data was originally measured at a higher frequency while the datetime index only had a daily resolution.

That is unfortunately no longer an issue, but I think we should still support returning the data exactly as is (including duplicates) if the user wants. The default should be to drop them, and perhaps a warning that measurements are being dropped is a good idea too.
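
The behaviour suggested above (drop by default, warn when rows are dropped, allow opting out) could look roughly like the sketch below. The function drop_duplicate_index and its parameters remove_duplicates and keep are hypothetical names chosen for this example. They are not the hydropandas API.

```python
import warnings

import pandas as pd


def drop_duplicate_index(measurements, remove_duplicates=True, keep="last"):
    """Hypothetical helper: optionally drop rows with a duplicated index."""
    if not remove_duplicates:
        # Return the data exactly as read, duplicates included.
        return measurements
    is_duplicate = measurements.index.duplicated(keep=keep)
    n_dropped = int(is_duplicate.sum())
    if n_dropped:
        warnings.warn(
            f"dropping {n_dropped} measurement(s) with a duplicate time index"
        )
    return measurements[~is_duplicate]


# Made-up data with one duplicated date.
index = pd.to_datetime(["2001-01-01", "2001-01-02", "2001-01-02"])
raw = pd.DataFrame({"stand": [10, 11, 12]}, index=index)

as_is = drop_duplicate_index(raw, remove_duplicates=False)
```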

EDIT: typos

OnnoEbbens commented 3 years ago

I agree with Davíd that it is nice to have the option to return the data exactly as is (including duplicates). I fixed the error with duplicate indices. With the latest commit in dev you should be able to read the csv file. For now it will return the measurements with duplicate indices.

I will leave this issue open because I still want to create the option in read_dino_groundwater_csv to drop the duplicates as suggested.

OnnoEbbens commented 3 years ago

I added an optional argument to read_dino_groundwater_csv to remove duplicate indices. I used the code suggestion from @jvansijl for this. Should be available in the dev branch now.