barronh / pseudonetcdf

PseudoNetCDF like NetCDF except for many scientific format backends
GNU Lesser General Public License v3.0

Pseudonetcdf engine with Mfdataset using multiple csv files #74

Closed ArmanAttaran closed 4 years ago

ArmanAttaran commented 4 years ago

Hello, I am attempting to open about 50 CSV files, each of which represents one month of data, and I would like to open them all in xarray. Unfortunately it is not working. I have included the code and the error below; any help would be appreciated.

import xarray as xr

xr.open_mfdataset('C:/Users/*.csv', concat_dim="time", data_vars='minimal', coords='minimal', compat='override', engine='pseudonetcdf', backend_kwargs={'format': 'csv'})

DtypeWarning: Columns (12) have mixed types. Specify dtype option on import or set low_memory=False.
  file, _ = self._acquire_with_cache_info(needs_lock)

barronh commented 4 years ago

Arman,

It would be easiest if I could test with a few of the files. Can you post several or point me to the source online?

Also, what version of pseudonetcdf and xarray are you using?

ArmanAttaran commented 4 years ago

I am using PseudoNetCDF 3.1.0 and xarray 0.14.

I have attached a few of the files below.

Thank you.

LA_taxes.zip

barronh commented 4 years ago

First, the message you shared comes from the heuristics pandas uses to infer column types when it opens a CSV file. You should be able to add low_memory=False to the backend_kwargs to suppress that message. I did not see that message with the files you shared.
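As a rough illustration of the two options the warning text names, the sketch below calls pandas directly rather than going through PseudoNetCDF. The CSV content and the column name `value` are made up for the example, and a file this small would not trigger the warning on its own; the point is only to show where `low_memory` and `dtype` go.

```python
import io

import pandas as pd

# Hypothetical data: the "value" column mixes numbers and text.
csv_text = "id,value\n1,10\n2,abc\n3,30\n"

# Option 1: let pandas read the whole column before inferring its type.
df_whole = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Option 2: state the column type explicitly so nothing has to be inferred.
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={"value": str})

print(df_typed["value"].tolist())
```

Either keyword should remove the ambiguity that the warning reports; which one is appropriate depends on what the mixed column is supposed to contain.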

Second, I was able to successfully open those files with PseudoNetCDF, but something seems amiss. The files you shared are not consistent with your attempt to open them. Notice that you are trying to stack on time and the files you shared with me do not have a time column.
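One way to check this before stacking is to read only the header row of each file. The helper below is a small sketch using the standard library; the function name `files_missing_column` is hypothetical and it assumes the first row of every file is the header.

```python
import csv
from glob import glob


def files_missing_column(pattern, column):
    """Return the sorted paths matching `pattern` whose header lacks `column`."""
    missing = []
    for path in sorted(glob(pattern)):
        with open(path, newline="") as f:
            # Read only the first row; an empty file yields an empty header.
            header = next(csv.reader(f), [])
        if column not in header:
            missing.append(path)
    return missing
```

Calling `files_missing_column('*_????.csv', 'time')` would list every matching file that has no time column, which is a quick way to confirm whether stacking on time can work at all.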

To make these files work with PseudoNetCDF, I simply used the coordkeys keyword to make the id column the dimension and then stacked on id:

from glob import glob
import PseudoNetCDF as pnc

f = pnc.pncmfopen(sorted(glob('*_????.csv')), stackdim='id', format='csv', coordkeys=['id'])
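For anyone unsure what the pattern '*_????.csv' selects, here is a small sketch using `fnmatch`, which follows the same shell-style wildcard rules as `glob`. The file names are hypothetical and are not taken from the attached archive.

```python
from fnmatch import fnmatch

pattern = "*_????.csv"

# Hypothetical file names, for illustration only.
names = [
    "taxes_2020.csv",
    "taxes_2019.csv",
    "taxes_19.csv",
    "taxes_2019_v2.csv",
    "notes.txt",
]

# "?" matches exactly one character, so only names that end in an underscore,
# four characters, and ".csv" are kept.
matched = sorted(name for name in names if fnmatch(name, pattern))
print(matched)
```

Sorting matters because `glob` does not guarantee an order; with four-digit years in the name, a lexical sort also puts the files in chronological order.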

To make them work in xarray, I passed coordkeys in the backend_kwargs and then updated your open_mfdataset call to be consistent with the current API:

import xarray as xr

f = xr.open_mfdataset(
    '*_????.csv', combine='by_coords',
    data_vars='minimal', coords='minimal', compat='override',
    engine='pseudonetcdf', backend_kwargs={'format': 'csv', 'coordkeys': ('id',)}
)

The data_vars, coords, and compat options are optional.

If you still get the warning, add low_memory=False to the backend_kwargs.