OliverSherouse / wbdata

A python library for accessing world bank data
GNU General Public License v2.0

cannot handle a non-unique multi index #8

Closed wochner closed 6 years ago

wochner commented 7 years ago

Running the indicators

{u'IC.CRD.INFO.XQ': u'Depth of credit information index (0=low to 8=high)', u'IC.ISV.CPI': u'Creditor participation index (0-4)'}

in the function

df = wbdata.get_dataframe(indicators, convert_date=True)

raises the error "cannot handle a non-unique multi-index!". Running the two indicators separately works fine.

Is this a bug, or a misspecification on my side?

Exception                                 Traceback (most recent call last)
in ()
----> 1 df = wbdata.get_dataframe(indicators)

in get_dataframe(indicators, country, data_date, convert_date, keep_levels)

C:\Python27\lib\site-packages\wbdata\api.pyc in uses_pandas(f, *args, **kwargs)
     51     if not pd:
     52         raise ValueError("Pandas must be installed to be used")
---> 53     return f(*args, **kwargs)
     54
     55

C:\Python27\lib\site-packages\wbdata\api.pyc in get_dataframe(indicators, country, data_date, convert_date, keep_levels)
    415                            pandas=True, keep_levels=keep_levels)
    416              for i in indicators}
--> 417     return pd.DataFrame(to_df)
    418
    419

C:\Python27\lib\site-packages\pandas\core\frame.pyc in __init__(self, data, index, columns, dtype, copy)
    273                                  dtype=dtype, copy=copy)
    274         elif isinstance(data, dict):
--> 275             mgr = self._init_dict(data, index, columns, dtype=dtype)
    276         elif isinstance(data, ma.MaskedArray):
    277             import numpy.ma.mrecords as mrecords

C:\Python27\lib\site-packages\pandas\core\frame.pyc in _init_dict(self, data, index, columns, dtype)
    409             arrays = [data[k] for k in keys]
    410
--> 411         return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    412
    413     def _init_ndarray(self, values, index, columns, dtype=None, copy=False):

C:\Python27\lib\site-packages\pandas\core\frame.pyc in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   5499
   5500     # don't force copy because getting jammed in an ndarray anyway
-> 5501     arrays = _homogenize(arrays, index, dtype)
   5502
   5503     # from BlockManager perspective

C:\Python27\lib\site-packages\pandas\core\frame.pyc in _homogenize(data, index, dtype)
   5798                 # Forces alignment. No need to copy data since we
   5799                 # are putting it into an ndarray later
-> 5800                 v = v.reindex(index, copy=False)
   5801         else:
   5802             if isinstance(v, dict):

C:\Python27\lib\site-packages\pandas\core\series.pyc in reindex(self, index, **kwargs)
   2424     @Appender(generic._shared_docs['reindex'] % _shared_doc_kwargs)
   2425     def reindex(self, index=None, **kwargs):
-> 2426         return super(Series, self).reindex(index=index, **kwargs)
   2427
   2428     @Appender(generic._shared_docs['fillna'] % _shared_doc_kwargs)

C:\Python27\lib\site-packages\pandas\core\generic.pyc in reindex(self, *args, **kwargs)
   2513         # perform the reindex on the axes
   2514         return self._reindex_axes(axes, level, limit, tolerance, method,
-> 2515                                   fill_value, copy).__finalize__(self)
   2516
   2517     def _reindex_axes(self, axes, level, limit, tolerance, method, fill_value,

C:\Python27\lib\site-packages\pandas\core\generic.pyc in _reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
   2526             ax = self._get_axis(a)
   2527             new_index, indexer = ax.reindex(labels, level=level, limit=limit,
-> 2528                                             tolerance=tolerance, method=method)
   2529
   2530             axis = self._get_axis_number(a)

C:\Python27\lib\site-packages\pandas\core\indexes\multi.pyc in reindex(self, target, method, level, limit, tolerance)
   1861                                                   tolerance=tolerance)
   1862             else:
-> 1863                 raise Exception("cannot handle a non-unique multi-index!")
   1864
   1865         if not isinstance(target, MultiIndex):

Exception: cannot handle a non-unique multi-index!

OliverSherouse commented 7 years ago

I can reproduce this. I'll take a look.

OliverSherouse commented 7 years ago

Alright, this is a fun one: the series 'IC.ISV.CPI' returns two values for items with the name 'Mexico - Mexico City'. One has the id PK, the other has the id MX. To make things worse, multiple cities have the id MX. I'm not sure how to handle this in accordance with the Law of Least Astonishment. What behavior do you think would be least surprising? I could:

  1. Create a MultiIndex on country id, country value, and date instead of country value and date
  2. Throw the error and trust the user to look into it
  3. Refuse to create a pandas series for any series with duplicate indices
  4. Something else?

I hate option 1 least, though it would be a nontrivial change for current users.
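
To make option 1 concrete, here is a minimal sketch in plain pandas. It does not use wbdata's internals; the rows, column names, and values are made up to mimic the duplicate described above, so treat it as an illustration of the idea rather than the actual fix.

```python
import pandas as pd

# Hypothetical rows (not real World Bank data): two entries share the name
# "Mexico - Mexico City" and the same date, but carry different ids.
rows = [
    {"id": "PK", "name": "Mexico - Mexico City", "date": "2016", "value": 1.0},
    {"id": "MX", "name": "Mexico - Mexico City", "date": "2016", "value": 2.0},
    {"id": "MX", "name": "Mexico - Monterrey", "date": "2016", "value": 3.0},
]
frame = pd.DataFrame(rows)

# Index on (name, date) only: the two Mexico City rows collide.
by_name = frame.set_index(["name", "date"])["value"]

# Option 1: index on (id, name, date): unique for these sample rows.
by_id = frame.set_index(["id", "name", "date"])["value"]

print(by_name.index.is_unique)
print(by_id.index.is_unique)
```

With these sample rows the first index contains a duplicate and the second does not. The cost is the one noted above: anyone selecting by (name, date) today would have to add the id level.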

wochner commented 6 years ago

Thanks for looking into it. Would it be an option to exclude subnational data? A number of other World Bank data sets only have data at the national level. Or could including subnational data be made optional?
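
A rough sketch of what such an opt-in could look like, again in plain pandas. The helper name `drop_subnational`, the `include_subnational` flag, and the rule that a name containing " - " marks a subnational entry are all assumptions for illustration (guessed from the 'Mexico - Mexico City' example); none of them come from wbdata or the World Bank API.

```python
import pandas as pd


def drop_subnational(frame, include_subnational=False):
    """Hypothetical helper: drop rows whose name looks like 'Country - City'.

    Assumes subnational entries can be recognised by ' - ' in the name,
    which is only a guess based on the example in this thread.
    """
    if include_subnational:
        return frame
    looks_subnational = frame["name"].str.contains(" - ", regex=False)
    return frame[~looks_subnational]


# Made-up sample rows, not real World Bank data.
sample = pd.DataFrame(
    [
        {"id": "MX", "name": "Mexico", "date": "2016", "value": 1.0},
        {"id": "MX", "name": "Mexico - Mexico City", "date": "2016", "value": 2.0},
        {"id": "PK", "name": "Mexico - Mexico City", "date": "2016", "value": 3.0},
    ]
)

national_only = drop_subnational(sample)
everything = drop_subnational(sample, include_subnational=True)
```

A name-based rule like this is fragile; a flag supplied by the data source itself would be more reliable if one exists.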

OliverSherouse commented 6 years ago

It looks like this has been fixed in the API itself.