dr-leo / pandaSDMX

Python interface to SDMX
Apache License 2.0
126 stars 58 forks source link

Response too large due to client request #4

Closed geoffwright240 closed 9 years ago

geoffwright240 commented 9 years ago

With some queries, the sdmx server responds with the following message ...

Message code="413" Response too large due to client request Query size exceeds maximum limit (1000000 entries).

-- Code to replicate error -- estat = client('Eurostat', 'eurodb.db') db = estat.get_dataflows()

gdp_table = db.execute('SELECT * FROM ESTAT_dataflows WHERE title LIKE "%GDP%"') gdp_list = gdp_table.fetchall()

bop = gdp_list[1] df, md = estat.get_data(bop, '', concat = True) # Request is too large and this call fails silently.

In theory, the filters parameter in the get_data function should help reduce the size of the request.

geoffwright240 commented 9 years ago

For those that are looking at using filters in request, the following code shows an example.

from pandasdmx import client estat = client("Eurostat") estat.get_data("namq_10_gdp",'.CLV10_MEUR.SWDA.B1GQ.DE', concat = False)

First argument is the flowref, second argument is the "key". This is where you add filters to your request.

A useful website for those looking to learn more about Eurostat's web services can be found here:

http://ec.europa.eu/eurostat/web/sdmx-web-services/data-struct-def

and

http://ec.europa.eu/eurostat/web/sdmx-web-services/a-few-useful-points

dr-leo commented 9 years ago

Thanks. You are right: This error is not caught although it should be. And it will be in v0.2 (see below).

I have embarked on a complete rewrite. v0.1.2 was mainly a proof of concept. I have since liaised with the creators of SDMX who gave me some advice. End November I decided to rewrite the whole thing. I have looked into configuration and type validation through IPython.utils.traitlets and IPython.config.configurable, application-wide logging etc.

I have dropped the SQLite stuff. It will be replaced by xpath queries on the XML document. At the heart of the new architecture there will be a pythonic representation of the SDMX information model (see Part 2 of the SDMX2.1 Standard and the user guide on www.sdmx.org). The information model is separate from the code relating to XML, JSON etc. in the reader package, and pandas, xray, Excel, SQLAlchemy, smoke signals etc. in the writer package. The information model classes call reader methods which execute xpath queries on the xml document. These method calls are wrapped in properties.

The code is evolving but far from stable. There is an incredible amount of work remaining. But I believe the goal is worth the effort. Some one has to do it, not necessarily myself.

On v0.2dev: The information model in model.py is heavily under construction. There is a Message class representing an SDMX message. It will allow error detection after downloading.

Please let me know any thoughts you might have.

Leo

Am 02.01.2015 um 02:09 schrieb geoffwright240:

With some queries, the sdmx server responds with the following message ...

footer:FooterResponse too large due to client request/common:TextQuery size exceeds maximum limit (1000000 entries)./common:Text/footer:Message/footer:Footer

Generic setup

estat = client('Eurostat', 'eurodb.db') db = estat.get_dataflows()

gdp_table = db.execute('SELECT * FROM ESTAT_dataflows WHERE title LIKE "%GDP%"') gdp_list = gdp_table.fetchall()

bop = gdp_list[1] df, md = estat.get_data(bop, '', concat = True) # Request is too large and this call fails silently.

In theory, the filters parameter in the get_data function should help reduce the size of the request.

— Reply to this email directly or view it on GitHub https://github.com/dr-leo/pandaSDMX/issues/4.

dr-leo commented 9 years ago

Thanks. Very useful.

The forthcoming v0.2 will support filter creation pythoniccally rather than through a bare function argument. Any help is much appreciated. To get a first impression check out the dev versions regularly. But don't expect anything to work as expected. Currently I am working on the pythonic representation of structures such as codelists and conceptschemes. Datastructure-Definitions will follow, only thereafter datasets.

Am 02.01.2015 um 14:27 schrieb geoffwright240:

For those that are looking at using filters in request, the following code shows an example.

from pandasdmx import client estat = client("Eurostat") estat.get_data("namq_10_gdp",'.CLV10_MEUR.SWDA.B1GQ.DE', concat = False)

First argument is the flowref, second argument is the "key". This is where you add filters to your request.

A useful website for those looking to learn more about Eurostat's web services can be found here:

http://ec.europa.eu/eurostat/web/sdmx-web-services/data-struct-def

and

http://ec.europa.eu/eurostat/web/sdmx-web-services/a-few-useful-points

— Reply to this email directly or view it on GitHub https://github.com/dr-leo/pandaSDMX/issues/4#issuecomment-68526378.

geoffwright240 commented 9 years ago

0.2dev looks very promising … although I don’t see model.py in the repo yet.

I was extremely happy to see this though as I was just in the process of trying to figure out how to solve this issue. I almost wonder if this bit of code should be in pandas as opposed to pandaSDMX. I also wonder what the best way to handle this is … the time period we are talking about is actually a quarter so not totally correct to convert to first month. (I wonder how other languages treat this. I recall from many years ago that FAME had a quarter time period defined.)

Time span conversions not recognised by pandas:

time_spans = { 'Q1' : '01-01', 'Q2' : '04-01', 'Q3' : '07-01', 'Q4' : '10-01', 'S1' : '01-01', 'S2' : '07-01' }

Geoff

On Jan 2, 2015, at 4:22 PM, dr-leo notifications@github.com wrote:

Thanks. Very useful.

The forthcoming v0.2 will support filter creation pythoniccally rather than through a bare function argument. Any help is much appreciated. To get a first impression check out the dev versions regularly. But don't expect anything to work as expected. Currently I am working on the pythonic representation of structures such as codelists and conceptschemes. Datastructure-Definitions will follow, only thereafter datasets.

Am 02.01.2015 um 14:27 schrieb geoffwright240:

For those that are looking at using filters in request, the following code shows an example.

from pandasdmx import client estat = client("Eurostat") estat.get_data("namq_10_gdp",'.CLV10_MEUR.SWDA.B1GQ.DE', concat = False)

First argument is the flowref, second argument is the "key". This is where you add filters to your request.

A useful website for those looking to learn more about Eurostat's web services can be found here:

http://ec.europa.eu/eurostat/web/sdmx-web-services/data-struct-def

and

http://ec.europa.eu/eurostat/web/sdmx-web-services/a-few-useful-points

— Reply to this email directly or view it on GitHub https://github.com/dr-leo/pandaSDMX/issues/4#issuecomment-68526378.

— Reply to this email directly or view it on GitHub https://github.com/dr-leo/pandaSDMX/issues/4#issuecomment-68563571.

dr-leo commented 9 years ago

I double-checked that model.py is tracked. It should be on github for quite a while. Have you updated to tip? the v0.2dev branch?

On pandas time spans: Agreed. I wanted to open an issue on pydata/pandas but I wasn't entirely sure that I hadn't overlooked anything. There is so much debate on time spans on the pydata mailing list... I take your comment as a nudge to raise this question on the list. Leo Am 03.01.2015 um 01:22 schrieb geoffwright240:

0.2dev looks very promising … although I don’t see model.py in the repo yet.

I was extremely happy to see this though as I was just in the process of trying to figure out how to solve this issue. I almost wonder if this bit of code should be in pandas as opposed to pandaSDMX. I also wonder what the best way to handle this is … the time period we are talking about is actually a quarter so not totally correct to convert to first month. (I wonder how other languages treat this. I recall from many years ago that FAME had a quarter time period defined.)

Time span conversions not recognised by pandas:

time_spans = { 'Q1' : '01-01', 'Q2' : '04-01', 'Q3' : '07-01', 'Q4' : '10-01', 'S1' : '01-01', 'S2' : '07-01' }

Geoff

On Jan 2, 2015, at 4:22 PM, dr-leo notifications@github.com wrote:

Thanks. Very useful.

The forthcoming v0.2 will support filter creation pythoniccally rather than through a bare function argument. Any help is much appreciated. To get a first impression check out the dev versions regularly. But don't expect anything to work as expected. Currently I am working on the pythonic representation of structures such as codelists and conceptschemes. Datastructure-Definitions will follow, only thereafter datasets.

Am 02.01.2015 um 14:27 schrieb geoffwright240:

For those that are looking at using filters in request, the following code shows an example.

from pandasdmx import client estat = client("Eurostat") estat.get_data("namq_10_gdp",'.CLV10_MEUR.SWDA.B1GQ.DE', concat = False)

First argument is the flowref, second argument is the "key". This is where you add filters to your request.

A useful website for those looking to learn more about Eurostat's web services can be found here:

http://ec.europa.eu/eurostat/web/sdmx-web-services/data-struct-def

and

http://ec.europa.eu/eurostat/web/sdmx-web-services/a-few-useful-points

— Reply to this email directly or view it on GitHub https://github.com/dr-leo/pandaSDMX/issues/4#issuecomment-68526378.

— Reply to this email directly or view it on GitHub https://github.com/dr-leo/pandaSDMX/issues/4#issuecomment-68563571.

— Reply to this email directly or view it on GitHub https://github.com/dr-leo/pandaSDMX/issues/4#issuecomment-68575639.

geoffwright240 commented 9 years ago

Leo,

Sorry. My mistake — yes, model.py is in v0.2dev branch. Also, don’t let me slow you down as there is a ton of learning I need to do before being able to contribute in a meaningful way.

I read http://pandas.pydata.org/pandas-docs/stable/timeseries.html more carefully and think that using a Period is the way to go for handling quarterly time series that are returned by Eurostat. (There could be something more but this looks like a decent enough solution for now.) p = Period('2012Q1', freq='Q-DEC') What would be the best way for me to help contribute ? Once v0.2dev is close, I could help with the documentation. Also, would it make sense for me to write a couple of unit tests ? Not too sure exactly what this entails but I could dig into it if that makes the most sense.

If you are interested in my motivation for the pandaSDMX project, I am looking at this is to integrate links to SDMX web services into a little tool that I am building for Excel. I am not a huge fan of Excel but it is so ubiquitous (and what I use at my job that pays the mortgage) so there is a huge user community there. The tool is open source and I would like to eventually port to Openoffice and gnumeric. My project is very rudimentary at this stage but it generally works as hoped. I have not put it up on Github yet as the code is way too premature to have in the public.

Once complete, a user will only have to put the following function into a MS Excel cell and it will pull a time series into your spreadsheet.

=DataReader(series:‘GDP’,db:’fred’,startdate:’2000-01-01’,enddate:’2014-01-01’,showdates:’True’,...)

This is similar to functionality you can get through Bloomberg’s BDH, SunGard’s FAMEPopulator and S&P’s CapitalIQ but you are not restricted to their proprietary databases (which are in large part just aggregators for public data anyways.)

My approach only works on Mac currently as the implementation of xlwings is not async and therefore gets trapped by limitations on Microsofts com restrictions. The guys behind xlwings have a fix for this that they are currently implementing.

Anyways, not to distract you from your projects at hand, but if you are interested here is a little bundle of code and a video to give you an idea of what I am trying to do. (It’s probably just easiest to watch the video but happy to take any comments you have on the code.)

http://wrightville.org/rel/xlwings_dl.zip

All the best Leo !

Geoff

Le 2015-01-03 à 03:11, dr-leo notifications@github.com a écrit :

I double-checked that model.py is tracked. It should be on github for quite a while. Have you updated to tip? the v0.2dev branch?

On pandas time spans: Agreed. I wanted to open an issue on pydata/pandas but I wasn't entirely sure that I hadn't overlooked anything. There is so much debate on time spans on the pydata mailing list... I take your comment as a nudge to raise this question on the list. Leo Am 03.01.2015 um 01:22 schrieb geoffwright240:

0.2dev looks very promising … although I don’t see model.py in the repo yet.

I was extremely happy to see this though as I was just in the process of trying to figure out how to solve this issue. I almost wonder if this bit of code should be in pandas as opposed to pandaSDMX. I also wonder what the best way to handle this is … the time period we are talking about is actually a quarter so not totally correct to convert to first month. (I wonder how other languages treat this. I recall from many years ago that FAME had a quarter time period defined.)

Time span conversions not recognised by pandas:

time_spans = { 'Q1' : '01-01', 'Q2' : '04-01', 'Q3' : '07-01', 'Q4' : '10-01', 'S1' : '01-01', 'S2' : '07-01' }

Geoff

On Jan 2, 2015, at 4:22 PM, dr-leo notifications@github.com wrote:

Thanks. Very useful.

The forthcoming v0.2 will support filter creation pythoniccally rather than through a bare function argument. Any help is much appreciated. To get a first impression check out the dev versions regularly. But don't expect anything to work as expected. Currently I am working on the pythonic representation of structures such as codelists and conceptschemes. Datastructure-Definitions will follow, only thereafter datasets.

Am 02.01.2015 um 14:27 schrieb geoffwright240:

For those that are looking at using filters in request, the following code shows an example.

from pandasdmx import client estat = client("Eurostat") estat.get_data("namq_10_gdp",'.CLV10_MEUR.SWDA.B1GQ.DE', concat = False)

First argument is the flowref, second argument is the "key". This is where you add filters to your request.

A useful website for those looking to learn more about Eurostat's web services can be found here:

http://ec.europa.eu/eurostat/web/sdmx-web-services/data-struct-def

and

http://ec.europa.eu/eurostat/web/sdmx-web-services/a-few-useful-points

— Reply to this email directly or view it on GitHub https://github.com/dr-leo/pandaSDMX/issues/4#issuecomment-68526378.

— Reply to this email directly or view it on GitHub https://github.com/dr-leo/pandaSDMX/issues/4#issuecomment-68563571.

— Reply to this email directly or view it on GitHub https://github.com/dr-leo/pandaSDMX/issues/4#issuecomment-68575639.

— Reply to this email directly or view it on GitHub.

dr-leo commented 9 years ago

Dear Jeff,

I am very happy about your interest in pandaSDMX. Apparently Excel users including myself could greatly benefit from your connector. Once pandaSDMX has matured, you may want to focus on a writer for Excel if you feel the detour through pandas has too many pitfalls.

  1. Excel I hadn't heard about xlwings before, and I know little about Excel except I have to use it on the job as well, but hardly for data analysis. A limitation on Mac would seem to be something you want to overcome in your clients' interest. There are a couple of Excel-related libs for Python. You will have looked at OpenPyXL that ships with Anaconda. And there is this nice weekly Python newsletter that talks a lot about blog posts and books on data analysis including with Excel and pandas. I'll forward you the latest one with an announcement on Yve Hilpisch's most recent title.
  2. Contributing I am sure you will identify areas where you can best contribute. You are already of such tremendous help. I face a steep learning curve, but Python makes it easy to enter new territory. And I have spent some time reading through the SDMX standard. When I decided to release pandaSDMX 0.1 I was aware of many deficiencies. But I still wanted to release it to have something to show to the SDMX community. And it has paid off: I am in contact with a very nice guy from the ECB who has given some great advice. The leading company on SDMX software is UK-based Metadata Technologies. But their free product, see www.sdmxsource.org, seems untractable. I will never learn Java or .NET. Interestingly, they have developed browser-based SDMX software written in JavaScript. I understand they can also export to Excel as central bankers use Excel for macro-economic analysis.
  3. On time spans: Yes, Pandas supports time spans (periods). But the key question is how to generate a period index from a list of strings gleened from an XML document. pandas.to_datetime() fails. So I have asked on the pydata mailing list. Let's see what the experts say. - At the end of the day, i.e. once pandaSDMX 0.2 is ready, this will be no big deal as SDMX DataStructureDefinitions contain an abstract description of the datetime index which will then be generated exactly as you set out below. But when relying on the generic dataset format, each observation is related to a time representation such as "2007-Q3". So you have to digest long lists of these strings. If pandas cannot parse them, you have to work around this gap. Hence the little dictionary you compared to FAME. By the way, it is very valuable that you seem to have some familiarity with Bloomberg, S&P etc. So you'll have good ideas on usability. I work in financial supervision, but as a lawyer, not statisticion, so I have no experience with Bloomberg, Thomson-Reuters etc.

In order to get a good understanding of XML within the SDMX context, you want to make sure to be able to read XML documents. Eclipse seems like a good environment for this, but there are many IDE's out there.

Anyway, time spans are details for later. the next goal is to get the Message class in the model right. I'll have to read through the SDMX-ML specification (= the XML representation for SDMX) to find out about the different message types, attributes and children and create a pythonic API for that. This is no rocket science. But it requires good intuition for usability and the most concise and elegant way of expressing the structures. I recommend looking at the pandasdmx.agency.Agency class as well as the utils.HasItems class (the latter being under construction). I am working on it. I hope to allow indexing and dict-like attribute access on XML attributes and nodes, all through properties. Don't know if this is too fancy... I'll commit the latest ideas to github now. I think I have a decent implementation for ItemSchemes such as Codelists and ConceptSchemes. Extending it to CategorySchemes and Dataflows will be very easy. But the DataStructureDefinition itself with all those datatypes (facets) will be a bit of a challenge I think. And then constraints...

So much for now. Please ignore from the above whatever bores you or looks too messy.

Again, I am very glad you are interested in my little pet project and I am hopeful we can both benefit from each other's skills and efforts.

Leo

Am 03.01.2015 um 15:11 schrieb geoffwright240:

Leo,

Sorry. My mistake — yes, model.py is in v0.2dev branch. Also, don’t let me slow you down as there is a ton of learning I need to do before being able to contribute in a meaningful way.

I read http://pandas.pydata.org/pandas-docs/stable/timeseries.html more carefully and think that using a Period is the way to go for handling quarterly time series that are returned by Eurostat. (There could be something more but this looks like a decent enough solution for now.) p = Period('2012Q1', freq='Q-DEC') What would be the best way for me to help contribute ? Once v0.2dev is close, I could help with the documentation. Also, would it make sense for me to write a couple of unit tests ? Not too sure exactly what this entails but I could dig into it if that makes the most sense.

If you are interested in my motivation for the pandaSDMX project, I am looking at this is to integrate links to SDMX web services into a little tool that I am building for Excel. I am not a huge fan of Excel but it is so ubiquitous (and what I use at my job that pays the mortgage) so there is a huge user community there. The tool is open source and I would like to eventually port to Openoffice and gnumeric. My project is very rudimentary at this stage but it generally works as hoped. I have not put it up on Github yet as the code is way too premature to have in the public.

Once complete, a user will only have to put the following function into a MS Excel cell and it will pull a time series into your spreadsheet.

=DataReader(series:‘GDP’,db:’fred’,startdate:’2000-01-01’,enddate:’2014-01-01’,showdates:’True’,...)

This is similar to functionality you can get through Bloomberg’s BDH, SunGard’s FAMEPopulator and S&P’s CapitalIQ but you are not restricted to their proprietary databases (which are in large part just aggregators for public data anyways.)

My approach only works on Mac currently as the implementation of xlwings is not async and therefore gets trapped by limitations on Microsofts com restrictions. The guys behind xlwings have a fix for this that they are currently implementing.

Anyways, not to distract you from your projects at hand, but if you are interested here is a little bundle of code and a video to give you an idea of what I am trying to do. (It’s probably just easiest to watch the video but happy to take any comments you have on the code.)

http://wrightville.org/rel/xlwings_dl.zip

All the best Leo !

Geoff

Le 2015-01-03 à 03:11, dr-leo notifications@github.com a écrit :

I double-checked that model.py is tracked. It should be on github for quite a while. Have you updated to tip? the v0.2dev branch?

On pandas time spans: Agreed. I wanted to open an issue on pydata/pandas but I wasn't entirely sure that I hadn't overlooked anything. There is so much debate on time spans on the pydata mailing list... I take your comment as a nudge to raise this question on the list. Leo Am 03.01.2015 um 01:22 schrieb geoffwright240:

0.2dev looks very promising … although I don’t see model.py in the repo yet.

I was extremely happy to see this though as I was just in the process of trying to figure out how to solve this issue. I almost wonder if this bit of code should be in pandas as opposed to pandaSDMX. I also wonder what the best way to handle this is … the time period we are talking about is actually a quarter so not totally correct to convert to first month. (I wonder how other languages treat this. I recall from many years ago that FAME had a quarter time period defined.)

Time span conversions not recognised by pandas:

time_spans = { 'Q1' : '01-01', 'Q2' : '04-01', 'Q3' : '07-01', 'Q4' : '10-01', 'S1' : '01-01', 'S2' : '07-01' }

Geoff

On Jan 2, 2015, at 4:22 PM, dr-leo notifications@github.com wrote:

Thanks. Very useful.

The forthcoming v0.2 will support filter creation pythoniccally rather than through a bare function argument. Any help is much appreciated. To get a first impression check out the dev versions regularly. But don't expect anything to work as expected. Currently I am working on the pythonic representation of structures such as codelists and conceptschemes. Datastructure-Definitions will follow, only thereafter datasets.

Am 02.01.2015 um 14:27 schrieb geoffwright240:

For those that are looking at using filters in request, the following code shows an example.

from pandasdmx import client estat = client("Eurostat") estat.get_data("namq_10_gdp",'.CLV10_MEUR.SWDA.B1GQ.DE', concat = False)

First argument is the flowref, second argument is the "key". This is where you add filters to your request.

A useful website for those looking to learn more about Eurostat's web services can be found here:

http://ec.europa.eu/eurostat/web/sdmx-web-services/data-struct-def

and

http://ec.europa.eu/eurostat/web/sdmx-web-services/a-few-useful-points

— Reply to this email directly or view it on GitHub

https://github.com/dr-leo/pandaSDMX/issues/4#issuecomment-68526378.

— Reply to this email directly or view it on GitHub https://github.com/dr-leo/pandaSDMX/issues/4#issuecomment-68563571.

— Reply to this email directly or view it on GitHub https://github.com/dr-leo/pandaSDMX/issues/4#issuecomment-68575639.

— Reply to this email directly or view it on GitHub.

— Reply to this email directly or view it on GitHub https://github.com/dr-leo/pandaSDMX/issues/4#issuecomment-68596035.

dr-leo commented 9 years ago

Fixed in v0.2.0. v0.2.1 will add native support to handle the zip files returned by Eurostat.