dr-leo / pandaSDMX

Python interface to SDMX
Apache License 2.0

XMLParseError when requesting data using a dictionary key from ABS_XML #253

Open Chowti opened 9 months ago

Chowti commented 9 months ago

Using Python 3.11.7 and pandasdmx 1.10.0.

I am getting an XMLParseError while attempting to get data using a dictionary key from "ABS_XML".

import pandasdmx as sdmx

abs_xml = sdmx.Request("ABS_XML")

resp = abs_xml.data('ABS_ANNUAL_ERP_LGA2022',
                    key = dict(SEX_ABS='1'),
                    params = dict(startPeriod='2021'))
Traceback

```
c:\Users\timot\anaconda3\envs\SDMX\Lib\site-packages\pandasdmx\remote.py:11: RuntimeWarning: optional dependency requests_cache is not installed; cache options to Session() have no effect
  warn(

--- SS without DSD ---
{1: False}

--- ---
{2:
  id: 'IDREF59600'
  prepared: '2024-02-05T17:21:23.770127+11:00'
  receiver:
  sender:
  source:
  test: False}

--- ---
{'ABS_ANNUAL_ERP_LGA2022': }

--- ---
{'ABS': }

--- ---
{'ABS_ANNUAL_ERP_LGA2022': }

--- ---
{87: , 88: }

--- ---
{'CAT_ANNUAL_ERP_LGA2022': }

--- ---
{'CL_AGE': , 'CL_ERP': , 'CL_FREQ': , 'CL_LGA_2022': , 'CL_OBS_STATUS': , 'CL_REGION_TYPE': , 'CL_SEX': , 'CL_UNIT_MEASURE': }

--- ---
{11693: , 11694: , 11700: , 11701: , 'CS_DEMOG': , 11712: , 11713: , 'CS_GEOGRAPHY': , 'CS_COMMON': , 11734: , 11735: , 'CS_ATTRIBUTE': }

--- ---
{'obs_count': Annotation(id='obs_count', title='698478', type='sdmx_metrics', url=None, text=), 11758: Annotation(id=None, title='A', type='ReleaseVersion', url=None, text=)}

--- Name ---
{11759: ('en', 'Availability (A) for ABS_ANNUAL_ERP_LGA2022')}

--- ---
{'ABS_ANNUAL_ERP_LGA2022': }

--- ---
{11762: , 11766: , 11786: , 12344: , 12348: , 12350: }

--- ---
{12353: RangePeriod(start=Period(is_inclusive=True, period=datetime.datetime(2001, 1, 1, 0, 0)), end=Period(is_inclusive=True, period=datetime.datetime(2022, 12, 31, 0, 0)))}
```

```python-traceback
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File c:\Users\timot\anaconda3\envs\SDMX\Lib\site-packages\pandasdmx\reader\sdmxml.py:299, in Reader.read_message(self, source, dsd)
    297 try:
    298     # Parse the element
--> 299     result = func(self, element)
    300 except TypeError:

File c:\Users\timot\anaconda3\envs\SDMX\Lib\site-packages\pandasdmx\reader\sdmxml.py:1190, in _ms(reader, elem)
   1189 else:
-> 1190     raise RuntimeError
   1192 if arg["values_for"] is None:

RuntimeError:

The above exception was the direct cause of the following exception:

XMLParseError                             Traceback (most recent call last)
File c:\Pystuff\pandasdmx\Fresh.py:6
      2 import pandasdmx as sdmx
      4 abs_xml = sdmx.Request("ABS_XML")
----> 6 resp = abs_xml.data('ABS_ANNUAL_ERP_LGA2022',
      7                     key = dict(SEX_ABS='1'),
      8                     params = dict(startPeriod='2021'))

File c:\Users\timot\anaconda3\envs\SDMX\Lib\site-packages\pandasdmx\api.py:457, in Request.get(self, resource_type, resource_id, tofile, use_cache, dry_run, **kwargs)
    455     req = self._request_from_url(kwargs)
    456 else:
--> 457     req = self._request_from_args(kwargs)
    459 req = self.session.prepare_request(req)
    461 # Now get the SDMX message via HTTP

File c:\Users\timot\anaconda3\envs\SDMX\Lib\site-packages\pandasdmx\api.py:287, in Request._request_from_args(self, kwargs)
    283     raise ValueError(f"unrecognized arguments: {kwargs!r}")
    285 if validate:
    286     # Make the key, and retain the DSD (if any) for use in parsing
--> 287     key, dsd = self._make_key(resource_type, resource_id, key, dsd)
    288     kwargs["dsd"] = dsd
    290 url_parts.append(key)

File c:\Users\timot\anaconda3\envs\SDMX\Lib\site-packages\pandasdmx\api.py:184, in Request._make_key(self, resource_type, resource_id, key, dsd)
    180     pass
    181 elif self.source.supports[Resource.datastructure]:
    182     # Retrieve the DataStructureDefinition
    183     dsd = (
--> 184         self.dataflow(
    185             resource_id, params=dict(references="all"), use_cache=True
    186         )
    187         .dataflow[resource_id]
    188         .structure
    189     )
    191     if dsd.is_external_reference:
    192         # DataStructureDefinition was not retrieved with the Dataflow
    193         # query; retrieve it explicitly
    194         dsd = self.get(resource=dsd, use_cache=True).structure[dsd.id]

File c:\Users\timot\anaconda3\envs\SDMX\Lib\site-packages\pandasdmx\api.py:514, in Request.get(self, resource_type, resource_id, tofile, use_cache, dry_run, **kwargs)
    511     reader = Reader()
    513 # Parse the message, using any provided or auto-queried DSD
--> 514 msg = reader.read_message(response_content, dsd=kwargs.get("dsd", None))
    516 # Store the HTTP response with the message
    517 msg.response = response

File c:\Users\timot\anaconda3\envs\SDMX\Lib\site-packages\pandasdmx\reader\sdmxml.py:317, in Reader.read_message(self, source, dsd)
    315     self._dump()
    316     print(etree.tostring(element, pretty_print=True).decode())
--> 317     raise XMLParseError from exc
    319 # Parsing complete; count uncollected items from the stacks, which represent
    320 # parsing errors
    321
    322 # Remove some internal items
    323 self.pop_single("SS without DSD")

XMLParseError: RuntimeError
```

The error appears to occur while pandasdmx retrieves the DSD (data structure definition) for the dataflow. The following call on its own reproduces it:

dsd = abs_xml.dataflow('ABS_ANNUAL_ERP_LGA2022', params=dict(references="all"), use_cache=True).dataflow['ABS_ANNUAL_ERP_LGA2022'].structure

Requesting the dataflow with references=descendants instead, and passing the returned DSD to the data request, allows the request to complete successfully.

dsd = abs_xml.dataflow('ABS_ANNUAL_ERP_LGA2022', params=dict(references="descendants"), use_cache=True).dataflow['ABS_ANNUAL_ERP_LGA2022'].structure

resp = abs_xml.data('ABS_ANNUAL_ERP_LGA2022',
                    key = dict(SEX_ABS='1'),
                    params = dict(startPeriod='2021'),
                    dsd = dsd)
<pandasdmx.DataMessage>
  <Header>
    id: 'IREF030445'
    prepared: '2024-02-08T02:01:51'
    sender: <Agency _Stat_V8>
    source: 
    test: False
  response: <Response [200]>
  DataSet (1)
  dataflow: <DataflowDefinition (missing id)>
  observation_dimension: <TimeDimension TIME_PERIOD>
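
The workaround above could be wrapped in a small helper that tries `references="all"` first and falls back to `references="descendants"` when parsing fails. This is only a sketch: `fetch_dsd` and its parameters are hypothetical names, not part of pandasdmx, and it assumes the client behaves like the `abs_xml` object used above (a `dataflow()` method returning a message with a `.dataflow` mapping).

```python
def fetch_dsd(client, flow_id,
              reference_levels=("all", "descendants"),
              parse_errors=(Exception,)):
    """Return the DSD for flow_id, trying each `references` level in order.

    client       -- an object like sdmx.Request("ABS_XML") (assumed interface)
    parse_errors -- exception types treated as "try the next level"; pass
                    (pandasdmx.exceptions.XMLParseError,) to narrow this down
    """
    last_error = None
    for level in reference_levels:
        try:
            msg = client.dataflow(flow_id, params=dict(references=level))
            return msg.dataflow[flow_id].structure
        except parse_errors as exc:
            # Remember the failure and move on to the next references level
            last_error = exc
    raise RuntimeError(
        f"could not retrieve a DSD for {flow_id!r}"
    ) from last_error


# Intended use with pandasdmx (not executed here, needs network access):
#   dsd = fetch_dsd(abs_xml, 'ABS_ANNUAL_ERP_LGA2022')
#   resp = abs_xml.data('ABS_ANNUAL_ERP_LGA2022', key=dict(SEX_ABS='1'),
#                       params=dict(startPeriod='2021'), dsd=dsd)
```

Passing the resulting `dsd` explicitly means `data()` does not have to issue its own `references=all` query, which is the call that fails in the traceback.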

My main suspicion is the parsing of the two content constraints returned from https://api.data.abs.gov.au/dataflow/ABS/ABS_ANNUAL_ERP_LGA2022/latest?references=all. These are generated automatically during a point-in-time release; see https://sis-cc.gitlab.io/dotstatsuite-documentation/using-api/embargo-management/#point-in-time-release-feature

BartStolarek commented 8 months ago

Unfortunately, I think I'm in the same boat. I'm new to pandasdmx, and to SDMX structures in general, but here is my code:

from pandasdmx import Request
import logging
import pandasdmx

abs_xml = pandasdmx.Request('ABS_XML',
                            log_level=logging.INFO)

# Dataflows
flow_msg = abs_xml.dataflow(force=True) # get dataflows
dataflows_pandas = pandasdmx.to_pandas(flow_msg.dataflow) # convert to pandas DataFrame
dataflows_pandas.to_csv('dataflows.csv')  # save dataflows to csv
sa2Data = dataflows_pandas[dataflows_pandas.str.contains('SA2+', case=False, regex=False)] # filter dataflows whose name contains the literal text "SA2+"
sa2Data.to_csv('sa2DataFlows.csv') # save SA2 dataflows to csv

example_msg = abs_xml.dataflow(resource=flow_msg.dataflow.C21_G04_SA2) # get dataflow for C21_G04_SA2

When I run that, I get the following RuntimeError:

../venv/lib/python3.10/site-packages/pandasdmx/remote.py:11: RuntimeWarning: optional dependency requests_cache is not installed; cache options to Session() have no effect
  warn(
2024-02-22 14:58:04,233 pandasdmx.api - INFO: Requesting resource from https://api.data.abs.gov.au/dataflow/ABS/latest
2024-02-22 14:58:04,233 pandasdmx.api - INFO: with headers {'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
2024-02-22 14:58:07,415 pandasdmx.reader.sdmxml - DEBUG: Truncate sub-microsecond time in <Prepared>
2024-02-22 14:58:08,110 pandasdmx.api - INFO: Requesting resource from https://api.data.abs.gov.au/dataflow/ABS/C21_G04_SA2/latest?references=all
2024-02-22 14:58:08,110 pandasdmx.api - INFO: with headers {'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
2024-02-22 14:58:10,468 pandasdmx.reader.sdmxml - DEBUG: Truncate sub-microsecond time in <Prepared>

--- SS without DSD ---
{1: False}

--- <class 'pandasdmx.message.StructureMessage'> ---
{2: <pandasdmx.StructureMessage>
  <Header>
    id: 'IDREF23404'
    prepared: '2024-02-22T13:40:55.674541+11:00'
    receiver: <Agency Unknown>
    sender: <Agency Unknown>
    source: 
    test: False}

--- <class 'pandasdmx.model.DataStructureDefinition'> ---
{'C21_G04_SA2': <DataStructureDefinition ABS:C21_G04_SA2(1.0.0): Census 2021, G04 Age by sex, Main Statistical Areas Level 2 and up (SA2+) Datastructure>}

--- <class 'pandasdmx.model.Agency'> ---
{'ABS': <Agency ABS>}

--- <class 'pandasdmx.model.DataflowDefinition'> ---
{'C21_G04_SA2': <DataflowDefinition ABS:C21_G04_SA2(1.0.0): Census 2021, G04 Age by sex, Main Statistical Areas Level 2 and up (SA2+)>}

--- <class 'pandasdmx.model.CategoryScheme'> ---
{63: <CategoryScheme ABS:C21_ASGS(1.0.0) (5 items): Census 2021>, 64: <CategoryScheme ABS:C21_ASGS(1.0.0) (1 items)>}

--- <class 'pandasdmx.model.Categorisation'> ---
{'CAT_C21_G04_SA2': <Categorisation ABS:CAT_C21_G04_SA2(1.0.0): Census 2021, G04 Age by sex, Main Statistical Areas Level 2 and up (SA2+) Categorisation>}

--- <class 'pandasdmx.model.Codelist'> ---
{'CL_ASGS_2021': <Codelist ABS:CL_ASGS_2021(1.0.0) (2985 items): Australian Statistical Geography Standard (ASGS) Edition 3 - Main Structure>, 'CL_C21_AGEINGP13': <Codelist ABS:CL_C21_AGEINGP13(1.0.0) (102 items): Age, excludes overseas vistitors 13>, 'CL_C21_SEXP01': <Codelist ABS:CL_C21_SEXP01(1.0.0) (3 items): Sex 01>, 'CL_REGION_TYPE': <Codelist ABS:CL_REGION_TYPE(1.0.0) (43 items): Region Type>, 'CL_STATE': <Codelist ABS:CL_STATE(1.0.0) (10 items): State>}

--- <class 'pandasdmx.model.ConceptScheme'> ---
{52106: <ConceptScheme ABS:CS_C21_PERSON(1.0.0) (120 items): Census 2021 Person Concepts>, 52107: <ConceptScheme ABS:CS_C21_PERSON(1.0.0) (1 items)>, 'CS_C21_PERSON': <ConceptScheme ABS:CS_C21_PERSON(1.0.0) (1 items)>, 52118: <ConceptScheme ABS:CS_GEOGRAPHY(1.0.0) (25 items): Geography Concepts>, 52119: <ConceptScheme ABS:CS_GEOGRAPHY(1.0.0) (1 items)>, 'CS_GEOGRAPHY': <ConceptScheme ABS:CS_GEOGRAPHY(1.0.0) (2 items)>, 52134: <ConceptScheme ABS:CS_COMMON(1.0.0) (5 items): Common Concepts>, 52135: <ConceptScheme ABS:CS_COMMON(1.0.0) (1 items)>, 'CS_COMMON': <ConceptScheme ABS:CS_COMMON(1.0.0) (1 items)>}

--- <class 'pandasdmx.model.Annotation'> ---
{'obs_count': Annotation(id='obs_count', title='912186', type='sdmx_metrics', url=None, text=), 52148: Annotation(id=None, title='A', type='ReleaseVersion', url=None, text=)}

--- Name ---
{52149: ('en', 'Availability (A) for C21_G04_SA2')}

--- <class 'pandasdmx.reader.sdmxml.Reference'> ---
{'C21_G04_SA2': <pandasdmx.reader.sdmxml.Reference object at 0x7fdd3198b970>}

--- <class 'pandasdmx.model.MemberSelection'> ---
{52253: <MemberSelection AGEINGP in {'_T', '0', '0_4', '1', '10', '10_14', '11', '12', '13', '14', '15', '15_19', '16', '17', '18', '19', '2', '20', '20_24', '21', '22', '23', '24', '25', '25_29', '26', '27', '28', '29', '3', '30', '30_34', '31', '32', '33', '34', '35', '35_39', '36', '37', '38', '39', '4', '40', '40_44', '41', '42', '43', '44', '45', '45_49', '46', '47', '48', '49', '5', '5_9', '50', '50_54', '51', '52', '53', '54', '55', '55_59', '56', '57', '58', '59', '6', '60', '60_64', '61', '62', '63', '64', '65', '65_69', '66', '67', '68', '69', '7', '70', '70_74', '71', '72', '73', '74', '75', '75_79', '76', '77', '78', '79', '8', '80_84', '85_89', '9', '90_94', '95_99', 'GE100'}>, 52257: <MemberSelection SEXP in {'1', '2', '3'}>, 55239: <MemberSelection REGION in {'1', '101', ...<truncated>..., '9OTER', 'AUS'}>, 55246: <MemberSelection REGION_TYPE in {'AUS', 'GCCSA', 'SA2', 'SA3', 'SA4', 'STE'}>, 55257: <MemberSelection STATE in {'1', '2', '3', '4', '5', '6', '7', '8', '9', 'AUS'}>}

--- <class 'pandasdmx.model.RangePeriod'> ---
{55260: RangePeriod(start=Period(is_inclusive=True, period=datetime.datetime(2021, 1, 1, 0, 0)), end=Period(is_inclusive=True, period=datetime.datetime(2021, 12, 31, 0, 0)))}

<common:KeyValue xmlns:common="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/common" xmlns:message="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message" xmlns:structure="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure" id="TIME_PERIOD">
            <common:TimeRange/></common:KeyValue>

Traceback (most recent call last):
  File "../venv/lib/python3.10/site-packages/pandasdmx/reader/sdmxml.py", line 299, in read_message
    result = func(self, element)
  File "../venv/lib/python3.10/site-packages/pandasdmx/reader/sdmxml.py", line 1189, in _ms
    raise RuntimeError
RuntimeError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "../main.py", line 17, in <module>
    example_msg = abs_xml.dataflow(resource=flow_msg.dataflow.C21_G04_SA2) # get dataflow for C21_G04_SA2
  File "..r/venv/lib/python3.10/site-packages/pandasdmx/api.py", line 514, in get
    msg = reader.read_message(response_content, dsd=kwargs.get("dsd", None))
  File "../venv/lib/python3.10/site-packages/pandasdmx/reader/sdmxml.py", line 316, in read_message
    raise XMLParseError from exc
pandasdmx.exceptions.XMLParseError: RuntimeError
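
The element printed just before the traceback is a `common:KeyValue` for `TIME_PERIOD` whose `common:TimeRange` child is empty. To check whether a raw response contains such elements, something like the following could be run against the XML bytes. It is a sketch using only the standard library; `find_empty_time_ranges` is a hypothetical helper, and the idea that the empty `TimeRange` is what the reader cannot handle is an assumption based on the dump above, not something confirmed in the pandasdmx source.

```python
import xml.etree.ElementTree as ET

# Namespace URI as it appears in the element dumped above
COMMON_NS = "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/common"


def find_empty_time_ranges(xml_bytes):
    """Return the ids of KeyValue elements whose TimeRange has no children."""
    root = ET.fromstring(xml_bytes)
    empty = []
    # iter() also yields the root element itself if it matches
    for key_value in root.iter(f"{{{COMMON_NS}}}KeyValue"):
        time_range = key_value.find(f"{{{COMMON_NS}}}TimeRange")
        if time_range is not None and len(time_range) == 0:
            empty.append(key_value.get("id"))
    return empty


# The element from the dump above, trimmed to the one namespace it uses
sample = (
    b'<common:KeyValue xmlns:common="' + COMMON_NS.encode() + b'" '
    b'id="TIME_PERIOD"><common:TimeRange/></common:KeyValue>'
)
print(find_empty_time_ranges(sample))  # ['TIME_PERIOD']
```

The same function can be pointed at the body of the `references=all` response (for example, the content fetched from the URL shown in the log) to see how many constraints carry an empty time range.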