Attol8 / istatapi

Python API for ISTAT (The Italian National Institute of Statistics)
https://attol8.github.io/istatapi/
Apache License 2.0

New Functionality of filtering dataframe by region #22

Closed fraFerrari99 closed 8 months ago

fraFerrari99 commented 8 months ago

Hi, your work is incredible, but I want to ask whether there is any way to filter all the dataframes by a region that you pass in. I tried to implement it myself, but it was very slow, since it needed to iterate through every dataframe, open it, look at its available regions, and check whether the region passed in is among them.

Is there a more efficient way to do this? Thank you so much again for your work!

Attol8 commented 8 months ago

Hi @fraFerrari99, thanks for the kind words 😃

If I understand correctly, you want to filter by region all the dataframes available in discovery.all_available()? If that's the case, I am afraid there is no way to do so, as that would require loading every ISTAT dataframe first, even the ones you do not need. Plus, I believe that region is not available as a dimension in every dataset.

Can you please give me more concrete examples of what you are trying to achieve so that I can look into the issue? What dataframes would you need to load and then filter?

fraFerrari99 commented 8 months ago

Yeah, I was looking for a way to filter all the available datasets by passing a specific region code, but I ran into the problems you described, above all that it was very, very slow.

For this reason I hoped that you could find a way to do it, since it could be a very helpful functionality, but no worries, I will find another way to solve this problem!

Thanks!

Attol8 commented 8 months ago

That is not a functionality we can support, as loading all the datasets would also be expensive in terms of requests to the server. I suggest you:

  1. Create a list of the datasets you want to load
  2. Loop through the list and check whether region is available as a filter
  3. If it is, retrieve the data

Below is some pseudo-code to do so. I am not sure what the region dimension is called or what values it contains. It should be something like:

from istatapi import discovery, retrieval

def retrieve_datasets_with_region_filter(dataflow_identifiers, region, region_dimension="region"):
    """
    Retrieve datasets with a 'region' filter applied for a list of dataflow identifiers.

    Parameters:
    - dataflow_identifiers (list of str): List of dataflow identifiers.
    - region (str): The region to filter the data.
    - region_dimension (str): Name of the region dimension. The default "region"
      is a guess; check dimensions_info() for the actual name.

    Returns:
    - list: A list of dataframes for datasets where the 'region' filter was applicable.
    """
    retrieved_dataframes = []

    for identifier in dataflow_identifiers:
        try:
            ds = discovery.DataSet(dataflow_identifier=identifier)
            # Names of the dimensions this dataset can be filtered on
            available_dimensions = ds.dimensions_info().dimension.unique()

            # Only request the data when the region dimension exists
            if region_dimension in available_dimensions:
                ds.set_filters(**{region_dimension: region})
                df = retrieval.get_data(ds)
                retrieved_dataframes.append(df)

        except Exception as e:
            # Report the failure and continue with the next identifier
            print(f"An error occurred while processing {identifier}: {e}")

    return retrieved_dataframes

# Example usage
dataflow_identifiers = ["139_176", "200_300", "123_456"]  # Replace with actual identifiers
region = "your_region"
data_frames = retrieve_datasets_with_region_filter(dataflow_identifiers, region)
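
Since the name of the region dimension is not confirmed above, one option is to try a few candidate names against the dimensions a dataset reports. The sketch below is only an illustration: `pick_region_dimension` is a hypothetical helper (not part of istatapi), and the candidate names are guesses that should be checked against the output of `ds.dimensions_info()` for real datasets.

```python
from typing import Iterable, Optional

# Hypothetical candidate names for the territorial dimension. These are
# guesses, not verified against the ISTAT API.
CANDIDATE_REGION_DIMENSIONS = ("region", "itter107", "ref_area")


def pick_region_dimension(
    available_dimensions: Iterable[str],
    candidates: Iterable[str] = CANDIDATE_REGION_DIMENSIONS,
) -> Optional[str]:
    """Return the first available dimension matching a candidate name, or None."""
    # Map lowercase names back to their original spelling so the
    # comparison is case-insensitive.
    by_lower = {str(name).lower(): str(name) for name in available_dimensions}
    for candidate in candidates:
        match = by_lower.get(candidate.lower())
        if match is not None:
            return match
    return None
```

Inside the loop above, the returned name (when it is not None) could be passed on as `ds.set_filters(**{name: region})` instead of hard-coding the keyword.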

fraFerrari99 commented 8 months ago

Yeah, I also thought about a solution like this, but I was hoping (it was a dream) to do this with the first call made to the ISTAT API and quickly retrieve all the filtered datasets.

Thank you so much for the quick answer and for your great work!