Seshat-Global-History-Databank / seshat_api

A Python package for interacting with the Seshat API.
MIT License

✨ [Feature request] - Add Seshat API example notebook showing prediction of complexity characteristics #17

Open edwardchalstrey1 opened 1 month ago

edwardchalstrey1 commented 1 month ago

Description of Improvement

Initial Hypotheses/ideas:

After reading this paper:

Notebook idea: rather than replicating the Principal Component Analysis, which Matilda is doing, a simpler ML notebook could involve:

  1. loading the data for several CCs that the paper says are linked
  2. Training a model to predict one CC based on others
  3. Evaluating the performance of the model
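The three steps above could be sketched roughly as follows. This is only an illustration: the data is synthetic, and the CC names (`polpop`, `polterr`, `cap_pop`) are hypothetical stand-ins, not actual Seshat API fields; a real notebook would load these values via the API instead.

```python
# Sketch of the proposed notebook flow, using synthetic stand-in data.
# Column names ("polpop", "polterr", "cap_pop") are hypothetical CC labels,
# NOT the actual Seshat API field names.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# 1. Load data for several linked CCs (here: randomly generated, correlated stand-ins).
rng = np.random.default_rng(0)
n = 200
polpop = rng.normal(6, 1, n)                    # e.g. log10 polity population
polterr = 0.8 * polpop + rng.normal(0, 0.3, n)  # correlated CC
cap_pop = 0.6 * polpop + rng.normal(0, 0.3, n)  # correlated CC
ccs = pd.DataFrame({"polpop": polpop, "polterr": polterr, "cap_pop": cap_pop})

# 2. Train a model to predict one CC based on the others.
X = ccs[["polterr", "cap_pop"]]
y = ccs["polpop"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# 3. Evaluate the performance of the model on held-out data.
r2 = r2_score(y_test, model.predict(X_test))
print(f"R^2 on held-out data: {r2:.2f}")
```

Any regressor could be swapped in for the linear model; the point is only the load/train/evaluate shape of the notebook.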

Dependencies

No response

Technical Notes

Definition of Done

kallewesterling commented 2 days ago

Because of the way the database/API is set up, it's very hard to get, for a single polity or set of polities, all the values of all the variables.

I have experimented with creating an API endpoint to access all variables associated with a given polity. Perhaps this should be opened as a different issue on the side of the Seshat API Django app?

edwardchalstrey1 commented 1 day ago

> Because of the way the database/API is set up, it's very hard to get, for a single polity or set of polities, all the values of all the variables.
>
> I have experimented with creating an API endpoint to access all variables associated with a given polity. Perhaps this should be opened as a different issue on the side of the Seshat API Django app?

@kallewesterling Ok, good idea - it might be worth coordinating with @matildaperuzzo on that, as she may be writing code that creates data in that format after downloading via the API. But if it can already be retrieved that way, this seems better.

kallewesterling commented 1 day ago

Yeah -- Django has ways of optimising queries for Postgres, so it's definitely worth us looking into. @matildaperuzzo, would you be able to share your code somehow, do you think?

matildaperuzzo commented 1 day ago

In the code that I write, the first thing I do after downloading the data is to group it by polity. I can imagine that being able to pull data by polity would have been useful when writing my script, and it could definitely help with debugging, or with projects that are only concerned with specific regions.

For the making of the template that I spoke of in the workshop, the only change would have been to gather by polity and then sort by variable, rather than gather by variable and then sort by polity. And given that I don't want to download all variables every time, but I do want to download all polities, the current way is more effective.

My code is in the SeshatDatasetAnalysis git repository, in the Template.py file. The code below is a snippet showing the add_dataset function, which is called for every downloaded variable; inside, it splits the dataset by polity and then adds each polity to the template in the row for that variable.

```python
def add_dataset(self, key, url):
    # check if the dataset is already in the dataframe
    if key in self.template.columns:
        print(f"Dataset {key} already in dataframe")
        return

    # download the data
    tic = time.time()
    df = download_data(url)
    toc = time.time()
    print(f"Downloaded {key} dataset with {len(df)} rows in {toc - tic} seconds")
    if len(df) == 0:
        print(f"Empty dataset for {key}")
        return

    # a "range" variable is stored as separate <variable>_from/<variable>_to columns
    variable_name = df.name.unique()[0].lower()
    range_var = variable_name + "_from" in df.columns
    col_name = key.split('/')[-1]
    self.add_empty_col(col_name)
    polities = self.template.PolityID.unique()

    # split the dataset by polity and add each polity's rows to the template
    for pol in polities:
        pol_df = df.loc[df.polity_id == pol]
        if pol_df.empty:
            continue
        self.add_polity(pol_df, range_var, variable_name, col_name)

    self.perform_tests(df, variable_name, range_var, col_name)
    print(f"Added {key} dataset to template")
```
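As a minimal illustration of the split-by-polity step inside add_dataset, the same grouping can be expressed with a pandas groupby. The `polity_id` column name matches the snippet above, but the data here is made up:

```python
import pandas as pd

# Toy stand-in for a downloaded variable dataset; values are invented.
df = pd.DataFrame({
    "polity_id": [1, 1, 2, 3, 3, 3],
    "name": ["polpop"] * 6,
    "value": [10, 12, 7, 3, 4, 5],
})

# Equivalent of the per-polity loop in add_dataset:
# one sub-frame per distinct polity_id.
by_polity = {pol: pol_df for pol, pol_df in df.groupby("polity_id")}

for pol, pol_df in sorted(by_polity.items()):
    print(f"polity {pol}: {len(pol_df)} rows")
```

A groupby avoids scanning the full dataframe once per polity, which may matter if the list of polities is long.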