edwardchalstrey1 opened 1 month ago
Because of the way the database/API is set up, it's very hard to retrieve, for a single polity or a set of polities, the values of all the variables.
I have experimented with creating an API endpoint to access all variables associated with a given polity. Perhaps this should be opened as a separate issue on the Seshat API Django app side?
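For illustration, a minimal sketch of what such an endpoint could look like with Django REST framework, assuming a `Polity` model whose variable values hang off reverse relations (the import path, model and relation handling are assumptions, not the real Seshat schema):

```python
# Minimal sketch only: import path and model names are assumptions.
from django.shortcuts import get_object_or_404
from rest_framework.decorators import api_view
from rest_framework.response import Response

from seshat.models import Polity  # hypothetical import path


@api_view(["GET"])
def polity_variables(request, polity_id):
    """Return every variable value linked to a single polity."""
    polity = get_object_or_404(Polity, id=polity_id)
    data = {"polity_id": polity.id, "variables": {}}
    # Walk the reverse relations generically instead of naming every
    # variable table by hand; skip reverse one-to-one relations, whose
    # accessors return instances rather than managers.
    for rel in polity._meta.related_objects:
        if not (rel.one_to_many or rel.many_to_many):
            continue
        accessor = rel.get_accessor_name()
        rows = list(getattr(polity, accessor).all().values())
        if rows:
            data["variables"][rel.related_model.__name__] = rows
    return Response(data)
```

Wired up to a URL pattern such as `api/polities/<int:polity_id>/variables/`, this would answer the use case above in one request.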
@kallewesterling Ok, good idea - it might be worth coordinating with @matildaperuzzo on that, as she may be writing code that creates data in this format after downloading it via the API - but if the data can already be retrieved that way, this seems better.
Yeah -- Django has ways of optimising queries for Postgres, so it's definitely worth looking into. @matildaperuzzo, would you be able to share your code somehow?
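As a concrete example of the kind of optimisation meant here (model and relation names are illustrative, not the real Seshat schema), `prefetch_related` turns an N+1 query pattern into a fixed number of queries:

```python
# Illustrative sketch: "Polity" and the "social_complexity_values"
# reverse relation are assumed names, not the actual Seshat models.
from seshat.models import Polity  # hypothetical import path

polity_ids = [1, 2, 3]  # whatever subset we care about
polities = (
    Polity.objects
    .filter(id__in=polity_ids)
    .prefetch_related("social_complexity_values")  # one batched extra query instead of one per polity
)
for polity in polities:
    for value in polity.social_complexity_values.all():  # served from the prefetch cache
        print(polity.id, value)
```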
In the code I've written, the first thing I do after downloading the data is group it by polity. I can imagine that being able to pull data by polity would have been useful when writing my script, and it could definitely help with debugging or with projects that are only concerned with specific regions.

For the template I spoke about in the workshop, the only change would have been to gather by polity and then sort by variable, rather than gather by variable and then sort by polity. And given that I don't want to download all variables every time, but I do want all polities, the current way is more effective for my use case.

My code is in the SeshatDatasetAnalysis git folder, in the Template.py file. The snippet below shows the add_dataset function, which is called for every downloaded variable; it splits the dataset by polity and then adds each polity to the template in the row for that variable.
```python
import time


# Method of the Template class; shown standalone here.
def add_dataset(self, key, url):
    """Download one variable's dataset and add it to the template, polity by polity."""
    # Check if the dataset is already in the dataframe
    if key in self.template.columns:
        print(f"Dataset {key} already in dataframe")
        return
    # Download the data, timing the request
    tic = time.time()
    df = download_data(url)  # helper defined elsewhere in Template.py
    toc = time.time()
    print(f"Downloaded {key} dataset with {len(df)} rows in {toc - tic} seconds")
    if len(df) == 0:
        print(f"Empty dataset for {key}")
        return
    variable_name = df.name.unique()[0].lower()
    # Range variables come with <name>_from / <name>_to columns
    range_var = variable_name + "_from" in df.columns
    col_name = key.split('/')[-1]
    self.add_empty_col(col_name)
    # Split the dataset by polity and add each polity's rows to the
    # template in the row for this variable
    polities = self.template.PolityID.unique()
    for pol in polities:
        pol_df = df.loc[df.polity_id == pol]
        if pol_df.empty:
            continue
        self.add_polity(pol_df, range_var, variable_name, col_name)
    self.perform_tests(df, variable_name, range_var, col_name)
    print(f"Added {key} dataset to template")
```
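For context, a hypothetical driver loop for the method above might look like this; the class name, base URL and variable keys are placeholders, not real Seshat API paths:

```python
# Hypothetical usage: class name, base URL and variable keys are placeholders.
from Template import Template

template = Template()
base_url = "https://example.org/api/"
for key in ["sc/polity-populations", "sc/polity-territories"]:
    template.add_dataset(key, base_url + key)
```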
Description of Improvement
Initial Hypotheses/ideas:
After having read this paper:
Notebook idea: rather than replicating the Principal Component Analysis, which Matilda is doing, a simpler ML notebook could involve:
Dependencies
No response
Technical Notes
Definition of Done