edwardchalstrey1 opened 1 month ago
Because of the way the database/API is set up, it's very hard to retrieve, for a single polity or a set of polities, the values of all the variables.
I have experimented with creating an API endpoint to access all variables associated with a given polity. Perhaps this should be opened as a separate issue on the Seshat API Django app side?
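For illustration, a minimal sketch of what such an endpoint could look like with Django REST framework, assuming a `Polity` model whose variable values hang off reverse relations (the import path, model and relation handling are assumptions, not the real Seshat schema):

```python
# Minimal sketch only: import path and model names are assumptions.
from django.shortcuts import get_object_or_404
from rest_framework.decorators import api_view
from rest_framework.response import Response

from seshat.models import Polity  # hypothetical import path


@api_view(["GET"])
def polity_variables(request, polity_id):
    """Return every variable value linked to a single polity."""
    polity = get_object_or_404(Polity, id=polity_id)
    data = {"polity_id": polity.id, "variables": {}}
    # Walk the reverse relations generically instead of naming every
    # variable table by hand; skip reverse one-to-one relations, whose
    # accessors return instances rather than managers.
    for rel in polity._meta.related_objects:
        if not (rel.one_to_many or rel.many_to_many):
            continue
        accessor = rel.get_accessor_name()
        rows = list(getattr(polity, accessor).all().values())
        if rows:
            data["variables"][rel.related_model.__name__] = rows
    return Response(data)
```

Wired up to a URL pattern such as `api/polities/<int:polity_id>/variables/`, this would answer the use case above in one request.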
@kallewesterling Ok, good idea - it might be worth coordinating with @matildaperuzzo on that, as she may be writing code that creates data in this format after downloading it via the API - but if the data can already be retrieved that way, this seems better.
Yeah -- Django has ways of optimising queries for Postgres, so it's definitely worth looking into. @matildaperuzzo, would you be able to share your code somehow?
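As a concrete example of the kind of optimisation meant here (model and relation names are illustrative, not the real Seshat schema), `prefetch_related` turns an N+1 query pattern into a fixed number of queries:

```python
# Illustrative sketch: "Polity" and the "social_complexity_values"
# reverse relation are assumed names, not the actual Seshat models.
from seshat.models import Polity  # hypothetical import path

polity_ids = [1, 2, 3]  # whatever subset we care about
polities = (
    Polity.objects
    .filter(id__in=polity_ids)
    .prefetch_related("social_complexity_values")  # one batched extra query instead of one per polity
)
for polity in polities:
    for value in polity.social_complexity_values.all():  # served from the prefetch cache
        print(polity.id, value)
```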
In the code I've written, the first thing I do after downloading the data is group it by polity. I can imagine that being able to pull data by polity would have been useful when writing my script, and it could definitely help with debugging or with projects that are only concerned with specific regions.

For the template I spoke about in the workshop, the only change would have been to gather by polity and then sort by variable, rather than gather by variable and then sort by polity. And given that I don't want to download all variables every time, but I do want all polities, the current way is more effective for my use case.

My code is in the SeshatDatasetAnalysis git folder, in the Template.py file. The snippet below shows the add_dataset function, which is called for every downloaded variable; it splits the dataset by polity and then adds each polity to the template in the row for that variable.
```python
import time


# Method of the Template class; shown standalone here.
def add_dataset(self, key, url):
    """Download one variable's dataset and add it to the template, polity by polity."""
    # Check if the dataset is already in the dataframe
    if key in self.template.columns:
        print(f"Dataset {key} already in dataframe")
        return
    # Download the data, timing the request
    tic = time.time()
    df = download_data(url)  # helper defined elsewhere in Template.py
    toc = time.time()
    print(f"Downloaded {key} dataset with {len(df)} rows in {toc - tic} seconds")
    if len(df) == 0:
        print(f"Empty dataset for {key}")
        return
    variable_name = df.name.unique()[0].lower()
    # Range variables come with <name>_from / <name>_to columns
    range_var = variable_name + "_from" in df.columns
    col_name = key.split('/')[-1]
    self.add_empty_col(col_name)
    # Split the dataset by polity and add each polity's rows to the
    # template in the row for this variable
    polities = self.template.PolityID.unique()
    for pol in polities:
        pol_df = df.loc[df.polity_id == pol]
        if pol_df.empty:
            continue
        self.add_polity(pol_df, range_var, variable_name, col_name)
    self.perform_tests(df, variable_name, range_var, col_name)
    print(f"Added {key} dataset to template")
```
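For context, a hypothetical driver loop for the method above might look like this; the class name, base URL and variable keys are placeholders, not real Seshat API paths:

```python
# Hypothetical usage: class name, base URL and variable keys are placeholders.
from Template import Template

template = Template()
base_url = "https://example.org/api/"
for key in ["sc/polity-populations", "sc/polity-territories"]:
    template.add_dataset(key, base_url + key)
```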
Description of Improvement
Initial Hypotheses/ideas:
After having read this paper:
Notebook idea: rather than replicating the Principal Component Analysis, which Matilda is doing, a simpler ML notebook could involve:
Dependencies
No response
Technical Notes
Definition of Done