frankduncan commented 4 years ago

Right now we store the spreadsheet on the hard disk and again in memory. We should instead store it in postgres. This will be a relatively large refactor (relative to the small size of the codebase) as it would involved using an ORM and creating an object structure. This would be a positive refactor, as it would be easier to work with the objects.

The reasoning here is that postgres is more secure. Right now, anyone with filesystem access can look at the data, which may become a problem. It also allows us to do things like split the data off to it's own machine (like heroku), backup more cleanly, migrate more cleanly, etc, etc.

We should also move all the configuration of permissions, wikis, and whatnot that we store via Pickle on the hard disk into a database as well.

In the database, the spreadsheet row columns should be:

key (from the key_colymn)
row_no
sheet_name
raw_row
row_as_dict

We don't need to do any selecting on the row data, as searching is handled by whoosh, so we don't need to create a denormalized table where the row is broken out into a cell table that links back to this row table. We also don't need to worry about changing the rows dynamically as loading up a new spreadsheet should delete and replace them all.

frankduncan commented 4 years ago

Likely TOCs, templates and attachment configuration should be moved to postgres. For the TOCs and templates, using big blob fields is probably fine. The attachments should still be on the server (or in an external source a la #20), but the configuration of them should go in postgres too.

frankduncan commented 4 years ago

We want to use Django ORM to do this initially. I found this gist, which may be a good place to start. The goal is to remove all instances of config and pickle saving/loading from the application, replaced with a database. Then script loading current competitions into the database.

Nolski commented 4 years ago

Hey, just came across this issue. I know @YaxelPerez has been working on this but as per #45 , there seems to be some misalignment between these two issues. I have some suggestions as to how we might be able to align these two which I'll write down below. Happy to hop on a call to discuss further as well.

Suggestions

I suggest we split our data model out into as many tables as there are data types. As I understand it, we have the following persisted data types in torque

Sheet Config
Attachment Config
Proposals (which I understand to be called data in init.py)
Templates
Users (perhaps we could use the built in django model for this)
Edits (defined in DESIGN.md)

I think (unless we want to store the above configuration types in code), these should be our starting tables that we will use in postgres. We should be cognicent to potentially create new tables if we find data types come about within these tables as well. Given our coming needs with GlobalView wanting to gather information on all proposals, I suspect we should do our best to strictly define as many columns as possible (especially the columns which all proposals should have in common, no matter the competition). For data which aren't consistent across competitions, I proposed a method of dealing with that in #45 .

OpenTechStrategies / torque

Move spreadsheet storage to postgres #26

Suggestions