annehaley closed this PR 11 months ago
Considering this, developers can use this branch with its new migration history while the Dockerfile refers to the new database location, and they can use other branches with their old database location. Once this is merged, the old database locations will no longer be needed.
The local storage of volume data is meant as a temporary measure for reviewing/testing out this PR, correct (i.e. wouldn't be merged to master)? This configuration makes it otherwise pesky to do a local wipe of the database.
Sure, this change can be removed before merging. I find it convenient that I can see the volume-mounted folders in the project location, but I understand that's just a personal preference. Besides, it deviates from the Girder 4 setup, so this change can just be scoped to this WIP.
I made another iteration on the design (aaa1c7f), with the following changes @AlmightyYakob and I have discussed:

- `NetworkNode`s and `NetworkEdge`s, as suggested above; `OriginalRegion` and `DerivedRegion` just have some repeated fields instead.
- Additionally, `DerivedRegion` now has a reference to a `VectorDataSource` as its map representation, since it doesn't have a reference to a `Dataset`. We discussed how the region models themselves are internal representations, just like network nodes and edges. For visualization on the map, we should rely on `DataSource`-type objects. Therefore, `DerivedRegion`s need a reference to one of these, which we know should be of the `Vector` type.
- Removed the `DataCollection` model. This is a concept we should keep in mind for future iterations, but it doesn't make sense at the moment. This concept is intended for a user to curate their own collections, so this model would likely be tied to a `User` object. At this stage, we disregard `User` objects and have not implemented authentication, so these objects would be largely purposeless. We will need them in the future, though, and they will be easy to add in later, since they are outside of the main web of models.
I think the problem with this is that derived regions can be "derived" from regions taken from multiple datasets. I've been thinking about the lingering gaps in our modeling, and I'm starting to think we need an additional data source, `DerivedRegionDataSource`. This would replace `DerivedRegion` entirely and would largely contain the same fields.

I also think that rather than pointing to a `Dataset`, a `Region` should refer to a `VectorDataSource` (which in turn refers to a dataset), since I think that more closely models what we're trying to achieve with the `DataSource` abstraction. What do you think? I've included a diagram below to represent this configuration.
```mermaid
---
title: UVDAT ER Diagram
---
erDiagram
    Dataset {
        CharField name
        TextField description
        ForeignKey city
    }
    DerivedRegionDataSource {
        CharField name
        ForeignKey city
        S3FileField geojson_data
        ManyToManyField source_regions
    }
    VectorDataSource {
        ForeignKey dataset
        JSONField metadata
        S3FileField geojson_data
    }
    Region {
        BigAutoField id
        CharField name
        JSONField properties
        MultiPolygonField boundary
        ForeignKey data_source
    }
    %% Relations
    Dataset }|--|| City : city
    VectorDataSource }|--|| Dataset : dataset
    Region }|--|| VectorDataSource : data_source
    DerivedRegionDataSource }|--|{ Region : source_regions
    DerivedRegionDataSource }|--|| City : city
```
I'm not sure we would be able to capture all we want to represent for the `DerivedRegion` in a `DataSource`. The goal for the `DataSource` type is to separate the map representation from the internal representation. For the map representation of a `DerivedRegion`, a `VectorDataSource` has everything we need. I think the internal representation of the boundary and the original regions should be separate from this. With the current model, it's fine if the original regions are from multiple datasets; the `DerivedRegion` is made independent from any datasets, and a `VectorDataSource` (with `dataset=null`) can be made for it.
The way I think of the `DataSource` concept is as an implementation detail for visualization. Ideally, a user should be able to interact with every other model without concerning themselves with the `DataSource`s. They can fetch `Dataset`s and their `FileItem`s, or they can fetch `NetworkNode`s and `NetworkEdge`s related to a `Dataset`, or they can fetch `OriginalRegion`s and `DerivedRegion`s in a `City`. An advanced user may be interested in the converted data and may look at the `DataSource`s, but the average user would only want the main models. Thus, the `DataSource`-type models shouldn't contain any data essential to the objects themselves. They should be largely hidden from the user and consumed only by the map viz.
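The split described above — core models for ordinary queries, `DataSource` records only for the map viz — could be sketched like this. This is a toy illustration over in-memory dicts, not the project's actual API; the helper names and sample records are invented.

```python
# Toy illustration of keeping DataSource-type records out of the main
# endpoints: ordinary listings return only the core models, and the map viz
# asks a separate helper for the converted data.
datasets = [
    {"id": 1, "name": "10-Year Floods", "city": "Boston"},
]
data_sources = [  # converted/visualization representations, keyed by dataset id
    {"dataset_id": 1, "type": "vector", "geojson_data": "floods.geojson"},
]

def list_datasets():
    """What an average user fetches: core model fields only, no DataSources."""
    return [{"id": d["id"], "name": d["name"]} for d in datasets]

def map_layers(dataset_id):
    """What the map viz consumes: the DataSource records for one dataset."""
    return [s for s in data_sources if s["dataset_id"] == dataset_id]

print(list_datasets())  # no DataSource fields leak into the main listing
print(map_layers(1))    # only the map viz touches the converted data
```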
With the above consideration of how I think of `DataSource`s, I think it would be more appropriate to make `Chart`s as they were before. They should be a sibling of `Dataset`, not a child of `AbstractDataSource`. They aren't intended to be used by the map viz. I've made these changes in 1afb1f4.
This makes sense to me, and resolves my questions/concerns with `DerivedRegion`. I guess for that specific use case, we would use the `city` field to retrieve the `DerivedRegion`s, and when it came time to view them on the map, we would use the `VectorDataSource` to retrieve that data.
My only other concern, then, is making sure the original use case regarding time series is covered. I'm a bit fuzzy on that; maybe we should discuss it offline at some point.
Sounds good to me. I spent today on bdcece6, which fits our data into these new models. I tried combining the flood area datasets (grouping them by type) as a test for multiple `DataSource`s on the same `Dataset`. We would do the same thing with the time series data.
Let me know what you think. We can discuss more tomorrow or next week if you'd like.
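The grouping described above (one `Dataset`, several `DataSource`s, one per type) could look roughly like the following sketch. The file names and type labels here are made up for illustration; this is not the actual populate-script code.

```python
# Hypothetical sketch of combining the flood area files into one Dataset with
# multiple data sources: group the source files by type, then emit one
# data-source record per group.
from itertools import groupby

files = [
    {"name": "sea_level_1ft.geojson", "type": "Sea Level Rise"},
    {"name": "sea_level_2ft.geojson", "type": "Sea Level Rise"},
    {"name": "flood_10yr.geojson", "type": "10-Year Flood"},
    {"name": "flood_100yr.geojson", "type": "100-Year Flood"},
]

dataset = {"name": "Flood Areas", "data_sources": []}
# groupby needs its input sorted by the same key it groups on
for kind, group in groupby(sorted(files, key=lambda f: f["type"]),
                           key=lambda f: f["type"]):
    dataset["data_sources"].append({
        "type": kind,
        "files": [f["name"] for f in group],
    })

print(len(dataset["data_sources"]))  # 3 data sources on one Dataset
```

The same pattern would apply to the time-series data: one group (and thus one data source) per time step or simulation type.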
After adjusting the populate script to fit the latest design iteration, this is a visual representation of the objects that are created by the populate script, as well as their relationships. @AlmightyYakob I remember you had mentioned that seeing how our current data fits in the models would be helpful. Let me know what you think.
Some items are gray because the object isn't actually created yet, but the expected relationships are shown. No `DerivedRegion` or `SimulationResult` objects are created by the populate script, but when they are created by the user, their relationships would be as shown. Also, we have not added the "Flood Time Series Simulation" data yet, but the relationships would be similar to those of the "Sea Level Rises", "10-Year Floods", and "100-Year Floods" Datasets.
384194e adds a txt file showing the printed output from running the populate script.
That visual representation of our datasets makes things clear to me, thanks! I think I'm liking the way things look and feel; my only remaining question is how we plan to go about the implementation. The changes involved are very large. Would it make sense to have a long-running base branch that we would make PRs against? That way, we're not reviewing the entire diff against master every time we want to make a change. I think we'll be iterating on the new design and functionality for a while.
I agree. I think this PR is going to be scoped to the models, REST endpoints, and populate script. Other changes, such as testing and web client updates, will go in separate PRs based on this branch. Let's still wait to merge this, though, until those other facets are complete too (so we don't have a non-functional master for a while).
I've already made a PR for tests (#33) and I'm starting the web client updates today.
This PR is the first in a series of three PRs in a large redesign effort. This PR is scoped to include changes to database models, their REST endpoints, their conversion functions, and the ingest process that creates objects from our sample data (the `populate` management command). Follow-up PRs address the other facets of the redesign effort. Ideally, we should prepare and review them all together so that they can be merged all at once.