annehaley closed this PR 11 months ago
Considering this, developers can use this branch with its new migration history while the Dockerfile refers to the new database location, and they can use other branches with their old database location. Once this is merged, the old database locations will no longer be needed.
The local storage of volume data is meant as a temporary measure for reviewing/testing out this PR, correct (i.e. wouldn't be merged to master)? This configuration makes it otherwise pesky to do a local wipe of the database.
Sure, this change can be removed before merging. I find it convenient that I can see the volume-mounted folders in the project location, but I understand that's just a personal preference. Besides, it deviates from the Girder 4 setup, so this change can just be scoped to this WIP.
I made another iteration on the design (aaa1c7f), with the following changes @AlmightyYakob and I have discussed:

- `NetworkNode`s and `NetworkEdge`s, as suggested above; `OriginalRegion` and `DerivedRegion` just have some repeated fields instead.
- Additionally, `DerivedRegion` now has a reference to a `VectorDataSource` as its map representation, since it doesn't have a reference to a `Dataset`. We discussed how the region models themselves are internal representations, just like network nodes and edges. For visualization on the map, we should rely on `DataSource`-type objects. Therefore, `DerivedRegion`s need a reference to one of these, which we know should be of the `Vector` type.
- Removed the `DataCollection` model. This is a concept we should keep in mind for future iterations, but it doesn't make sense at the moment. This concept is intended for a user to curate their own collections, so this model would likely be tied to a `User` object. At this stage, we disregard `User` objects and have not implemented authentication, so these objects would be largely purposeless. We will need them in the future, though, and they will be easy to add in later, since they are outside of the main web of models.
I think the problem with this is that derived regions can be "derived" from regions taken from multiple datasets. I've been thinking about the lingering gaps in our modeling, and I'm starting to think we need an additional data source, `DerivedRegionDataSource`. This would replace `DerivedRegion` entirely and would largely contain the same fields.

I also think that rather than pointing to a `Dataset`, a `Region` should refer to a `VectorDataSource` (which in turn refers to a dataset), since I think that more closely models what we're trying to achieve with the `DataSource` abstraction. What do you think? I've included a diagram below to represent this configuration.
```mermaid
---
title: UVDAT ER Diagram
---
erDiagram
    Dataset {
        CharField name
        TextField description
        ForeignKey city
    }
    DerivedRegionDataSource {
        CharField name
        ForeignKey city
        S3FileField geojson_data
        ManyToManyField source_regions
    }
    VectorDataSource {
        ForeignKey dataset
        JSONField metadata
        S3FileField geojson_data
    }
    Region {
        BigAutoField id
        CharField name
        JSONField properties
        MultiPolygonField boundary
        ForeignKey data_source
    }
    %% Relations
    Dataset }|--|| City : city
    VectorDataSource }|--|| Dataset : dataset
    Region }|--|| VectorDataSource : data_source
    DerivedRegionDataSource }|--|{ Region : source_regions
    DerivedRegionDataSource }|--|| City : city
```
I'm not sure we would be able to capture all we want to represent for the `DerivedRegion` in a `DataSource`. The goal for the `DataSource` type is to separate the map representation from the internal representation. For the map representation of a `DerivedRegion`, a `VectorDataSource` has everything we need. I think the internal representation of the boundary and the original regions should be separate from this. With the current model, it's fine if the original regions are from multiple datasets; the `DerivedRegion` is made independent from any datasets, and a `VectorDataSource` (with `dataset=null`) can be made for it.
The way I think of the `DataSource` concept is as an implementation detail for visualization. Ideally, a user should be able to interact with every other model without concerning themselves with the `DataSource`s. They can fetch `Dataset`s and their `FileItem`s, or they can fetch `NetworkNode`s and `NetworkEdge`s related to a `Dataset`, or they can fetch `OriginalRegion`s and `DerivedRegion`s in a `City`. An advanced user may be interested in the converted data and may look at the `DataSource`s, but the average user would only want the main models. Thus, the `DataSource`-type models shouldn't contain any data essential to the objects themselves. They should be largely hidden from the user and consumed only by the map viz.
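The split described above — core models for ordinary queries, `DataSource` records only for the map viz — could be sketched like this. This is a toy illustration over in-memory dicts, not the project's actual API; the helper names and sample records are invented.

```python
# Toy illustration of keeping DataSource-type records out of the main
# endpoints: ordinary listings return only the core models, and the map viz
# asks a separate helper for the converted data.
datasets = [
    {"id": 1, "name": "10-Year Floods", "city": "Boston"},
]
data_sources = [  # converted/visualization representations, keyed by dataset id
    {"dataset_id": 1, "type": "vector", "geojson_data": "floods.geojson"},
]

def list_datasets():
    """What an average user fetches: core model fields only, no DataSources."""
    return [{"id": d["id"], "name": d["name"]} for d in datasets]

def map_layers(dataset_id):
    """What the map viz consumes: the DataSource records for one dataset."""
    return [s for s in data_sources if s["dataset_id"] == dataset_id]

print(list_datasets())  # no DataSource fields leak into the main listing
print(map_layers(1))    # only the map viz touches the converted data
```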
With the above consideration of how I think of `DataSource`s, I think it would be more appropriate to make `Chart`s as they were before. They should be a sibling of `Dataset`, not a child of `AbstractDataSource`. They aren't intended to be used by the map viz. I've made these changes in 1afb1f4.
This makes sense to me, and resolves my questions/concerns with `DerivedRegion`. I guess for that specific use case, we would use the `city` field to retrieve the `DerivedRegion`s, and when it came time to view them on the map, we would use the `VectorDataSource` to retrieve that data.
My only other concern, then, is making sure the original use case regarding time series is covered. I'm a bit fuzzy on that; maybe we should discuss it offline at some point.
Sounds good to me. I spent today on bdcece6, which fits our data into these new models. I tried combining the flood area datasets (grouping them by type) as a test for multiple `DataSource`s on the same `Dataset`. We would do the same thing with the time series data.
Let me know what you think. We can discuss more tomorrow or next week if you'd like.
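The grouping described above (one `Dataset`, several `DataSource`s, one per type) could look roughly like the following sketch. The file names and type labels here are made up for illustration; this is not the actual populate-script code.

```python
# Hypothetical sketch of combining the flood area files into one Dataset with
# multiple data sources: group the source files by type, then emit one
# data-source record per group.
from itertools import groupby

files = [
    {"name": "sea_level_1ft.geojson", "type": "Sea Level Rise"},
    {"name": "sea_level_2ft.geojson", "type": "Sea Level Rise"},
    {"name": "flood_10yr.geojson", "type": "10-Year Flood"},
    {"name": "flood_100yr.geojson", "type": "100-Year Flood"},
]

dataset = {"name": "Flood Areas", "data_sources": []}
# groupby needs its input sorted by the same key it groups on
for kind, group in groupby(sorted(files, key=lambda f: f["type"]),
                           key=lambda f: f["type"]):
    dataset["data_sources"].append({
        "type": kind,
        "files": [f["name"] for f in group],
    })

print(len(dataset["data_sources"]))  # 3 data sources on one Dataset
```

The same pattern would apply to the time-series data: one group (and thus one data source) per time step or simulation type.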
After adjusting the populate script to fit the latest design iteration, this is a visual representation of the objects that are created by the populate script, as well as their relationships. @AlmightyYakob I remember you had mentioned that seeing how our current data fits in the models would be helpful. Let me know what you think.
Some items are gray because the object isn't actually created yet, but the expected relationships are shown. No `DerivedRegion` or `SimulationResult` objects are created by the populate script, but when they are created by the user, their relationships would be as shown. Also, we have not added the "Flood Time Series Simulation" data yet, but the relationships would be similar to those of the "Sea Level Rises", "10-Year Floods", and "100-Year Floods" Datasets.
384194e adds a txt file showing the printed output from running the populate script.
That visual representation of our datasets makes things clear to me, thanks! I think I'm liking the way things look and feel; my only remaining question is how we plan to go about the implementation. The changes involved are very large. Would it make sense to have a long-running base branch that we would make PRs against? That way, we're not reviewing the entire diff against master every time we want to make a change. I think we'll be iterating on the new design and functionality for a while.
I agree. I think this PR is going to be scoped to the models, REST endpoints, and populate script. Other changes, such as testing and web client updates, will go in separate PRs based on this branch. Let's still wait to merge this, though, until those other facets are complete too (so we don't have a non-functional master for a while).
I've already made a PR for tests (#33) and I'm starting the web client updates today.
This PR is the first in a series of three PRs in a large redesign effort. This PR is scoped to include changes to database models, their REST endpoints, their conversion functions, and the ingest process that creates objects from our sample data (the `populate` management command). Follow-up PRs address the other facets of the redesign effort. Ideally, we should prepare and review them all together so that they can be merged all at once.