SnowEx / snowexsql

A project to access the SnowEx database holding data from the NASA SnowEx campaign
https://snowexsql.readthedocs.io/en/latest/readme.html
GNU General Public License v3.0
23 stars 6 forks source link

Observer relationship #137

Open jomey opened 1 week ago

jomey commented 1 week ago

Question

Should we move the Observer relationship away from point and layers and add to the Site (#127)?

Think we could reduce data redundancy and increase clarity by moving the observer information away from an individual point or layer measurement and onto the site. Right now we have two link tables with lots of entries.

Moving this information was motivated by looking at the header of each pit sheet image

Where each location is visit by a team and all measurements/obs are taken by them.

For auxiliary observations like GPR/SMP/SSA/Depth transects we could still ink each measurement back to the site or create a new entry in the site table if it does not belong to one in particular.

Thoughts @micah-prime @micahjohnson150 @meganmason ?

jomey commented 1 week ago
erDiagram
  Site }|--|{ SiteObserver : "has many"
  SiteObserver }|--|{ Observer : "has many"
  Site {
    string PlotID
  }
  SiteObserver {
    int site_id
    int observer_id
  }
  Observer {
    string name
  }
  Site ||--|{ PointData : contains
  Site ||--|{ LayerData: contains
  PointData {
    int site_id
  }
  LayerData {
    int site_id
  }
micah-prime commented 1 week ago

I like the idea, but I don't think it will work. Partly, I got one of the relationships wrong in the restructure - Point should not have a relationship with Sites. Sites (as you alluded to in one of the comments) are basically just a Pits.

So since LayerData will be the only data table that needs a Site, I think the observer relationship should be left directly to PointData and ImageData, with the slight simplification that a Site can have an Observer link now so that each Layer does not need it.

jomey commented 1 week ago

My main motivation to bring this up is that we will have a lot of redundant point_observer entries. For instance, one GPR entry has currently a campaign name (currently in the site_name column in points) e.g. 'Grand Mesa' and one observer. If we use the point_observer idea, then we create a lot of entries that are duplicate for one survey.

Guess we could extend the point_observer beyond a link table and make it a new relationship like so:

erDiagram
  PointData ||--|| CampaignObservation : "has one"
  CampaignObservation ||--|| Observer : "has one"
  CampaignObservation ||--|| Campaign : "belongs to"
  ImageData ||--|| CampaignObservation : "has one"
  Site }|--|{ SiteObserver : "has many"
  Site ||--|| Campaign : "belongs to"
  SiteObserver }|--|{ Observer : "has many"
  LayerData ||--|{ Site : "has one"

  PointData {
    int campaign_observation_id
  }
  ImageData {
    int campaign_observation_id
  }
  CampaignObservation {
    int campaign_id
    int observer_id
    String type
  }

  LayerData {
    int site_id
  }
  Site {
    string PlotID
  }
  SiteObserver {
    int site_id
    int observer_id
  }
  Observer {
    string name
  }

Then we could use the Single Table Inheritance feature to use one table for image and layer data, removing data redundancy. This would be the classic and perfect use for that SQLAlchemy feature.

The points and images are currently only single entities (person or company) in the current data. We can easily extend that to many in the future if the need arises.

jomey commented 1 week ago

Continuing my train of thought since I am in the groove. Another argument I can see for the CampaignObserveration table is that we can then remove the instrument and date from the Point and Image too.

Extending my diagram from above and zooming in on the table:

erDiagram
  PointData ||--|| CampaignObservation : "has one"
  ImageData ||--|| CampaignObservation : "has one"
  CampaignObservation ||--|| Campaign : "belongs to"
  CampaignObservation ||--|| Instrument : "has one"
  CampaignObservation ||--|| MeasurementType : "has one"
  CampaignObservation ||--|| Observer : "has one"

  PointData {
    int campaign_observation_id
  }
  ImageData {
    int campaign_observation_id
  }
  CampaignObservation {
    int campaign_id
    int observer_id
    int instrument_id
    int measurement_type_id
    Date date
    String comment
    String type
  }
micah-prime commented 2 days ago

My main motivation to bring this up is that we will have a lot of redundant point_observer entries. For instance, one GPR entry has currently a campaign name (currently in the site_name column in points) e.g. 'Grand Mesa' and one observer. If we use the point_observer idea, then we create a lot of entries that are duplicate for one survey.

Guess we could extend the point_observer beyond a link table and make it a new relationship like so:

erDiagram
  PointData ||--|| CampaignObservation : "has one"
  CampaignObservation ||--|| Observer : "has one"
  CampaignObservation ||--|| Campaign : "belongs to"
  ImageData ||--|| CampaignObservation : "has one"
  Site }|--|{ SiteObserver : "has many"
  Site ||--|| Campaign : "belongs to"
  SiteObserver }|--|{ Observer : "has many"
  LayerData ||--|{ Site : "has one"

  PointData {
    int campaign_observation_id
  }
  ImageData {
    int campaign_observation_id
  }
  CampaignObservation {
    int campaign_id
    int observer_id
    String type
  }

  LayerData {
    int site_id
  }
  Site {
    string PlotID
  }
  SiteObserver {
    int site_id
    int observer_id
  }
  Observer {
    string name
  }

Then we could use the Single Table Inheritance feature to use one table for image and layer data, removing data redundancy. This would be the classic and perfect use for that SQLAlchemy feature.

The points and images are currently only single entities (person or company) in the current data. We can easily extend that to many in the future if the need arises.

I think this example almost works, except ImageData and PointData cannot be stored in the same table because of their difference in geometry type and value type. In the PointData, the data is stored under a float value and in the ImageData the data is stored under a Raster column type that geoalchemy handles under the hood.

That said, Neither of them requires the many-to-many relationship with Observer that LayerData does, so we should be able to directly link a foreign key to the Observer table. There has to be 3 separate tables for the unique data requirements, but we can get rid of the redundant PointObservers by just skipping directly to the foreign key.

jomey commented 2 days ago

I feel we should have a call about this. My last post and ER diagram should be the reference one.

I want to keep the three separate tables (ImageData, ImageData, and LayerData). As you said, each will have their own unique data requirements. Only the associated metadata can be stored in a shared fashion.