Comments & questions on Luquillo CZO database used by ODM2-Admin

emiliom commented 8 years ago

@miguelcleon, I'm dumping a long list of my comments and questions in this single issue. I had already typed this in a local markdown document, so it's easiest for me to just dump it all in one place. We can split it off later. I'm pinging @lsetiawan (Don), so he's in the discussion.

When trying to restore the database backup file (using PgAdmin) on my laptop (Ubuntu 14.04, Postgresql 9.3, PostGIS 2.1), the database was installed with a couple of severe gaps. The problem boiled down to PostGIS failing to install, which in turn prevented the samplingfeatures table from installing.

PostGIS installation

To get around the PostGIS installation failure I first created a blank database using a template database that already has my standard PostGIS installation. Then I restored (using PgAdmin) from your backup file into that database. The CREATE EXTENSION PostGIS command in your backup probably checks if the extension exists already, and if it does, it doesn't try to install it.
Don and Miguel were able to install the dabase from Miguel's DB backup. In Don's case, PostGIS 2.2 was installed; in Miguel's case, 2.1.x
Unnecessary, postgis-related extensions are installed: topology and tiger extensions are in the backup database. They are not needed at all. I think they pollute the database instance and add to the complexity of installations.
ME: think about incorporating my PostGIS / samplingfeatures "additional" commands (index, dim constrain, etc)
database structure
A custom schema has been created, odm2extra. It holds 3 custom tables: Measurementresultvaluefile, featureactionsNames and processdataloggerfile. Note that the names of two of these are defined to be case-sensitive, which is inconsistent with general ODM2 PostgreSQL conventions
The custom table Measurementresultvaluefile is present in two schemas, odm2 and odm2extra. That doesn't seem right.
A custom function was created in the odm2 schema: MeasurementResultValsToResultsCountvalue
Django user and authorization management tables
- A number of tables have been added for this purpose. Per Don, they're probably all created by Django. The tables are: auth_group, auth_group_permissions, auth_permission, auth_user, auth_user_groups, auth_user_user_permissions, django_admin_log, django_content_type, django_migrations, django_session
- Ideally they should be placed in a separate schema, not under the default odm2 schema. I think it'd be best to have a dedicate schema such as odm2admin_auth or django_auth. But they could also go under odm2extra
  table usage (records, implicit conventions, issues, etc)
  
  samplingfeatures and related Sampling Features entities
Problems with the geospatial encoding (featuregeometry) for many of the records:
- No lat-lon: 136 records have (0,0) for the lat-lon encoding
- (Update: this has been fixed) Swapped lat-lon: 2 records have lat-lon values that are obviously wrong and simply swapped
- query to select these records: sql select st_astext(featuregeometry), * from odm2.samplingfeatures where st_x(featuregeometry) > -1 order by samplingfeatureid
Some sampling features seem to be generic classifiers rather than actual features with geospatial information; eg, "Tabonuco forest"
Many feature relationships are used (relatedfeatures). It'll be helpful to learn more about what these are. An interesting example is from the relationship of samplingfeatureid 768, 769, 656.
sites and specimens are not used (no records)
result types
based on record counts in <resulttype>results tables, only two result types are being used: measurement and profile. But a count of results.resulttypecv entries yields two different result types: "Measurement" (12,192) and "Time series coverage" (199).
<resulttype>results table record counts: measurementresult (205) and profileresult (12,113). No records in timeseriesresult
These inconsistencies should create problems, in principle
actions
actions are probably "coarse" and aggregated, b/c there are only 29 actions records. Here's their distribution by actiontypecv:

actiontypecv	count
Data retrieval	1
Equipment maintenance	1
Estimation	1
Instrument calibration	1
Observation	12
Specimen analysis	13

relatedactions is not used
citations
In citations table (has 159 records), the vast majority of entries include the authors and year in the title column; that column is intended to be limited to the title per se. But note that authorlists is also used extensively (has 278 records), so the author information may already be properly captured, and authors in titles could be removed.

lsetiawan commented 8 years ago

@emiliom One thing I noticed about the topology and tiger extensions is that they are part of the blank ODM2 PostgreSQL

emiliom commented 8 years ago

One thing I noticed about the topology and tiger extensions is that they are part of the blank ODM2 PostgreSQL

That's an important observation, @lsetiawan. I need to look into that and work on eliminating it there.

miguelcleon commented 8 years ago

yup, I have my time series data in the measurement results and measurement results values tables instead of the time series results and times series result values tables. That is a problem I'll need to address and is systematic issue with ODM2 Admin.

miguelcleon commented 8 years ago

sampling features 767,768, 769, 648,649,650 are 'part of' 656. They are all separate soil pits dug at a single larger site named TABOX-12.

Sounds like maybe your suggesting using the sites table for something like this?

miguelcleon commented 8 years ago

I've fixed the two records with the swapped lat-long, thank you.

emiliom commented 8 years ago

sampling features 767,768, 769, 648,649,650 are 'part of' 656. They are all separate soil pits dug at a single larger site named TABOX-12.

Thanks for that clarification. After exploring that particular relationship, I have one observation: relatedfeatures has many duplicated records. For example, the "Is part of" relationship between sampling features 767 and relatedfeatureid 656 occurs 6 times. The records are identical, except of course for the sequential, unique primary key, relationid. I'm guessing that's a messiness that was introduced inadvertently and should be cleaned up.

emiliom commented 8 years ago

This isn't a comment or question. I'm putting here this summary of samplingfeaturetypecv usage in the samplingfeatures table for future reference.

samplingfeaturetypecv	count
Excavation	324
Field area	13
Landscape classification	8
Observation well	3
Site	119
Stream gage	2
Transect	5

The table was generated using this query:

select samplingfeaturetypecv, count(*)
from odm2.samplingfeatures
group by samplingfeaturetypecv order by samplingfeaturetypecv

Update: I guess I do have one comment. I've compared these samplingfeaturetypecv occurrences to the corresponding ODM2 controlled vocabulary entries. The only that doesn't match exactly is "Landscape classification"; the official ODM2 entry is "Ecological land classification". I can see why your choice seems more natural. And it's possible (I don't know) that the ODM2 CV actually was edited, and the "Ecological land classification" entry was previously called "Landscape classification". But in general it's helpful to stick to strict compliance with ODM CV's, unless you have strong reasons not to. Also, the use of this type of Sampling Features could probably be enhanced, and we (ODM2 gang) could benefit from a discussion about a common implementation need like yours.

For reference, these are the Landscape Classification entries in use in the database: Colorado Forest, Palm Forest, Tabonuco Forest, Volcaniclastic, Quartz-diorite, Ridge -Topo Postion, Slope -Topo Postion, Valley -Topo Postion.

emiliom commented 8 years ago

Another comment about this:

sampling features 767,768, 769, 648,649,650 are 'part of' 656. They are all separate soil pits dug at a single larger site named TABOX-12.

All those sampling features are classified as "Excavation". But TABOX-12 (656) is in fact something conceptually different. It's not an Excavation per se, but rather a "Site" or "Field area" where all those individual excavations (soil pits) are found. At least that's what I'm assuming.

miguelcleon commented 8 years ago

Yeah, I haven't stayed consistent with classifying the sampling features and I'll need to double check them. Thanks for all your great feedback Emilio, this is really helpful! I'll plan to update you with a new db dump in probably about 2 weeks time does that sound good?

I'll also work on some application updates also particularly for the needed measurement result to time series result conversion that is needed.

emiliom commented 8 years ago

Thanks for all your great feedback Emilio, this is really helpful!

I'm very glad it's helpful to you.

I'll plan to update you with a new db dump in probably about 2 weeks time does that sound good?

That time frame sounds good. We still have tons to learn, both about your ODM2 implementation and ODM2-Admin per se.

miguelcleon commented 8 years ago

I thought I'd address one more of the issues you brought up for now. the table odm2.Measurementresultvaluefile can be dropped I had initially added these extra tables in odm2 schema and knowing that wasn't the best way to do things, I since moved them into odm2extra schema. I'm only using the odm2extra.Measurementresultvaluefile table now.

This should also probably be done for the django generated tables you mentioned, this may or may not be doable though as they are core aspects of the functioning of django and using multiple schema's the way I have with odm2extra doesn't seem well supported, so I have been hesitant to try to change those tables.

I should also move the custom function MeasurementResultValsToResultsCountvalue.

miguelcleon commented 8 years ago

All those sampling features are classified as "Excavation". But TABOX-12 (656) is in fact something conceptually different. It's not an Excavation per se, but rather a "Site" or "Field area" where all those individual excavations (soil pits) are found. At least that's what I'm assuming.

I changed sampling features with samplingfeatureid's 654 to 677 sampling feature type to 'sites'

emiliom commented 8 years ago

Thanks for the info on odm2.Measurementresultvaluefile vs odm2extra. I'm glad it was just a leftover table, no longer used.

Good to know about the django tables, too. If you'd like, maybe in a couple of weeks (when it's a good time for you) we can bring in Jeff Horsburgh and his team to see if they can help with Django insight. They're using Django heavily in web applications for ODM2, and they probably use credentials too. Maybe they've dealt with this schema customization issue.

lsetiawan commented 8 years ago

@miguelcleon One question about the tables in odm2extra schema. Are they created by django after defining them in models? I tried doing syncdb with django=1.6.5 and the tables get created, but then using migrate with django=1.9.x doesn't create those tables, have you run into those issues? or did you create the tables separately with SQL statements? Thanks.

miguelcleon commented 8 years ago

Yeah that is the iffy support for multiple schemas right there. I just used some extra side SQL to create the tables for django 1.9

lsetiawan commented 8 years ago

Thanks @miguelcleon. While learning django. I've gained more understanding about the models and settings for ODM2-Admin. I've tweaked the model.py to hard coded odm2 schema for each db_table variables in class Meta similar to what you did for the odm2extra.

I then created another schema called admin. I also set admin schema as the default for django. When I ran python manage.py migrate it seems like the django core tables are created under the admin schema, leaving the odm2 and odm2extra schemas alone.

So from this, I think that in order to integrate multiple schemas, one have to designate the schema in db_table variables. I tried running django 1.9.x and it seems to work, though, this method doesn't seem to work with Django 1.6.5. Thanks.

miguelcleon commented 8 years ago

Cool! Can you do a pull request into the Django1.9.7Support branch?

lsetiawan commented 8 years ago

@miguelcleon let me clean up my repo git stuff. I didn't notice the second branch. Please ignore the current pull request. Thanks.

emiliom commented 8 years ago

Regarding the schema issues:Thanks, @lsetiawan! That's great.

@miguelcleon, on your README page it says "support tested for django 1.6.5 and 1.9.x". I have no direct experience with django development. But given that so far you're the only user of ODM2-Admin, wouldn't it be easier if you limited backwards compatibility and supported only Django 1.9.x? It sounds like that would minimize development headaches. Just my 1 cent.

miguelcleon commented 8 years ago

Yeah I'm hoping to drop support for 1.6.x soon, can't upgrade a server but we are transitioning to a different one which will be on django 1.9.x

miguelcleon commented 7 years ago

Hi @emiliom the last commit I pushed today should address the issue

result types

based on record counts in results tables, only two result types are being used: >measurement and profile. But a count of results.resulttypecv entries yields two different result types: >"Measurement" (12,192) and "Time series coverage" (199).
results table record counts: measurementresult (205) and profileresult (12,113). No >records in timeseriesresult These inconsistencies **should** create problems, in principle

I converted my measurement result, and measurement result values which were really time series into time series result and time series result values. I did this for both the live LCZO and TRACE database. I'm creating back ups now and then will apply these changes on the CUAHSI instances of ODM2 Admin.

emiliom commented 7 years ago

Thanks for the heads-up, @miguelcleon. I'm still trying to digest the email exchange about time-series-like measurements, and think about the pros and cons of measurements vs time series result types in this context. I'll try to have some input tomorrow.

Regardless, consistency is good, and your change does bring about consistency.

ODM2 / CZ-Manager