UW-Macrostrat / macrostrat

A geological data platform for exploration, integration, and analysis
Apache License 2.0
4 stars 1 forks source link

MariaDB to PostgreSQL data variance tracker #81

Open amyfromandi opened 3 months ago

amyfromandi commented 3 months ago

Following up from Issue #60, we migrated the MariaDB data into a new PostgreSQL macrostrat_temp database. Next, we counted and compared rows between macrostrat_temp and the already existent macrostrat database in PostgreSQL. Below is the summary of variances as well as their solutions. Our goal is to retain and not delete any data across both macrostrat (PostgreSQL) and macrostrat_temp (MariaDB) db's.

1. macrostrat tables not found in macrostrat_temp

['strat_name_footprints', 'grainsize', 'pbdb_collections', 'pbdb_collections_strat_names', 'temp_rocks', 'temp_names', 'unit_lith_atts']

2. macrostrat columns not found in macrostrat_temp

table: {'column_name'}: lookup_unit_intervals: {'best_interval_id'}, strat_tree: {'child', 'parent'}, units: {'notes'}, col_areas: {'wkt'}, cols: {'poly_geom', 'wkt', 'notes'}, intervals: {'rank'}, measuremeta: {'geometry'}

amyfromandi commented 2 months ago

These are the final results of the MariaDB migration within development!

Comparing macrostrat_temp (MariaDB) to macrostrat_temp.

Checking table counts...

SUCCESS: 0 All tables in macrostrat_temp (MariaDB) exist in macrostrat_temp.

SUCCESS: 4 macrostrat (PostgreSQL) tables succesfully copied into macrostrat_temp to retain data!
['strat_name_footprints', 'grainsize', 'pbdb_collections', 'pbdb_collections_strat_names']

Checking row counts...

SUCCESS: All row counts in all tables are the same in macrostrat_temp (MariaDB) and macrostrat_temp!

Checking column counts...

SUCCESS: Columns for 5 tables successfully copied over from macrostrat (PostgreSQL) into macrostrat_temp, to retain data!
lookup_unit_intervals                 19        20       +1
col_areas                              4         5       +1
cols                                  13        15       +2
intervals                              8         9       +1
measuremeta                           14        15       +1

Comparing macrostrat_temp to macrostrat (PostgreSQL).

Checking table counts...

SUCCESS: 40 macrostrat_temp tables copied over from MariaDB that do not exist in macrostrat (PostgreSQL). This confirms data retention!
[
    'canada_lexicon_dump',
    'interval_boundaries_scratch',
    'lookup_measurements',
    'minerals',
    'offshore_fossils',
    'pbdb_liths',
    'uniquedatafiles2',
    'unit_equiv',
    'unit_dates',
    'unit_liths_atts',
    'colors',
    'col_areas_6april2016',
    'col_equiv',
    'col_notes',
    'interval_boundaries',
    'measuremeta_cols',
    'offshore_baggage',
    'offshore_baggage_units',
    'offshore_hole_ages',
    'offshore_sections',
    'offshore_sites',
    'pbdb_intervals',
    'pbdb_matches',
    'rockd_features',
    'ronov_sediment',
    'stats',
    'strat_names_lookup',
    'structures',
    'structure_atts',
    'tectonics',
    'temp_areas',
    'units_datafiles',
    'unit_boundaries_backup',
    'unit_boundaries_scratch',
    'unit_boundaries_scratch_old',
    'unit_contacts',
    'unit_measures_pbdb',
    'unit_notes',
    'unit_seq_strat',
    'unit_tectonics'
]

SUCCESS: 3 macrostrat (PostgreSQL) tables did not copy into macrostrat_temp. These tables are irrelevant and do not need to be retained.
['temp_rocks', 'temp_names', 'unit_lith_atts']

Checking row counts...

SUCCESS: Row counts are greater in macrostrat_temp rather than macrostrat (PostgreSQL) for 26 tables, indicating data retention from Mariadb!
autocomplete                       58599     58488     +111
col_areas                           5357      5351       +6
lookup_strat_names                 51229     51214      +15
lookup_units                      127229    119462    +7767
lookup_unit_attrs_api             127228    119461    +7767
strat_names                        51229     51210      +19
unit_strat_names                   28322     28316       +6
units                             127229    119508    +7721
unit_econs                          3157      3148       +9
unit_environs                     134617    125138    +9479
unit_liths                        189907    173513   +16394
intervals                           1715      1626      +89
refs                                 213       210       +3
cols                                5654      5651       +3
col_groups                           352       350       +2
col_refs                            5716      5645      +71
liths                                212       207       +5
lookup_unit_intervals             127229    119462    +7767
strat_names_places                 50321     50288      +33
sections                           12813     12713     +100
strat_tree                         29467     17519   +11948
timescales                            37        31       +6
timescales_intervals                2194      1970     +224
units_sections                    127481    119762    +7719
unit_boundaries                   135708     53799   +81909
unit_measures                     105307    105048     +259

Checking column counts...

SUCCESS: Columns for 12 are greater in macrostrat_temp rather than macrostrat (PostgreSQL). This indicates data retention!
col_areas                              5         4       +1
lookup_strat_names                    28        22       +6
strat_names                            9         5       +4
unit_strat_names                       4         3       +1
units                                 15        13       +2
unit_environs                          7         5       +2
intervals                              9         8       +1
cols                                  15        16       -1
environs                               6         5       +1
lith_atts                              5         4       +1
measures                              13        12       +1
sections                               6         2       +4
strat_tree                             6         4       +2
{'places': (0, 0)}
places                                 0         0       +0

SUCCESS: Notes column exists macrostrat but NOT in macrostrat_temp for units. macrostrat_temp.units.notes is its own table from Mariadb.

SUCCESS: Notes column exists macrostrat but NOT in macrostrat_temp for cols. macrostrat_temp.cols.notes is its own table from Mariadb.

SUCCESS: All columns in macrostrat exist in macrostrat_temp for these tables:
['col_refs', 'lookup_unit_attrs_api', 'lookup_unit_intervals', 'strat_names_meta', 'sections', 'unit_econs', 'lookup_strat_names', 'measures', 'projects', 'timescales', 'strat_tree', 'refs', 'unit_liths', 'lookup_units', 'measurements', 'units', 'autocomplete', 'col_areas', 'unit_strat_names', 'unit_environs', 'cols', 'intervals', 'lith_atts', 'timescales_intervals', 'unit_boundaries', 'econs', 'environs', 'units_sections', 'unit_measures', 'strat_names', 'lookup_unit_liths', 'liths', 'concepts_places', 'strat_names_places', 'col_groups', 'measuremeta', 'places']

check-data completed!