GSA / data.gov

Main repository for the data.gov service
https://data.gov

Perform Keyword Analysis on datasets available on catalog.data.gov #4068

Closed nickumia-reisys closed 1 year ago

nickumia-reisys commented 1 year ago

User Story

In order to identify Subject Areas, the data.gov User Engagement team wants to capture the most used keywords for datasets and the number of datasets with each keyword.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed]

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

[Notes or a checklist reflecting our understanding of the selected approach]

nickumia-reisys commented 1 year ago

Initial Analysis Complete

Key notes:

Keywords that appear more than 1000 times ```bash 'north pacific ocean': 1002 'national data buoy center': 1004 'moored buoy': 1005 'nm': 1007 'chemistry': 1008 'dart': 1009 'ndbc': 1009 'state-of-louisiana': 1010 'c-man': 1014 'ctdtmp': 1015 'school': 1018 'tennessee': 1019 'profile': 1026 'coral': 1031 'telepresence': 1032 'north-carolina': 1035 'r337': 1039 'okeanos': 1041 'stream': 1045 'usgs national water information system (nwis)': 1045 'scs': 1047 'cartography': 1051 'wetlands': 1052 'new jersey': 1053 'coastal-processes': 1054 'quality': 1056 'geographic-information-system': 1060 'protection': 1070 'michigan': 1078 'ocean waves': 1083 'doc/noaa/nos/orr': 1084 'office of response and restoration': 1085 'nasa': 1092 'montana': 1093 'date': 1103 'boundary': 1104 'discrete measurement': 1104 'wyoming': 1106 'oer': 1109 'health': 1111 'geothermal': 1112 'energy': 1113 'platform_orientation': 1113 'sea surface temperature': 1116 'philipine-islands': 1119 'distributions': 1131 'geographic cell': 1131 'shoreline mapping program': 1132 'coastal mapping program': 1133 'national shoreline': 1133 'undersea': 1135 'underwater': 1138 'geomorphology': 1141 'explorer': 1142 'depth status_flag': 1146 'eastward_sea_water_velocity status_flag': 1146 'latitude status_flag': 1146 'longitude status_flag': 1146 'northward_sea_water_velocity status_flag': 1146 'time status_flag': 1146 'restoration': 1159 'idaho': 1161 'united-states-of-america': 1173 'reef': 1177 'platform_pitch_angle': 1186 'platform_roll_angle': 1186 'water column mapping system': 1187 'wcms': 1187 'philippines': 1188 'weather': 1190 'expedition': 1195 'johnson-space-center': 1196 'printed-maps': 1199 'ames-research-center': 1201 'aircraft': 1209 'sea_water_density status_flag': 1211 'wetland': 1211 'north carolina': 1215 '2006 tiger second edition': 1216 'census data': 1216 'tiger data': 1216 'gulf-of-mexico': 1224 'sea_water_temperature status_flag': 1226 'sea_water_pressure status_flag': 1229 'orthophoto': 
1230 'delaware': 1251 'harmonic constituents': 1253 'rain fall': 1253 'water level predictions': 1254 'connecticut': 1255 'sea_water_electrical_conductivity status_flag': 1257 'arizona': 1259 'identifier': 1259 'pdf': 1269 'doqq': 1275 'channel': 1277 'tao': 1277 'floodplain mapping': 1284 'utah': 1286 'station': 1288 'jet-propulsion-laboratory': 1291 'spectral-engineering': 1304 'sea-floor-characteristics': 1308 'visibility': 1323 'sea_water_salinity': 1336 'ecosystem': 1338 'human dimensions': 1338 'mapping': 1343 'sea_water_speed': 1344 'imagery': 1358 'langley-research-center': 1359 'exploration': 1360 'natural-resources': 1361 'remote-sensing': 1361 'waves': 1369 'groundwater': 1374 'chlorophyll': 1377 'relative humidity': 1382 'soils': 1384 'acoustic scattering': 1388 'sst': 1393 'pelagic': 1399 'marine': 1400 'whcmsc': 1402 'river_discharge': 1435 '1-percent-annual-chance flood': 1439 'technology': 1452 'colorado': 1466 'atlantic-ocean': 1467 'precipitation': 1476 'seawater': 1480 'ocean chemistry': 1487 'new york': 1491 'wind_speed_of_gust': 1495 'nevada': 1499 'coast and geodetic survey': 1502 'goddard-space-flight-center': 1503 'coastal base map': 1503 'coastal zone map': 1503 'glenn-research-center': 1505 'environmental monitoring': 1510 'multibeam': 1512 'volcanic-eruption-forecasting': 1519 'stewardship': 1520 'new-york': 1523 'volcanic-ash': 1529 'tp-sheet': 1532 't-sheet': 1540 'marsh': 1555 'woods-hole-coastal-and-marine-science-center': 1560 'wisconsin': 1562 'location': 1564 'lake-county-illinois': 1564 'transportation': 1593 'species': 1603 'marine-geophysics': 1606 'wetland-ecosystems': 1612 'geophysics': 1655 'ocean currents': 1671 'georgia': 1673 'marine ecosystems': 1675 'alabama': 1693 'western pacific ocean': 1695 'census': 1708 'oxygen': 1717 'surface': 1719 'relative_humidity': 1730 'climate': 1743 'mississippi': 1744 'datum': 1756 'marine-geology': 1762 'coastal processes': 1771 'authcdfw': 1771 'air_pressure': 1807 'autonomous 
underwater vehicles': 1810 'auvs': 1810 'seaglider': 1811 'pennsylvania': 1823 'maine': 1844 'california-department-of-fish-and-wildlife': 1850 'cdfw': 1850 'dem': 1854 'currents': 1887 'noaa-navy sanctuary soundscapes monitoring project': 1888 'dod/usnavy': 1889 'sanctsound': 1889 'u.s. department of defense': 1894 'earth science oceans': 1897 'u.s. navy': 1897 'ambient noise': 1899 'passive acoustic recorder': 1899 'recorders/loggers': 1902 'hydrophones': 1903 'gis': 1917 'fixed observation stations': 1918 'gulf of mexico': 1941 'marine habitat': 1955 'land-surface': 1956 'animals/invertebrates': 1960 'ocean carbon and acidification data system (ocads) project': 1974 'cetaceans': 1975 'ocean acidification data stewardship (oads) project': 1975 'marine environment monitoring': 1977 'ocean carbon data system (ocads) project': 1978 'hydrology': 2011 'boundaries': 2050 'doc/noaa/nos/nms': 2058 'national marine sanctuaries': 2063 'california-natural-resources-agency': 2074 'county': 2088 'land surface': 2088 'water pressure': 2098 'us': 2099 'science': 2104 'mammals': 2104 'meteorology': 2108 'animals/vertebrates': 2121 'national-geospatial-data-asset': 2128 'ecosystems': 2128 'caopendata': 2145 'texas': 2160 'height': 2187 'slocum': 2192 'underwater glider': 2192 'spray': 2194 'glider': 2210 'vegetation': 2223 'water_surface_height_above_reference_datum': 2243 'doc/noaa/nmfs': 2258 'wmo': 2261 'flood hazard data': 2269 'usa': 2274 'wildlife': 2278 'north-america': 2279 'water level': 2284 'barometric pressure': 2286 'biological classification': 2304 'wind_from_direction': 2338 'lidar': 2390 'benthic': 2429 'ocean pressure': 2459 'geology': 2493 'region 04': 2495 'elevation': 2528 'oregon': 2567 'coastal barrier resources system': 2571 'cbrs': 2572 'sea_water_density': 2579 'trajectory': 2584 'massachusetts': 2590 'wind_speed': 2602 'coastal': 2631 'virginia': 2632 'coastal maps': 2632 'noaa shoreline': 2633 'coastal survey': 2634 'water oceans and coasts theme': 2639 
'wind': 2641 'north atlantic ocean': 2658 'geospatial-datasets': 2667 'coastal flooding': 2672 'great lakes': 2679 'coastal-and-marine-geology-program': 2706 'cmgp': 2719 'data': 2722 'habitat': 2746 'northward_sea_water_velocity': 2748 'national geospatial data asset': 2760 'eastward_sea_water_velocity': 2760 'hawaii': 2781 'air temperature': 2821 'sea_water_pressure': 2834 'temperature': 2890 'active': 2908 'maryland': 2915 'washington': 2928 'winds': 2976 'biota': 2998 'atlantic ocean': 2998 'topography': 3033 'density': 3061 'fish': 3152 'water column': 3185 'aquatic sciences': 3211 'salinity/density': 3233 'linearfeature': 3234 'rreservation or off-reservation trust land indicator': 3235 'maftiger feature class code': 3237 'primaryalternate code': 3237 'area hydrography identifier': 3237 '115th congressional district code': 3237 'public use microdata area codeland/water flag': 3238 'feature names': 3238 'prefix direction code': 3238 'prefix qualifier code': 3238 'prefix type code description': 3238 'suffix direction code': 3238 'suffix qualifier code': 3238 'suffix type code': 3238 'land/water flag': 3238 'fips place code for all places': 3239 'subminor civil division fips code in puerto rico': 3239 '5 digit zip code tabulation area code': 3240 'alaska native regional corporation fips code': 3240 'american indian/alaska native/native hawaiian areas census code': 3240 'census tract number': 3240 'consolidated city fips code': 3240 'county subdivision fips code': 3240 'elementary school district local education agency code': 3240 'legislative session year': 3240 'metropolitan statistical area/consolidated metropolitan statistical area fips code': 3240 'new england county metropolitan area fips code': 3240 'primary metropolitan statistical area fips code': 3240 'secondary school district local education agency code': 3240 'state legislative district lower chamber code': 3240 'state legislative district upper chamber code': 3240 'tabulation block number': 3240 
'tribal subdivision code': 3240 'unified school district local education agency code': 3240 'urban area code': 3240 'urban growth area code': 3240 'imagerybasemapsearthcover': 3243 'railways': 3246 'sea_water_practical_salinity': 3258 'permanent face id': 3299 'air_temperature': 3306 'aquatic ecosystems': 3322 'feature': 3326 'linear': 3333 'ocean acoustics': 3340 'block group': 3409 'doc/noaa/nos/ngs': 3416 'national geodetic survey': 3416 'riverine flooding': 3431 'sea_water_electrical_conductivity': 3623 'dfirm database': 3655 'floodway': 3671 'base flood elevation': 3683 'fema flood hazard zone': 3690 'nfip': 3696 'sfha': 3706 'sea': 3707 'flood insurance rate map': 3712 'special flood hazard area': 3712 'louisiana': 3713 'firm': 3727 'fisheries': 3855 'number': 3877 'ocean temperature': 3972 'inlandwaters': 4183 'name': 4189 'vertical location': 4231 'shoreline': 4298 'national marine fisheries service': 4303 'new mexico': 4381 'florida': 4408 'u-s-geological-survey': 4441 'pacific ocean': 4647 'u.s.': 4811 'state or equivalent entity': 4892 'ngda': 4915 'usgs': 4968 'california': 5334 'polygon': 5613 'conductivity': 5613 'water': 5666 'water temperature': 5692 'dfirm': 5897 'digital flood insurance rate map': 5918 'depth': 5990 'united-states': 6222 'salinity': 6269 'topological faces': 6544 'linear feature': 6564 'sea_water_temperature': 6629 'global positioning system/inertial measurement unit': 6692 'gps/imu': 6692 'msbs': 6693 'multibeam swath bathymetry system': 6693 'biology': 6700 'positioning/navigation': 6726 'gps receivers': 6749 'multibeam mapping system': 6833 'mbes': 6983 'gps': 7058 'geoscientificinformation': 7078 'altitude': 7211 'passive remote sensing': 7494 'earth remote sensing instruments': 7756 'earth-science': 8162 'united states': 8176 'national ocean service': 8268 'completed': 8845 'alaska': 9616 '5-digit zip code': 9692 'from house number': 9692 'side indicator flag': 9692 'to house number': 9692 'zip +4 code': 9692 'table': 9779 
'road feature': 9840 'roads': 10023 'street centerline': 10186 'address range': 10192 'atmosphere': 10228 'sound navigation and ranging': 10904 'sonar': 11184 'biosphere': 11560 'profilers/sounders': 12476 'in situ/laboratory instruments': 12577 'permanent edge id': 12932 'environment': 13505 'time': 13857 'oceanography': 13988 'longitude': 14018 'latitude': 14020 'acoustic sounders': 14091 'hydrography': 14379 'county gnis code': 16161 'state gnis code': 16222 'earth-science-oceans-marine-sediments-sediment-composition': 16335 'hydrographic-surveys-for-selected-locations-within-the-united-states-hydro_bathy_2006': 16415 'earth-remote-sensing-instruments-passive-remote-sensing-positioning-navigation-gps-gps-imu-glob': 16423 'earth-remote-sensing-instruments-passive-remote-sensing-positioning-navigation-gps-gps-receiver': 16423 'in-situ-laboratory-instruments-profilers-sounders-acoustic-sounders': 16423 'in-situ-laboratory-instruments-profilers-sounders-acoustic-sounders-mbes-multibeam-mapping-syst': 16423 'in-situ-laboratory-instruments-profilers-sounders-acoustic-sounders-msbs-multibeam-swath-bathym': 16423 'in-situ-laboratory-instruments-profilers-sounders-acoustic-sounders-sonar-sound-navigation-and-': 16423 'in-situ-ocean-based-platforms-ships': 16476 'earth-science-oceans-bathymetry-seafloor-topography-water-depth': 16661 'earth-science-oceans-bathymetry-seafloor-topography-seafloor-topography': 16673 'doc/noaa/nesdis/ngdc': 16865 'national geophysical data center': 16865 'earth-science-oceans-bathymetry-seafloor-topography-bathymetry': 16983 'doc-noaa-nesdis-ngdc-national-geophysical-data-center': 17136 'u-s-department-of-commerce': 17683 'hydrographic surveys for selected locations within the united states (hydro_bathy_2006)': 18318 'sediment composition': 18332 'marine sediments': 18345 'county fips code': 19411 'state fips code': 19471 'ships': 21804 'water depth': 22292 'doc/noaa/nesdis/ncei': 23135 'national centers for environmental information': 23135 
'seafloor topography': 23601 'in situ ocean-based platforms': 23985 'bathymetry/seafloor topography': 24168 'continent': 24411 'north america': 24475 'united states of america': 25069 'bathymetry': 26113 'u.s. department of commerce': 31085 'ocean': 33461 'county or equivalent entity': 35946 'oceans': 40376 'nesdis': 40599 'earth science': 40754 'noaa': 51569 ```
nickumia-reisys commented 1 year ago

Scripts used:

FuhuXia commented 1 year ago

This API call gives all tags (-g stops curl from treating the square brackets as a glob range):

curl -g "https://catalog.data.gov/api/action/package_search?facet.field=[%22tags%22]&facet.limit=-1" | jq '.result.facets.tags'
jbrown-xentity commented 1 year ago

And the following URL gives you every tag used on at least 1000 datasets: https://catalog.data.gov/api/action/package_search?facet.field=[%22tags%22]&facet.limit=-1&facet.mincount=1000&rows=0
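For anyone who would rather run this from a script than from curl and jq, here is a minimal Python sketch of the same request. The helper names (`build_tags_url`, `tag_counts`, `fetch_tag_counts`) are hypothetical, and it assumes the response has the shape the jq filter implies, with `result.facets.tags` mapping each tag to its dataset count.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://catalog.data.gov/api/action/package_search"


def build_tags_url(mincount=1000):
    """Build the facet query quoted in the comments (helper name is hypothetical)."""
    params = {
        "facet.field": '["tags"]',   # urlencode percent-encodes the brackets and quotes
        "facet.limit": -1,           # no cap on the number of facet values
        "facet.mincount": mincount,  # only tags on at least this many datasets
        "rows": 0,                   # skip the dataset records themselves
    }
    return BASE + "?" + urlencode(params)


def tag_counts(response):
    """Return (tag, count) pairs, least used first, from a parsed API response."""
    tags = response["result"]["facets"]["tags"]
    return sorted(tags.items(), key=lambda item: item[1])


def fetch_tag_counts(mincount=1000):
    """Call the live API; needs network access."""
    with urlopen(build_tags_url(mincount)) as resp:
        return tag_counts(json.load(resp))
```

Sorting by ascending count matches the order of the keyword list earlier in the thread.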

nickumia-reisys commented 1 year ago

First pass @ grouping keywords

Since we don't have a word model specifically trained for data.gov/open data, I used the off-the-shelf WordNet to find the shortest distance (or similarity) between words. More similar words would make sense to group together. The idea was to define similarity as our parameter and see what groups emerge from the data. This is the opposite of the other approach, which picks N groups up front and then forces each word into one of them.

To help break down the complex keywords into simpler words that exist in WordNet, the following preprocessing was done:

It should be noted that this inherently caused some loss of contextual meaning. "North pacific ocean" is a specific area of the Pacific Ocean that might be more relevant if we cared about making sub-categories in our environmental/oceanographic/weather group; however, this granularity was not as important to capture. A keyword that loses considerably more context is "North Carolina", which is the name of a state. Since this analysis was also going to be filtered through human eyes, I thought this was an acceptable loss. We'd be able to understand that Carolina may or may not belong in a particular group. Another example of context loss is 'lake-county-illinois': 1564. Lake County, Illinois does not mean there's data about lakes or counties. I'm not sure if this is an acceptable error; however, there is clearly location-based data, and that context will be accounted for by us as humans as well.
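A rough sketch of the kind of breakdown discussed here, assuming it amounts to lowercasing and splitting on whitespace, hyphens, underscores and slashes. The function name and the split rule are assumptions for illustration, not the script that was actually used.

```python
import re


def simple_words(keyword):
    """Split a compound keyword into lowercase single words (assumed rule)."""
    return [w for w in re.split(r"[\s\-_/]+", keyword.lower()) if w]
```

With this rule, `simple_words("lake-county-illinois")` yields `["lake", "county", "illinois"]`, which is exactly the context loss described: the place name turns into three unrelated common words.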

The first pass used the ideas mentioned above, scripted here, to analyze the top 1000 most frequent keywords. Using a distance of 5 between words, the groups in preliminary_lt_5 were created. Note that many keywords were agency-specific acronyms or abbreviations based on developments since WordNet, so they could not be placed.
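To make "let the groups fall out of a distance threshold" concrete, here is a minimal sketch. It assumes single-linkage merging (any pair closer than the threshold joins two groups) and a caller-supplied `distance` function that returns `None` for words it cannot place, such as acronyms missing from WordNet. Both are assumptions; the actual script may merge differently.

```python
from itertools import combinations


def group_by_distance(words, distance, max_distance=5):
    """Merge words into groups whenever a pair is closer than max_distance."""
    parent = {w: w for w in words}

    def find(w):
        # Follow parent links to the group representative (with path halving).
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in combinations(words, 2):
        d = distance(a, b)
        if d is not None and d < max_distance:
            parent[find(a)] = find(b)  # join the two groups

    groups = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    return list(groups.values())
```

The number of groups is an output here, not an input: tightening `max_distance` produces more, smaller groups, and words with no usable distance stay in groups of one.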

preliminary_lt_5.txt

preliminary_lt_8.txt

I will run a few more permutations of this algorithm, but I don't think it's going to give as much insight as we'd hope.

Next steps:

I'm going to train a basic word model based on the catalog. The premise is that tags on a single dataset are, by construction, similar. The more datasets two tags appear on together, the more similar the tags are. In this way, the word model does not need to know definitions of words or the relationships between them. Words that were excluded in the previous analysis will be included here. Also, keywords do not need to be broken down. If complex tags were created for a reason, they can be preserved.
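The co-occurrence premise can be sketched in a few lines. This is a minimal illustration, assuming each dataset is available as a list of its tags; the function name is hypothetical and this is not the gist's implementation.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(datasets):
    """Count, for every pair of tags, how many datasets carry both."""
    pairs = Counter()
    for tags in datasets:
        unique = sorted(set(tags))  # ignore repeats and fix the order within a pair
        pairs.update(combinations(unique, 2))
    return pairs
```

Tags are kept whole, so a compound tag such as 'tiger data' is counted as one unit, and a pair that never shares a dataset simply has a count of zero.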

The only meddling that I will do is weed out the nonsense tags that even we, as humans, would not be able to make sense of. This would include tags such as !c07, (58aa9402 leg 2) and 01asr02.

nickumia-reisys commented 1 year ago

Another note about WordNet: It would have been more useful if we could have isolated the exact meaning of each word and pulled that synset from WordNet, as it would have given a more meaningful distance. Since I didn't know which sense each word was referring to, I had no choice but to average over all senses of the word, which muddled the results too much.
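A toy illustration of why averaging over senses muddles things. The sense inventory and the distances below are made up for this example and are not taken from WordNet; only the averaging step reflects what is described above.

```python
from itertools import product
from statistics import mean

# Made-up sense inventory and distances, for illustration only.
SENSES = {
    "channel": ["channel.waterway", "channel.broadcast"],
    "stream": ["stream.waterway"],
}
SENSE_DISTANCE = {
    ("channel.waterway", "stream.waterway"): 2,
    ("channel.broadcast", "stream.waterway"): 12,
}


def averaged_distance(word_a, word_b):
    """Mean distance over every pairing of senses (intended sense unknown)."""
    pairs = product(SENSES[word_a], SENSES[word_b])
    return mean(SENSE_DISTANCE[p] for p in pairs)


def closest_sense_distance(word_a, word_b):
    """Distance between the closest pair of senses, for comparison."""
    pairs = product(SENSES[word_a], SENSES[word_b])
    return min(SENSE_DISTANCE[p] for p in pairs)
```

In this toy data the waterway senses are 2 apart, but the average over all senses is 7, so a threshold of 5 would keep 'channel' and 'stream' in separate groups even though the intended senses are close.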

nickumia commented 1 year ago

I'm very skeptical of the approach I'm about to document. I may have made errors along the way that will greatly skew the usefulness or accuracy of the results (and I hope the team will review this and let me know if there are any gaps or errors). All of the code is in a gist: https://gist.github.com/nickumia/4f034ae951349a9dea5fda999f935405

Key Motivation Points

Specific Implementation Details

Results

This will take some time to fully complete. Also... we might want to limit the results in some way because it will group 93691 keywords into N groups. 93K words are a lot to parse. As an initial pass, I ran it on the keywords that appear 1000 times or more (483 words). This created 96 groups. 9 of them have more than 1 word in them and would therefore be the most helpful in coming up with a category.

output_top_483.txt

I'm going to optimize three parameters to determine the most diverse distribution of words (or the most groups with more than one word). These three parameters are: (1) the relatedness tolerance between words in a group, (2) when to create a new group, (3) how many words to include in the analysis.
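One way such a search could look, as a sketch only. It sweeps two of the three parameters (the relatedness tolerance and the number of words) and scores each setting by the number of groups with more than one word. The rule for when to create a new group is fixed here to "when no existing group has all members related enough", which is an assumption since the thread does not spell it out. All function names are hypothetical.

```python
def greedy_groups(words, relatedness, tolerance):
    """Put each word in the first group whose members are all related enough,
    otherwise start a new group (assumed rule)."""
    groups = []
    for w in words:
        for g in groups:
            if all(relatedness(w, m) >= tolerance for m in g):
                g.append(w)
                break
        else:
            groups.append([w])
    return groups


def multi_word_groups(groups):
    """Number of groups that contain more than one word."""
    return sum(1 for g in groups if len(g) > 1)


def sweep(ranked_words, relatedness, tolerances, sizes):
    """Score every (tolerance, top_n) pair by its number of multi-word groups."""
    scores = {}
    for top_n in sizes:
        for tol in tolerances:
            groups = greedy_groups(ranked_words[:top_n], relatedness, tol)
            scores[(tol, top_n)] = multi_word_groups(groups)
    return scores
```

`ranked_words` is assumed to be sorted from most to least used, so `top_n` corresponds to cutting the list at a frequency threshold. The greedy result depends on word order, which is one reason to treat the scores as a rough guide.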

nickumia-reisys commented 1 year ago

While these images aren't exactly readable, I thought it would be interesting for people to look at.

The first image is a graph of 2000 random connections between keywords on datasets. It has much clearer cluster definitions (but it's not meaningful because the number of times those keywords appear on catalog is not meaningful).
2000_random_words

The second image does not have as many well-defined clusters, but it is a graph of the connections between the 483 keywords that appear at least 1000 times on catalog, keeping only the connections that appear at least 100 times. This means that the connected keywords are used together on at least 100 datasets.
graph_1000_100

I want to get larger graphs, but the graphs would be even less readable and the time to compute would be super long.

The total number of connections that I have tracked: 500473

The total number of connections between the top 483 keywords and other words: 238162

The total number of connections in the "graph_1000_100.png": 1693
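A sketch of how three counts like these could be derived from a table of pair counts. It assumes a "connection" is a distinct pair of keywords, that the middle figure counts pairs with at least one top keyword, and that the graph keeps pairs where both keywords are in the top set and share at least 100 datasets. Those readings are assumptions, and the function name is hypothetical.

```python
def connection_stats(pair_counts, top_keywords, min_shared=100):
    """Summarize keyword connections and pick the edges worth drawing."""
    top = set(top_keywords)
    # Pairs where at least one keyword is in the top set.
    touching_top = {p: n for p, n in pair_counts.items() if top & set(p)}
    # Pairs where both keywords are in the top set and co-occur often enough.
    graph_edges = {
        p: n for p, n in touching_top.items()
        if set(p) <= top and n >= min_shared
    }
    stats = {
        "tracked": len(pair_counts),
        "touching_top": len(touching_top),
        "graph_edges": len(graph_edges),
    }
    return stats, graph_edges
```

The returned `graph_edges` mapping can be handed to any graph library as a weighted edge list.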

My next logical step would be to do text summarization of these groups of words (if we want less work as humans). Or just get the list of words in a readable format, so that we can parse it as humans.

nickumia-reisys commented 1 year ago

References for the graph visualization above:

Other references for previous work:

nickumia-reisys commented 1 year ago

As a summary of where this leaves us:

The job of grouping datasets into logical groups that improve discoverability and accessibility is not a simple one. I have explored two paths in the above analyses: (1) using an off-the-shelf word model to process the tags and perform a similarity comparison, (2) building a custom word model to process the tags and highlight relational similarities to group tags. Both algorithms used tags as the driving point to create groups. As @jbrown-xentity noted, tags from an agency are typically all created by the same publisher. From this perspective, all of the tags from one publisher might have a biased similarity towards the publisher and not the dataset itself. I don't think this is entirely true, but it is a valid concern nonetheless.

A proper analysis would take all of the non-standard text from a dataset (title, description, tags and any unique extras fields) to build the model, which would give a more complete picture of the datasets. Even with this, the descriptions might also suffer from writer bias, so this is not a foolproof method either. The focus on tags in this ticket was: (1) to fine-tune scope and (2) to focus on the algorithm design via data discovery. I think, regardless of writer biases, we only have the data that we have, so if writer bias is an inhibiting factor, we need to raise that to the Agencies and make sure they intended for the datasets to be worded the way that they are and that their wording is accurate and consistent with the data that it's describing. This collaboration is not easy, but it is a necessary part if we want to remove biases from our analysis.

Many points have been mentioned in the previous comments. As the key takeaway points:

Appendix A. Word Similarity

Appendix B. Word Taxonomies

The federal government itself is a taxonomy. There is the Executive Branch. Within the Executive Branch, there are a host of agencies, such as the Department of Defense. The Department of Defense then has agencies, such as the Department of the Navy. The Department of the Navy then has sub-agencies, such as NAVAIR, NAVSEA, NAVSUB, et cetera. Each of those then has divisions such as Aircraft Division or Weapons Division. While the taxonomy that exists by design of the government is helpful, it is not complete or otherwise self-describing enough to use as the sole basis of our analysis. Each agency would have its own definition for words like health, finance, education and transportation.

Creating a universal taxonomy that can be applied to such a wide range of data types and sources may not be possible; however, a system that aggregates all of the different taxonomies from each agency might be possible. Either way, we need to build a reference to understand relationships between data.

hkdctol commented 1 year ago

thanks for doing this @nickumia-reisys this is good to have for future discussion