Key notes:
This API call gives all tags:
curl "https://catalog.data.gov/api/action/package_search?facet.field=[%22tags%22]&facet.limit=-1" | jq '.result.facets.tags'
And the following URL gives you every tag with at least 1000 datasets: https://catalog.data.gov/api/action/package_search?facet.field=[%22tags%22]&facet.limit=-1&facet.mincount=1000&rows=0
Since we don't have a word model specifically trained for data.gov/open data, I used the off-the-shelf WordNet to find the shortest distance (or similarity) between words. More similar words would make sense to group together. The idea was to define similarity as our parameter and see what groups appear from the data. This is contrary to the other approach of selecting N groups up front and then forcing words into one of the N groups.
To help break down the complex keywords into simpler words that exist in WordNet, the following preprocessing was done:

'north pacific ocean' to 'north', 'pacific', 'ocean'
'doc/noaa/nesdis/ncei' to 'doc', 'noaa', 'nesdis', 'ncei'
'in-situ-laboratory-instruments-profilers-sounders-acoustic-sounders-mbes-multibeam-mapping-syst' to 'in', 'situ', 'laboratory', 'instruments', 'profilers' ...
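A minimal sketch of that splitting step, assuming a hypothetical split_keyword helper (the actual script may differ):

```python
import re

def split_keyword(keyword):
    # Break a complex tag on spaces, slashes, hyphens and underscores
    # into simple lowercase words more likely to exist in WordNet.
    return [w for w in re.split(r'[ /\-_]+', keyword.lower()) if w]

split_keyword('doc/noaa/nesdis/ncei')
# ['doc', 'noaa', 'nesdis', 'ncei']
```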
It should be noted that this inherently caused some loss of contextual meaning. "North Pacific Ocean" is a specific area of the Pacific Ocean that might be more relevant if we cared about making sub-categories in our environmental/oceanographic/weather group; however, this granularity was not as important to capture. A word that loses considerably more context is "North Carolina", which is the name of a state. Since this analysis was also going to be filtered through human eyes, I thought this was an acceptable loss; we'd be able to understand that Carolina may or may not belong in a particular group. Another example of context loss is 'lake-county-illinois' (1564 datasets). Very sensibly, Lake County, Illinois does not mean there's data about lakes or counties. I'm not sure if this is an acceptable error; however, there is clearly location-based data, and that context will be accounted for by us as humans as well.
The first pass used the ideas mentioned above, scripted here, to analyze the top 1000 most frequent keywords. Using a distance of 5 between words, the groups in preliminary_lt_5 were created. Note that many words were custom acronyms for agencies or specific abbreviations coined after WordNet was built, so their relevance could not be placed.
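A rough sketch of the distance-based grouping, assuming hypothetical word_distance and group_by_distance helpers (the actual script may differ):

```python
from itertools import product
from nltk.corpus import wordnet

def word_distance(w1, w2):
    # Average WordNet shortest-path distance over all sense pairs.
    # Returns None when either word has no synsets (e.g. agency acronyms).
    s1, s2 = wordnet.synsets(w1), wordnet.synsets(w2)
    if not s1 or not s2:
        return None
    dists = [a.shortest_path_distance(b) for a, b in product(s1, s2)]
    dists = [d for d in dists if d is not None]
    return sum(dists) / len(dists) if dists else None

def group_by_distance(words, max_dist=5):
    # Greedy grouping: add a word to the first group whose seed word
    # is within max_dist; otherwise start a new group.
    groups = []
    for w in words:
        for g in groups:
            d = word_distance(g[0], w)
            if d is not None and d <= max_dist:
                g.append(w)
                break
        else:
            groups.append([w])
    return groups
```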
I will run a few more permutations of this algorithm, but I don't think it's going to give as much insight as we'd hope.
I'm going to train a basic word model based on the catalog. The premise is that tags on a single dataset are, by implementation, similar: the more datasets two tags appear on together, the more similar the tags are. In this way, the word model does not need to know definitions of words or the relationships between them. Words that were excluded in the previous analysis will be included here. Also, keywords do not need to be broken down; if complex tags were created for a reason, they can be preserved.
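As a sketch of that premise, counting pairwise tag co-occurrence might look like this, assuming a dataset_tags dictionary like the one shown further below (the gist is the authoritative version):

```python
from itertools import combinations

def cooccurrence_counts(dataset_tags):
    # dataset_tags: {dataset id: [tag, tag, ...]}
    # Count how many datasets each pair of tags appears on together.
    counts = {}
    for tags in dataset_tags.values():
        for a, b in combinations(sorted(set(tags)), 2):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts
```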
The only meddling that I will do is to weed out the nonsense tags that even we, as humans, would not be able to make sense of. This would include tags such as !c07, (58aa9402 leg 2) and 01asr02.
Another note about WordNet: it would have been more useful if we could have isolated the exact meaning of each word and pulled that synset from WordNet, as that would have given a more meaningful distance. Since I didn't know which sense each word was referring to, I had no choice but to average over all senses of the word, which muddled the results too much.
I'm very skeptical of the approach I'm about to document. I may have made errors along the way that will greatly skew the usefulness or accuracy of the results (and I hope the team will review this and let me know if there are any gaps or errors). All of the code is in a gist: https://gist.github.com/nickumia/4f034ae951349a9dea5fda999f935405
The keyword connections form a graph (equivalently, an N x N triangular matrix). For a pair of keyword a and keyword b, the only useful information is that there is a single connection between keyword a and keyword b with a weight of 2. It doesn't matter if the connections were discovered as keyword a -> keyword b : 1 and keyword b -> keyword a : 1. Since an N x N matrix is memory-inefficient for sparse matrices, the graph can be represented by a python dictionary with keys of the N x N indices and values of the matrix value at the referenced index.
```
# N x N matrix form
[[ 0 4 2],
 [ 0 0 7],
 [ 0 0 0]]

# Dictionary form
{ '0,1': 4,
  '0,2': 2,
  '1,2': 7}
```
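A sketch of how both directions collapse into one upper-triangle entry, using a hypothetical add_connection helper (not necessarily the gist's exact code):

```python
def add_connection(graph, i, j, weight=1):
    # Keep only the upper triangle: order the indices so that
    # (a, b) and (b, a) update the same dictionary entry.
    if i == j:
        return
    key = f"{min(i, j)},{max(i, j)}"
    graph[key] = graph.get(key, 0) + weight

graph = {}
add_connection(graph, 0, 1)
add_connection(graph, 1, 0)
# graph == {'0,1': 2} -- a single undirected edge with weight 2
```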
catalog.data.gov was crawled to generate a dictionary of each dataset and all of the keywords associated with that dataset. This turned out to be 93M. Note: there were 242842 datasets obtained. This number may be influenced by duplicates and/or datasets missed due to cloudfront rate limiting, but it seemed close enough that I felt these were acceptable uncertainties.
```
{ 'f9880479-bf5c-477c-ba8e-0651a0e054a5': ['new-york-lottery', 'powerball', 'results', 'winning'],
  '4cfdbe83-666d-4e72-b8c6-31dbcdd8dbf0': ['assistance-transactions', 'banks', 'failures', 'financial-institution'],
  ...
}
```
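A minimal sketch of such a crawl, assuming the standard CKAN rows/start pagination (the real crawler would need retries/backoff for the rate limiting mentioned above):

```python
import requests

API = 'https://catalog.data.gov/api/action/package_search'

def crawl_dataset_keywords(page_size=1000):
    # Build {dataset id: [tag names]} by paging through package_search.
    datasets, start = {}, 0
    while True:
        r = requests.get(API, params={'rows': page_size, 'start': start})
        results = r.json()['result']['results']
        if not results:
            return datasets
        for pkg in results:
            datasets[pkg['id']] = [t['name'] for t in pkg.get('tags', [])]
        start += page_size
```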
curl "https://catalog.data.gov/api/action/package_search?facet.field=[%22tags%22]&facet.limit=-1&facet.mincount=2&rows=0" | jq '.result.facets.tags' > keywords.json
The dataset-keyword dictionary was then processed into the N x N graph matrix. The full matrix form resulted in a 17.3G file, which is why I need to use my personal computer; the dictionary form is only 13M. This will take some time to fully complete. Also... we might want to limit the results in some way, because it will group 93691 keywords into N groups, and 93K words are a lot to parse. As an initial pass, I ran it on the keywords that appear 1000 times or more (483 words). This created 96 groups; 9 of them have more than 1 word in them and would therefore be the most helpful in coming up with a category.
I'm going to optimize three parameters to determine the most diverse distribution of words (i.e. the most groups with more than one word). These three parameters are: (1) the relatedness tolerance between words in a group, (2) when to create a new group, and (3) how many words to include in the analysis.
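As a sketch, the sweep might look like this, assuming a hypothetical group_keywords routine standing in for the gist's grouping code and hypothetical parameter grids:

```python
from itertools import product

tolerances = [2, 5, 10]            # relatedness tolerance within a group
new_group_cutoffs = [0.25, 0.5]    # when to create a new group
word_counts = [483, 1000, 5000]    # how many top keywords to analyze

best = None
for tol, cutoff, n in product(tolerances, new_group_cutoffs, word_counts):
    # group_keywords is assumed to exist; it returns a list of groups.
    groups = group_keywords(top_keywords[:n], tolerance=tol, cutoff=cutoff)
    multi_word = sum(1 for g in groups if len(g) > 1)
    if best is None or multi_word > best[0]:
        best = (multi_word, tol, cutoff, n)
```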
While these images aren't exactly readable, I thought it would be interesting for people to look at them.
The first image is a graph of 2000 random connections between keywords on datasets. It has much clearer cluster definitions (but it's not meaningful, because a random sample does not reflect how often those keywords actually appear on catalog).
The second image does not have as many well-defined clusters, but it is a graph of the connections between keywords on catalog that appear at least 100 times (483 keywords). This means that these keywords are used together on at least 100 datasets.
I want to get larger graphs, but the graphs would be even less readable and the time to compute would be super long.
The total number of connections that I have tracked: 500473
The total number of connections between the top 483 keywords and other words: 238162
The total number of connections in "graph_1000_100.png": 1693
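For reference, a plot like the ones above can be produced with networkx; a minimal sketch, assuming the graph dictionary from earlier and a keywords list mapping an index back to its tag name:

```python
import networkx as nx
import matplotlib.pyplot as plt

# `graph` is the {'i,j': weight} dictionary; `keywords` maps an
# index back to its tag name.
G = nx.Graph()
for key, weight in graph.items():
    i, j = map(int, key.split(','))
    G.add_edge(keywords[i], keywords[j], weight=weight)

pos = nx.spring_layout(G, seed=42)  # force-directed layout
nx.draw(G, pos, node_size=10, width=0.2, with_labels=False)
plt.savefig('graph.png', dpi=300)
```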
My next logical step would be to do text summarization of these groups of words (if we want less work as humans), or just to get the list of words into a readable format so that we can parse it as humans.
The job of grouping datasets into logical groups that improve discoverability and accessibility is not a simple one. I have explored two paths in the above analyses: (1) using an off-the-shelf word model (WordNet) to process the tags and perform a similarity comparison, and (2) building a custom word model to process the tags and highlight relational similarities to group tags. Both algorithms used tags as the driving input for creating groups. As @jbrown-xentity noted, tags from an agency are typically all created by the same publisher. From this perspective, all of the tags from one publisher might have a biased similarity towards the publisher and not the dataset itself. I don't think this is entirely true, but it is a valid concern nonetheless.

A proper analysis would take all of the non-standard text from a dataset (title, description, tags and any unique extras fields) to build the model, which would have a more complete picture of datasets. Even then, the descriptions might also suffer from writer bias, so this is not a foolproof method either. The focus on tags in this ticket was: (1) to fine-tune scope and (2) to focus on the algorithm design via data discovery. Regardless of writer biases, we only have the data that we have; if writer bias is an inhibiting factor, we need to raise that with the agencies and make sure they intended for the datasets to be worded the way they are, and that their wording is accurate and consistent with the data it is describing. This collaboration is not easy, but it is a necessary part if we want to remove biases from our analysis.
Many points have been mentioned in the previous comments. As the key takeaways:
```
>>> from nltk.corpus import wordnet
>>> a = wordnet.synsets('water')
>>> a
[Synset('water.n.01'), Synset('body_of_water.n.01'), Synset('water.n.03'), Synset('water_system.n.02'), Synset('urine.n.01'), Synset('water.n.06'), Synset('water.v.01'), Synset('water.v.02'), Synset('water.v.03'), Synset('water.v.04')]
>>> a[1].hypernyms()
[Synset('thing.n.12')]
>>> a[2].hypernyms()
[Synset('element.n.05')]
>>> a[3].hypernyms()
[Synset('facility.n.01')]
>>> a[4].hypernyms()
[Synset('body_waste.n.01')]
>>> a[5].hypernyms()
[Synset('food.n.01'), Synset('liquid.n.01'), Synset('nutrient.n.02')]
>>> a[1].shortest_path_distance(a[1].hypernyms()[0])
1
>>> a[1].shortest_path_distance(a[2])
6
```
(wordnet was one option; there are many more)

The federal government itself is a taxonomy. There is the Executive Branch. Within the Executive Branch, there are a host of agencies, such as the Department of Defense. The Department of Defense then has agencies, such as the Department of the Navy. The Department of the Navy then has sub-agencies, such as NAVAIR, NAVSEA, NAVSUB, et cetera. Each of those then has divisions, such as Aircraft Division or Weapons Division. While the taxonomy that exists by design of the government is helpful, it is not complete or otherwise self-describing enough to use as the sole basis of our analysis. Each agency would have its own definition for words like health, finance, education and transportation.
Creating a universal taxonomy that can be applied to such a wide range of data types and sources may not be possible; however, a system that aggregates all of the different taxonomies from each agency might be possible. Either way, we need to build a reference to understand relationships between data.
Thanks for doing this @nickumia-reisys, this is good to have for future discussion.
User Story
In order to identify Subject Areas, the data.gov User Engagement team wants to capture the most used keywords for datasets and the number of datasets with each keyword.
Acceptance Criteria
Background
Security Considerations (required)
Sketch