jpitts opened 6 years ago
I actually got some time to work on this today. Noticed a couple of things:
It's aggressively overengineered, but it should be pretty clear about what's wrong. A lot of the validation logic is really intended for the other direction, i.e. being strict about input validation, but the same types are used for the database extraction. E.g. it will make sure that identifiers are kebab-case without slashes (since slashes are needed to represent a path). It will also check whether titles are title case and issue a warning if you ask for more verbosity, but it will still let you load them. For the loader, we'd probably want a dry-run option so you can see such warnings if you want to, and of course errors (e.g. if you try to load a tree that has missing pieces, or specifies identifiers with weird characters).
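For illustration, the kebab-case check could look roughly like this. This is a sketch, not the tool's actual code, and `is_kebab_case` is a hypothetical name:

```rust
/// Returns true if `s` is strictly kebab-case: non-empty lowercase ASCII
/// words (letters or digits) separated by single hyphens, with no
/// leading, trailing, or doubled "-". Slashes are rejected too, since
/// they are reserved for representing paths.
fn is_kebab_case(s: &str) -> bool {
    !s.is_empty()
        && s.split('-').all(|word| {
            !word.is_empty()
                && word
                    .chars()
                    .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit())
        })
}

fn main() {
    assert!(is_kebab_case("human-rights"));
    // a dangling "-" like the one found in the actual data fails:
    assert!(!is_kebab_case("human-rights-"));
    // slashes are reserved for paths, so they fail too:
    assert!(!is_kebab_case("interests/human-rights"));
    println!("identifier checks behave as expected");
}
```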
Some Qs:
One alternative approach I thought about was having it create a new collection and then renaming that collection to be the taxonomy collection (dropping the old one in the process). I'm not sure how atomic the rename-and-drop approach is, but it is at least represented by one function in the API. This might also be easier to code for. In this case I'd probably go ahead and drop the name and path fields for the TOML and just use the key to determine these.
Not sure if you still need this. If not, let me know. It at least led me to find improvements in a couple of Rust libraries that I submitted PRs for. :)
I'd like to play around with this but I haven't been able to get my hands on a dump of the current taxonomy DB, can I find that anywhere?
Sorry, I tried to get things to the point where you could load as well so you could use the tool directly, but that got stuck on a minor deserialization issue.
Here's the cleaned-up representation as TOML and JSON, output from the current iteration of the tool (it does either, because once the serialization trait was added to the interface, it's generally trivial to serialize to any format that serde supports). This uses the key for the path rather than exposing the name and parent as separate fields. all.txt all_json.txt
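To illustrate the key-as-path idea (a sketch, not the tool's actual types, and the `parent/name` key format is an assumption based on the output files above): a key like `interests/human-rights` can be split to recover both the entry's own name and its parent path, so neither needs a separate field.

```rust
/// Split a path-style key into (parent_path, name).
/// Hypothetical helper; the real tool's representation may differ.
fn split_key(key: &str) -> (Option<&str>, &str) {
    match key.rsplit_once('/') {
        Some((parent, name)) => (Some(parent), name),
        None => (None, key), // no slash: a top-level taxonomy entry
    }
}

fn main() {
    assert_eq!(
        split_key("interests/human-rights"),
        (Some("interests"), "human-rights")
    );
    assert_eq!(split_key("interests"), (None, "interests"));
    println!("key splitting works as expected");
}
```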
Again, this is the cleaned-up data, which is slightly different from what I was originally given. Mainly there were some deviations from naming conventions that I fixed, eg there was a dangling "-" in something that was kebab case. I think there was also a circular dependency that I found earlier in development.
I think I've pushed all the current stuff to my branch in github as well.
General commands are like so:

```sh
cargo run -- load json
cargo run -- load toml
cargo run -- store json
cargo run -- store toml
```
This will implicitly download all dependencies, compile them, and compile the util. No environment setup needed beyond having cargo/Rust installed.
There's an issue with the toml library where it won't accept the data structure I'm storing the identifier path in. That's the main sticking point right now. IIRC I wrote most of the code for saving to the database, but I haven't tested it, so definitely back up your database before you do any storing.
TODO:
Not gonna deny there are really good arguments for moving this to python, so if somebody is gonna take that on, please let me know.
thanks @spease !
Is it intentional that `human-rights` should be a top-level taxonomy, or is that supposed to be a subtree of `interests`?
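For concreteness, the two placements would look roughly like this in the TOML output. The exact key and field names here are assumptions based on the path-style keys discussed earlier, not the actual file:

```toml
# as a top-level taxonomy:
["human-rights"]
title = "Human Rights"

# vs. as a subtree of interests:
["interests/human-rights"]
title = "Human Rights"
```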
@spease I've finally begun work prototyping a toolchain/repository for cross-brigade taxonomy indexing, I posted some notes here: https://github.com/codeforamerica/brigade-project-index/issues/9
Here is the taxonomy as I extracted it from the all_json.txt file shared earlier in this comment thread: https://github.com/codeforamerica/civic-tech-taxonomy/tree/sites/codeforsanfrancisco.org
The ask for the SF team: Open a pull request improving this method so that it can load your current taxonomy as directly as possible: https://github.com/codeforamerica/civic-tech-taxonomy/blob/2d0b72afdaa53ad3991c8daf94863feb68106add/import.js#L154
In our coming steps, we'll start iterating towards standard field names and formats for common attributes, and then towards processes for everyone to pull their own taxonomy back in from this format, as we gradually use PRs to move all the taxonomies towards being more complete and intermapped.
The output should be a TOML file.