jpitts opened 6 years ago
I actually got some time to work on this today. Noticed a couple of things:
It's aggressively overengineered, but it should be pretty clear about what's wrong. A lot of the validation logic is really intended for the other direction, i.e. being strict about input validation, but the same types are used for the database extraction. E.g. it will make sure that identifiers are kebab-case without slashes (since slashes are needed to represent a path). It will also check whether titles are title case and issue a warning if you ask for more verbosity, but it will still let you load them. For the loader, we'd probably want a dry-run option so you can see such warnings if you want to, and of course errors (e.g. if you try to load a tree that has missing pieces, or specifies identifiers with weird characters).
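For illustration, the kebab-case check could look roughly like this. This is a sketch, not the tool's actual code, and `is_kebab_case` is a hypothetical name:

```rust
/// Returns true if `s` is strictly kebab-case: non-empty lowercase ASCII
/// words (letters or digits) separated by single hyphens, with no
/// leading, trailing, or doubled "-". Slashes are rejected too, since
/// they are reserved for representing paths.
fn is_kebab_case(s: &str) -> bool {
    !s.is_empty()
        && s.split('-').all(|word| {
            !word.is_empty()
                && word
                    .chars()
                    .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit())
        })
}

fn main() {
    assert!(is_kebab_case("human-rights"));
    // a dangling "-" like the one found in the actual data fails:
    assert!(!is_kebab_case("human-rights-"));
    // slashes are reserved for paths, so they fail too:
    assert!(!is_kebab_case("interests/human-rights"));
    println!("identifier checks behave as expected");
}
```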
Some Qs:
One alternative approach I thought about was having it create a new collection and then renaming that collection to be the taxonomy collection (dropping the old one in the process). I'm not sure how atomic the rename-and-drop approach is, but it is at least represented by one function in the API. This might also be easier to code for. In this case I'd probably go ahead and drop the name and path fields for the TOML and just use the key to determine these.
Not sure if you still need this. If not, let me know. It at least led me to find improvements in a couple of Rust libraries that I submitted PRs for. :)
I'd like to play around with this but I haven't been able to get my hands on a dump of the current taxonomy DB, can I find that anywhere?
Sorry, I tried to get things to the point where you could load as well so you could use the tool directly, but that got stuck on a minor deserialization issue.
Here's the cleaned-up representation as TOML and JSON, output from the current iteration of the tool (it does either, because once the serialization trait was added to the interface, it's generally trivial to serialize to any format that serde supports). This uses the key for the path rather than exposing the name and parent as separate fields. all.txt all_json.txt
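To illustrate the key-as-path idea (a sketch, not the tool's actual types, and the `parent/name` key format is an assumption based on the output files above): a key like `interests/human-rights` can be split to recover both the entry's own name and its parent path, so neither needs a separate field.

```rust
/// Split a path-style key into (parent_path, name).
/// Hypothetical helper; the real tool's representation may differ.
fn split_key(key: &str) -> (Option<&str>, &str) {
    match key.rsplit_once('/') {
        Some((parent, name)) => (Some(parent), name),
        None => (None, key), // no slash: a top-level taxonomy entry
    }
}

fn main() {
    assert_eq!(
        split_key("interests/human-rights"),
        (Some("interests"), "human-rights")
    );
    assert_eq!(split_key("interests"), (None, "interests"));
    println!("key splitting works as expected");
}
```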
Again, this is the cleaned-up data, which is slightly different from what I was originally given. Mainly there were some deviations from naming conventions that I fixed, eg there was a dangling "-" in something that was kebab case. I think there was also a circular dependency that I found earlier in development.
I think I've pushed all the current stuff to my branch in github as well.
General commands are like so:

```sh
cargo run -- load json
cargo run -- load toml
cargo run -- store json
cargo run -- store toml
```
This will implicitly download all dependencies, compile them, and compile the util. No environment setup needed beyond having cargo/Rust installed.
There's an issue with the toml library where it won't accept the data structure I'm storing the identifier path in. That's the main sticking point right now. IIRC I wrote most of the code for saving to the database, but I haven't tested it, so definitely back up your database before you do any storing.
TODO:
Not gonna deny there are really good arguments for moving this to python, so if somebody is gonna take that on, please let me know.
thanks @spease !
Is it intentional that `human-rights` should be a top-level taxonomy, or is that supposed to be a subtree of `interests`?
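For concreteness, the two placements would look roughly like this in the TOML output. The exact key and field names here are assumptions based on the path-style keys discussed earlier, not the actual file:

```toml
# as a top-level taxonomy:
["human-rights"]
title = "Human Rights"

# vs. as a subtree of interests:
["interests/human-rights"]
title = "Human Rights"
```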
@spease I've finally begun work prototyping a toolchain/repository for cross-brigade taxonomy indexing, I posted some notes here: https://github.com/codeforamerica/brigade-project-index/issues/9
Here is the taxonomy as I extracted it from the all_json.txt file shared earlier in this comment thread: https://github.com/codeforamerica/civic-tech-taxonomy/tree/sites/codeforsanfrancisco.org
The ask for the SF team: Open a pull request improving this method so that it can load your current taxonomy as directly as possible: https://github.com/codeforamerica/civic-tech-taxonomy/blob/2d0b72afdaa53ad3991c8daf94863feb68106add/import.js#L154
In our coming steps, we'll start iterating towards standard field names and formats for common attributes, and then towards processes for everyone to pull their own taxonomy back in from this format, as we gradually use PRs to move all the taxonomies towards being more complete and intermapped.
The output should be a TOML file.