Chicago / metalicious

An open source data dictionary which can be deployed to track the metadata of one or more databases.

Add field for "Published to Data Portal" #19

Open tomschenkjr opened 10 years ago

tomschenkjr commented 10 years ago

The field/variable element should include a flag of whether the data was published to an open data portal.

danxoneil commented 10 years ago

With, perhaps, a direct link to the corresponding page on the portal.

jpvelez commented 10 years ago

Here's a simple, relatively low-maintenance way to flag and link:

The databases in the dictionary are the sources for all/most of the datasets on a city's data portal.

To get complete transparency, and to allow cities to keep track of what ETL scripts they have floating around that get data from these systems and put it on their portal, it would be excellent to be able to associate databases in metalicious with datasets on a data portal.

You could just associate a list of Socrata dataset IDs with a given database in metalicious, and fetch any additional data you might want about these datasets - their name, number of downloads, when they were last updated, whatever - from the Socrata Open Data API.
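A minimal sketch of that idea, not metalicious code. The `Database` class and the placeholder ID `abcd-1234` are hypothetical. The `/api/views/<id>.json` metadata endpoint and the `name`, `downloadCount`, and `rowsUpdatedAt` field names are assumptions about the Socrata response that should be checked against the portal's documentation.

```python
import json
from dataclasses import dataclass, field
from urllib.parse import quote
from urllib.request import urlopen


@dataclass
class Database:
    """Hypothetical metalicious database entry holding only Socrata IDs."""
    name: str
    socrata_ids: list[str] = field(default_factory=list)


def metadata_url(domain: str, dataset_id: str) -> str:
    # Assumed metadata endpoint for a Socrata dataset (4x4 ID).
    return f"https://{domain}/api/views/{quote(dataset_id)}.json"


def summarize(view: dict) -> dict:
    # Keep only the fields mentioned above; the key names are assumptions.
    return {
        "name": view.get("name"),
        "downloads": view.get("downloadCount"),
        "last_updated": view.get("rowsUpdatedAt"),
    }


def fetch_summary(domain: str, dataset_id: str) -> dict:
    # Performs a network request; everything else is fetched, not stored.
    with urlopen(metadata_url(domain, dataset_id), timeout=10) as resp:
        return summarize(json.load(resp))
```

The only thing stored per database is the list of IDs; names, download counts, and update times would be looked up on demand, so they never go stale in the dictionary.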

This would be a similar approach to how we built the project repository for the open gov hack night website - by only entering in GitHub repos and fetching all other data from the GitHub API. The code for that is here.

jpvelez commented 10 years ago

Come to think of it, there's probably some easyish way of pulling databases, data portal datasets, and apps together. If developers added civic.json (h/t @ryanbriones) files to their repos that simply pointed to the Socrata datasets they use, then you could imagine metalicious easily listing downstream project repos for each database by hitting the GitHub API.
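Roughly, the lookup could work like the sketch below. It assumes each repo's civic.json has already been downloaded and parsed, and that it carries a `socrata_datasets` list; that key name is hypothetical, as are the repo names and IDs in the usage note.

```python
def downstream_repos(
    database_datasets: dict[str, set[str]],
    repo_manifests: dict[str, dict],
) -> dict[str, list[str]]:
    """Map each database to the repos whose civic.json names one of its datasets.

    database_datasets: database name -> Socrata dataset IDs linked to it.
    repo_manifests: repo full name -> parsed civic.json content.
    """
    result: dict[str, list[str]] = {}
    for database, dataset_ids in database_datasets.items():
        repos = []
        for repo, manifest in repo_manifests.items():
            # "socrata_datasets" is an assumed key, not an established schema.
            used = set(manifest.get("socrata_datasets", []))
            if used & dataset_ids:
                repos.append(repo)
        result[database] = sorted(repos)
    return result
```

For example, with `{"FMPS": {"aaaa-1111"}}` and a repo `example/app` whose manifest lists `aaaa-1111`, the result is `{"FMPS": ["example/app"]}`. Repos without the key simply do not show up.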

Thoughts @derekeder, @kfogel, @evz?

kfogel commented 10 years ago

Well, it's always the question of who is the "you" in "you could just associate a list of ${THINGS} with ${OTHER_THINGS}", I think. It might make sense to expect someone maintaining a metalicious database to also include Socrata dataset IDs -- after all, that person or org is already maintaining the metalicious database in the first place, and this is just another bit of metadata (more or less).

But moving from that to the crowd of unassociated developers who write apps that use a particular database is a different matter. Some of those developers will get the memo and use civic.json to associate their app with the upstream DB, and some won't. It's not so much a technical question as an organizational one.

Sorry; not sure if that's really a constructive comment. But my point is that the first scenario is possibly something that could be relied on, whereas the second probably isn't.

tomschenkjr commented 10 years ago

@jpvelez I agree with the first comment and also agree with @kfogel regarding the latter thought.

Entering the Socrata 4x4 is possible. A difficulty to consider--which I don't think is insurmountable, but considerable--is that fields from multiple tables in databases may be combined to create one dataset on the data portal. I would consider this a hard-coded relationship.

I keep wondering if the better strategy is to match based on metadata elements produced by Socrata's JSON, XML, or RDF output, having metalicious equate those fields with elements from its own API. Presuming we ever get around to proper RDF tagging, that could be powerful.

/cc: @ianjkalin

jpvelez commented 10 years ago

The simplest thing to do would be to have a many to many relationship between databases and Socrata datasets, and not get into the weeds of tables at all.
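As a sketch, that many-to-many link is just a join table. The table and column names below are hypothetical and are not taken from the metalicious schema; `sqlite3` is used only to keep the example self-contained.

```python
import sqlite3


def make_schema() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE source_database (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
        CREATE TABLE portal_dataset (socrata_id TEXT PRIMARY KEY);
        -- Join table: one row per (database, dataset) pair, no table detail.
        CREATE TABLE database_dataset (
            database_id INTEGER NOT NULL REFERENCES source_database(id),
            socrata_id TEXT NOT NULL REFERENCES portal_dataset(socrata_id),
            PRIMARY KEY (database_id, socrata_id)
        );
    """)
    return conn


def link(conn: sqlite3.Connection, db_name: str, socrata_id: str) -> None:
    # INSERT OR IGNORE makes repeated links harmless.
    conn.execute("INSERT OR IGNORE INTO source_database (name) VALUES (?)", (db_name,))
    conn.execute("INSERT OR IGNORE INTO portal_dataset (socrata_id) VALUES (?)", (socrata_id,))
    conn.execute(
        "INSERT OR IGNORE INTO database_dataset "
        "SELECT id, ? FROM source_database WHERE name = ?",
        (socrata_id, db_name),
    )


def datasets_for(conn: sqlite3.Connection, db_name: str) -> list[str]:
    rows = conn.execute(
        "SELECT socrata_id FROM database_dataset "
        "JOIN source_database ON id = database_id "
        "WHERE name = ? ORDER BY socrata_id",
        (db_name,),
    )
    return [r[0] for r in rows]


def databases_for(conn: sqlite3.Connection, socrata_id: str) -> list[str]:
    rows = conn.execute(
        "SELECT name FROM database_dataset "
        "JOIN source_database ON id = database_id "
        "WHERE socrata_id = ? ORDER BY name",
        (socrata_id,),
    )
    return [r[0] for r in rows]
```

Both directions are a single query: which datasets come out of a database, and which databases feed a dataset.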

Matching columns to columns seems like a pain in the ass, unless there's some clever way to do it programmatically.

tomschenkjr commented 10 years ago

So is the proposition to equate a Socrata dataset with a database? For instance, our ERP, FMPS, has purchasing data in addition to a lot of other stuff. We have a purchasing dataset on the portal. So would we be equating the FMPS database with the portal dataset?

In my initial assessment, I was thinking of equating the columns of Socrata with the columns in Metalicious. Pain-in-the-ass, yes, but very robust and useful (e.g., ETL management, solid accountability for gov't openness).

What would be the most useful for everyone else on the civic developer side? Could a database-to-dataset link be sufficient?

evz commented 10 years ago

If you added column to column reckoning, you'd kinda end up with a way to relate databases to data sets, right? While it does seem like a pain, I'd say it's probably going to be the best way to ensure you're not engineering yourself into a corner.
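To illustrate, here is a small sketch in which the database-to-dataset links fall out of column-level mappings. The `ColumnMapping` record, its field names, and the sample table and column names in the usage note are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMapping:
    """One source column feeding one published Socrata column (hypothetical)."""
    database: str
    table: str
    column: str
    socrata_id: str
    dataset_column: str


def derive_links(mappings: list[ColumnMapping]) -> set[tuple[str, str]]:
    # The coarse database-to-dataset relation is a projection of the column rows.
    return {(m.database, m.socrata_id) for m in mappings}


def sources_for(
    mappings: list[ColumnMapping], socrata_id: str, dataset_column: str
) -> list[tuple[str, str, str]]:
    # Trace one published column back to its (database, table, column) sources.
    return sorted(
        (m.database, m.table, m.column)
        for m in mappings
        if m.socrata_id == socrata_id and m.dataset_column == dataset_column
    )
```

A dataset built from several tables, say `purchase_orders` and `vendors` in FMPS, still collapses to a single `("FMPS", <dataset id>)` link, while the column rows keep the detail needed for ETL tracking.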

willpugh commented 10 years ago

On the Socrata side, I'd love to see the mapping from source columns to published dataset columns.

Tom, I think you are right that this could be very useful in terms of exposing the ETL needed and accountability.

Next step, though, would be to have tooling to hook up the originating columns to the Socrata columns at dataset import time.

derekeder commented 10 years ago

Putting my 2c in here:

Start with linking each database in datadictionary.cityofchicago.org to the relevant datasets on data.cityofchicago.org and vice-versa on each dataset in the data portal.

This will go a long way towards figuring out what data is open/available and what each field means on datasets that have been released on the data portal. The current system of publishing one-off docs kinda works (lots of these seem to be missing now, btw), but the data dictionary is a clearly superior replacement for it.