Thanks for the suggestion @HerschelC. We have definitely been thinking about how we can leverage Dat to make this data easier to access, version, and grok. I think we'll start to think more seriously about it when Dat is a little more mature, but it's definitely something to keep an eye on.
Hopefully we'll see some architectural constructs for review soon. I know that 18F is focused on agile development, but most of what I've seen is (great) web development. Data development is an entirely different discipline with a unique set of challenges and tools. There is agile in data development (doing data development right has meant using incremental methodologies for well over a decade). But I'm not seeing some of the basic planning/design artifacts that I would expect in data development - like a data model (which is not just an XML schema, IMO). I know 18F defaults to open, but perhaps 18F is being overridden in this regard for whatever (good) reason. A data model is key to understanding what questions can (and can't) be easily answered with the data.
I understand from your background that you know data analysis; I'm just voicing a concern shared by many. We have no idea what is coming technically, and we hope to have an opportunity to talk about it before it arrives. (I'm speaking personally among data folk and not on behalf of any organization.) There is a web component to this exercise - a website and some APIs - but that's mainly front-end stuff to me. Is there an intent to offer true analytical capabilities on the data? How? Will I have to download five years of bulk data just to run a trend analysis on an object class, for example? I know that Christina Ho said early on that there wouldn't be a traditional data warehouse - but what is being considered then? (And how is a dialog around it occurring?) Inquiring minds want to know...
So, up until this point, 18F has not really been engaged on the software development side; our role has been to promote and assist with public engagement around the data definitions and the data standard. There's an interesting governance structure at play here: one department (OMB) is responsible for the data definitions of the statutorily defined elements, while another (Treasury) is responsible for implementing the standard and everything else. However, in the next couple of weeks we will be conducting some user-centered design and discovery for the new version of USASpending. I am sure we will pull heavily from the users who have been so helpful and engaged in providing their insights in this forum :).
We (18F) have done some of our own experimenting in this repo: https://github.com/18F/data-act-schemas/, which is mostly just us playing around with what a basic award-level data model could look like. Are you looking for any specific types of artifacts? We're big fans of pictures, but I'm certainly interested in generating something else that would be of higher value for you and the other users providing feedback here.
I will confess that I am certainly more familiar with agile software development than agile data development, but in general my (personal) philosophy is this: start with small subsets of the data, see how they fit or conflict with a basic model, see what lessons you can learn, and iterate. @bsweger probably has some more cogent thoughts on this topic, but I have always been a big fan of the Open States Project at the Sunlight Foundation (full disclosure, I used to work there). The data collected by state legislatures is as varied as the financial data collected by agencies, but they were able to define a core set of required elements, with flexible parts of the schema that still allowed rich information to be collected from a subset of states, even if it wasn't applicable to all states. I personally believe that the DATA Act will not be successful unless agencies can "dogfood" the data they are required to report for compliance. This means providing a place in the schema where they can consolidate and report data that is specifically meaningful to them, while still being able to leverage the common analytical tools of the new USASpending.
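To make that concrete, here's a rough sketch in Python (purely illustrative – the field names and the `extensions` block are my own invention, not anything from the actual schema work) of a record with a small required core plus room for agency-specific extensions:

```python
# A minimal sketch (not the actual DATA Act schema): a core set of required
# award-level elements plus a free-form "extensions" area an agency can use
# for data that is meaningful to it alone.

REQUIRED_CORE_FIELDS = {"award_id", "awarding_agency", "recipient_name", "obligation_amount"}


def validate_award(record: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing required field: {field}" for field in REQUIRED_CORE_FIELDS - record.keys()]
    if not isinstance(record.get("extensions", {}), dict):
        problems.append("extensions must be a mapping of agency-specific fields")
    return problems


award = {
    "award_id": "ABC-2015-001",
    "awarding_agency": "097",
    "recipient_name": "Example Corp",
    "obligation_amount": 125000.00,
    # Agency-specific fields live here without breaking the common core.
    "extensions": {"internal_program_code": "X12"},
}

print(validate_award(award))  # -> []
```

The point is just that the common core can be enforced strictly while the extension area stays open-ended for each agency.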
One of the interesting things about DATA Act implementation is that there really is a mandate to serve both very casual users of this data and the more sophisticated users (like yourself and many others here) who want very specific information presented in a specific way. I've seen many sites that try to serve such distinctly different audiences fail, so I'm hoping this is something the discovery process over the next few weeks can help suss out.
It's my understanding that a lot of the dialogue around the technical architecture is still very formative, so I don't know that there is much to report out, but as long as we're involved in the project we'll do everything we can to keep users engaged.
I should probably note that I speak only for myself and 18F. Please know that we certainly do appreciate the contributions you and others have provided on the data elements thus far.
Thanks for the repo link! I do so love pictures. They immediately convey so much more information than a bunch of text. I thought of several more elements in just the few minutes I spent scanning the prototype model. Thanks for sharing. Hopefully we can get to a fully attributed data model that is released along with the data standards. I'd prefer a traditional ER diagram for well-structured data such as we're defining. I think the traditional star/snowflake logical schema, with facts and dimensions, is the best way to communicate to a stakeholder group interested in what they can analyze (facts) and by what attributes (dimensions). The analysts immediately jump in with "but I need to analyze X by Y too". This may be a step deeper than you want to go – but it certainly helps with teasing out needs in user stories.
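For concreteness, here's a toy star-schema sketch (SQLite purely for illustration; the table and column names are invented and this isn't a proposed physical design): obligations as the fact, agency and recipient as dimensions, and an "analyze X by Y" query on top.

```python
# A toy star schema in SQLite: one fact table (obligations) joined to
# dimension tables (agency, recipient). Names and values are invented;
# this is illustration, not a proposed physical design.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_agency    (agency_id INTEGER PRIMARY KEY, agency_name TEXT);
CREATE TABLE dim_recipient (recipient_id INTEGER PRIMARY KEY, recipient_name TEXT);
CREATE TABLE fact_obligation (
    obligation_id INTEGER PRIMARY KEY,
    agency_id     INTEGER REFERENCES dim_agency (agency_id),
    recipient_id  INTEGER REFERENCES dim_recipient (recipient_id),
    fiscal_year   INTEGER,
    amount        REAL
);
""")
conn.execute("INSERT INTO dim_agency VALUES (1, 'Department of Example')")
conn.execute("INSERT INTO dim_recipient VALUES (1, 'Example Corp')")
conn.execute("INSERT INTO fact_obligation VALUES (1, 1, 1, 2015, 125000.0)")

# "Analyze X by Y": total obligations (fact) by agency and fiscal year (dimensions).
for row in conn.execute("""
    SELECT a.agency_name, f.fiscal_year, SUM(f.amount)
    FROM fact_obligation f
    JOIN dim_agency a USING (agency_id)
    GROUP BY a.agency_name, f.fiscal_year
"""):
    print(row)
```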
Agile is agile – which we used to call iterative before the agile term was coined. I oversimplify things – but it's a management methodology: taking a big problem and breaking it into small, consumable chunks, with heavy emphasis on building on incremental success as opposed to big bangs. Our actual methodology is called the "IUI Incremental Approach" to emphasize small pieces of work building toward the future state. It's been an interesting challenge adapting it to government work. In simplest terms, I'd say the primary difference is that we have 2-3 "tracks". We usually break projects into two: getting data in (GDI) and getting data out (GDO). The third would be for a project with heavy analytics, where we need to set up separate data analysis environments/sandboxes for the stats people who don't appreciate their data being updated while they're building a statistical model. Iterations occur within both GDI and GDO, though typically on different intervals.

GDI is the data integration and quality effort – bringing online subject areas or data sets that enable some sort of new analysis. These take time (new capability is delivered on 90-day cycles in a mature environment). GDI takes time primarily because this is when you're dealing with people and change – getting everyone to agree on definitions and tracking down the data they need. It once took me six months to get a global client to agree on something "simple" like defining the four global quadrants (N S E W). GDI is less about technology and more about people: executing data quality assessments, implementing quality controls, etc. It's a lot of facilitation and collaboration across stakeholder groups that normally aren't accustomed to agreeing with one another's views of the world. This is also where prototyping helps (liked the article on protosketching, btw).
GDO is the front end: report building and generation, building the metadata layer, etc. This is where users start seeing their data "for real". Simple prototyping happens during GDI, since we have to determine what data they need and how it could fit together, but most users can provide a good basic set of data needs for the initial iteration (or we can look at what they use now). It isn't until GDO that we really start tweaking. We have to adjust the data model and system architecture to meet the analysis needs of the users, e.g., optimizing the physical model for the demand patterns. This is more like typical agile web development (and these iterations can actually happen very quickly, depending on what tools are used). Through this effort, you're managing two backlogs, GDI and GDO. The iterative nature means that if you find a high-priority element need during GDO, you can add it to the GDI backlog for inclusion in the next iteration. Users know that new data capability is delivered on a steady cadence – this is key. They can have new data within 90 days (in a mature environment). In a steady-state system, GDI becomes more of a maintenance task – iterations can be a lot smaller. One challenge with making them too small is allowing for sufficient regression testing. The systems also often require historical data loads for new elements, which come with their own demands – not something you'd want to try doing every day.
I completely agree with the idea that users (states or agencies) need flexibility to augment the DATA Act data with their own data or their own views of the data to serve their needs. The way to get quality out of this effort is to make a system agencies can use. Too many times I have been brought in to find out why an expensive analytics system isn't working – only to learn that the operations people were using their own separate system to run their piece of the business. Problem solved in three days. The beauty of modern reporting, with high-quality integrated base data, is that different groups can create whatever view they want on the data with a crosswalk. The Dat idea, by the way, is what I was thinking would be cool for augmenting existing data with your own data. I think volume is an issue, though, unless it's treated as a "feed" to their own system. Eager to watch how that unfolds.
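To illustrate the crosswalk idea with a deliberately made-up example (none of these codes come from any real standard): an agency keeps its own local categories, and a small mapping table reconciles them to the common codes, so both views are built from the same underlying records.

```python
# A toy crosswalk: an agency's local spending categories mapped onto
# common object class codes. All codes here are invented.

CROSSWALK = {  # local category -> common object class code
    "TRAVEL-DOM": "21.0",
    "TRAVEL-INTL": "21.0",
    "IT-HARDWARE": "31.0",
}

records = [
    {"local_category": "TRAVEL-DOM", "obligation": 1000.0},
    {"local_category": "TRAVEL-INTL", "obligation": 300.0},
    {"local_category": "IT-HARDWARE", "obligation": 5400.0},
]


def standard_view(rows):
    """Roll the agency's local view up to the common codes via the crosswalk."""
    totals = {}
    for row in rows:
        code = CROSSWALK[row["local_category"]]
        totals[code] = totals.get(code, 0.0) + row["obligation"]
    return totals


print(standard_view(records))  # {'21.0': 1300.0, '31.0': 5400.0}
```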
I also think this concept applies to the need you raise to serve different user bases. This is common in the analytics world. We typically have three segments: casual consumers (show me a canned report/pivot table), "power" or advanced users who want to build their own reports (think drag and drop of facts and dimensions / metrics and attributes), and analytics users (people developing statistical models). Systems today are set up to handle these three sets – the key is the base, integrated store of atomic data elements, which can be aggregated to serve any need. A big risk addressed by this setup is a casual user trying to do something the model wasn't tuned to handle (and bringing your system to its knees). Serving multiple constituents is definitely doable – that problem has been solved. My point, though, is that there are technical constructs needed to make this happen; I'm not sure it can be done "on the fly". Ultimately, providing the set of integrated, quality base data along with quality reference data would allow anyone to pull information into their own technical architecture for analysis. I can't emphasize enough the need for integrated data – otherwise we'll end up with different people reporting different numbers on the same data. Everything must reconcile top to bottom. This means everyone must agree and submit data with like meaning. (GDI is the pain.)
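A rough sketch of what "reconcile top to bottom" looks like in practice (field names invented): every aggregate served to a casual user is rolled up from the same atomic records the analysts get, so the numbers can't diverge.

```python
# A toy "reconcile top to bottom" check: canned rollups for casual users are
# derived from the same atomic records the analysts get, and the totals must
# agree. Field names are invented.

from collections import defaultdict

atomic = [  # the integrated, atomic store
    {"agency": "097", "object_class": "21.0", "obligation": 100.0},
    {"agency": "097", "object_class": "31.0", "obligation": 250.0},
    {"agency": "012", "object_class": "21.0", "obligation": 75.0},
]


def rollup(rows, by):
    totals = defaultdict(float)
    for row in rows:
        totals[row[by]] += row["obligation"]
    return dict(totals)


by_agency = rollup(atomic, "agency")  # canned view for casual users
grand_total = sum(row["obligation"] for row in atomic)

# Reconciliation: the rollup must sum back to the atomic total.
assert abs(sum(by_agency.values()) - grand_total) < 1e-9
print(by_agency, grand_total)
```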
I appreciate informal dialog. That's the point of our open and transparent conversation on GitHub, right? I also speak only for myself. Just sitting around a virtual table and sharing our ideas.
This is a great conversation, and I join @kaitlin and @HerschelC in sharing my personal thoughts.
@HerschelC Thanks for sharing your approach—it’s a valuable perspective. As you can see from the activity on this repo, there’s a lot of work currently happening around Getting Data In. As you point out, that’s hard and is essential to making sure the various audiences can use the results.
The iterative approach you outline is similar to my own data warehousing/BI experience, and I’m also a fan of entity relationship diagrams. In my years of working with federal spending data, a star schema is always how I think of it (e.g., obligations = fact table, agency/recipient/etc. = dimensions). However, I’ve not had much luck (at least in previous jobs) using such diagrams to communicate with stakeholders (researchers with no database/data modeling background).
For a different audience, I see how a logical ERD could be a valuable tool for understanding the data’s relationships—is a logical ERD what you’re asking for? I personally think it’s too early in the process to make decisions about the physical datastore. That said, I believe a good approach would be one that separates the front-end analysis and visualization tools from the physical implementation, i.e., not reporting tools that are tightly coupled to database tables.
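As a loose illustration of that separation (the interface and names below are hypothetical, not a real USASpending design): the reporting code depends only on a narrow query interface, so the physical store behind it could be a warehouse, an API, or flat files without the reports changing.

```python
# A loose sketch of decoupling reports from storage (hypothetical names).
# The front end depends only on the SpendingSource interface, so the
# implementation behind it can change without rewriting the reports.

from typing import Protocol


class SpendingSource(Protocol):
    def obligations_by(self, dimension: str) -> dict: ...


class InMemorySource:
    """Stand-in implementation; a real one might wrap a warehouse or an API."""

    def __init__(self, rows):
        self.rows = rows

    def obligations_by(self, dimension: str) -> dict:
        totals = {}
        for row in self.rows:
            totals[row[dimension]] = totals.get(row[dimension], 0.0) + row["amount"]
        return totals


def render_report(source: SpendingSource) -> str:
    """Front-end code: knows nothing about tables or physical storage."""
    totals = source.obligations_by("agency")
    return "\n".join(f"{name}: {amount:,.2f}" for name, amount in sorted(totals.items()))


print(render_report(InMemorySource([{"agency": "097", "amount": 100.0},
                                    {"agency": "012", "amount": 50.0}])))
```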
Again, my views only. Thanks again for the conversation—agreed that sitting around the virtual table and sharing our ideas is helpful.
Getting business stakeholders up to speed on ERDs does take a little training for the audience - but so would any modeling concept. I'm definitely not talking about a physical model at this point. And by ERD, I'm talking about star schema ERDs, not 3NF. I actually refer to the model as a "Business Information Model", which is essentially a logical data model, but maybe a step in the less-technical direction (e.g., the focus is simplicity for business interpretation). I prefer the BIM term because it sounds less technical. I color-code the entities/subject areas (which also ties into data governance - who owns them at that level of the governance hierarchy). Getting business stakeholder buy-in on the methods is always hard, but they're going to have to step up and learn as part of taking ownership of their data (versus relegating ownership to technical people who have great intentions but lack business domain expertise). I also tend to present in consumable chunks/subject areas. Rarely would I throw an enterprise model on the screen (though they can make nice wall art).
As with everything, it's an iterative process. I certainly wouldn't pop up a fully attributed model in the early stages. I start with lunch-and-learn type events talking about entities and attributes, then relationships – as I guess we would with any modeling intro, though I veer away from calling it modeling at that point. What we're working on, essentially embedding the use of data/information into the business, is often a change to the way many organizations operate; it requires lots of communication and baby steps. But you absolutely must have a willing partner in the business to assume the responsibility that comes with taking ownership of their data. Part of this is using modern methods for talking about data. Once they get it, though, wow. They can immediately see modeling issues (which is my personal joy - when they take over the task). I think stars are pretty easy for stakeholders to understand, so long as they want to understand.
I'm really not tied to seeing any particular model format – but I would like to see a graphical depiction of the data. Sorting through the words and between the pages on the website is a chore when a simple picture would serve the need. I'm a little unclear on how we'd separate a front-end model from a back-end model, but I think that comment was in reference to a physical model (which we don't need yet). I'm talking essentially about what I call a BIM – a way for users to understand the data and the relationships across the data, as well as possible constraints in those relationships (e.g., if dimension X isn't related to hierarchy Y of fact Z, then we can't report by X). As with everything, a little something is good; it doesn't have to be the finished product. I'd like to see these released in iterations along with the data and exchange standards - certainly, I would hope, as a way of enabling agencies to best understand those standards. Any good data modeler will immediately start creating their own. It would be nice to do it once and have it reused rather than having dozens of new copies created.
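That "can we report by X?" constraint is easy to capture explicitly. Here's a toy sketch (the facts, dimensions, and relationships below are invented) of the kind of rule a BIM would document:

```python
# A toy version of a BIM relationship constraint: which dimensions each fact
# can be reported by. Facts, dimensions, and relationships are invented.

FACT_DIMENSIONS = {
    "obligation": {"agency", "object_class", "fiscal_year"},
    "outlay":     {"agency", "fiscal_year"},  # say object_class isn't captured here
}


def can_report(fact: str, *dimensions: str) -> bool:
    """True only if every requested dimension is related to the fact."""
    allowed = FACT_DIMENSIONS.get(fact, set())
    return all(dim in allowed for dim in dimensions)


print(can_report("obligation", "agency", "object_class"))  # True
print(can_report("outlay", "object_class"))                # False: not related, so we can't report by it
```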
One final thought on modeling and extensibility – agencies (the really smart, forward-thinking ones :) ) will use this as an opportunity to improve their internal financial reporting. They may want to extend the model by incorporating their own data or levels of granularity. Anything the central group (OMB/Treasury or the council) can do to create work products and tools that can be leveraged is helpful, IMO. Sharing best practices and not reinventing the wheel is important.
Good convo, going to close for now.
Not really an issue; more of a connecting-the-dots idea for consideration.
I recently read with interest: https://18f.gsa.gov/2015/04/23/the-dat-team-talks-data-streams/
Essentially, it enables data mashups while maintaining data provenance. It occurred to me that this type of system would be interesting applied to DATA Act data. External groups could append data to create a more meaningful data set. Technically, I'm wondering how this would be handled with large volumes of data, but conceptually I like the idea. It may be worth some consideration.
This could perhaps be extended to enable crowdsourcing of data quality in some sort of feedback mechanism.
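A back-of-the-napkin sketch of what that might look like (this isn't Dat itself, just a toy illustration with invented structures): contributed rows keep their source and timestamp, and a simple feedback log captures crowdsourced quality flags.

```python
# A toy append-with-provenance store (not Dat, just an illustration with
# invented structures): each contributed row keeps its source and timestamp,
# and a feedback log captures crowdsourced data-quality flags.

from datetime import datetime, timezone

dataset = []   # the shared, append-only data set
feedback = []  # crowdsourced data-quality flags


def append_rows(rows, source):
    """External groups append rows without losing track of where they came from."""
    stamp = datetime.now(timezone.utc).isoformat()
    for row in rows:
        dataset.append({**row, "_source": source, "_added_at": stamp})


def flag_quality_issue(row_index, note, reporter):
    feedback.append({"row": row_index, "note": note, "reporter": reporter})


append_rows([{"award_id": "ABC-1", "obligation": 100.0}], source="treasury-bulk")
append_rows([{"award_id": "ABC-1", "county_fips": "51059"}], source="research-group-x")
flag_quality_issue(1, "county looks wrong for this recipient", "analyst-42")

print(dataset)
print(feedback)
```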