department-of-veterans-affairs / va-data


facilities.json is garbage #10

Closed WateredTrees closed 8 years ago

WateredTrees commented 9 years ago

The VA Facilities Locations file named facilities.json is garbage. Adeline Wilcox, Health System Specialist, Department of Veterans Affairs, Veterans Health Administration

QJF3 commented 9 years ago

@WateredTrees, it would help us understand and improve data quality if you could be more specific about the facilities.json file. Are you finding inaccurate records, incomplete records, missing records, or some combination of the three?

WateredTrees commented 9 years ago

It appears to me that the file named facilities.json has been removed from the repository. In facilities.json, many facilities listed as outpatient clinics in VA Site Tracking were classified as hospitals.

jalbertbowden commented 9 years ago

you can trace a file's history.....go back up the git tree until you find the facilities.json file and then you can link to it and point out the problems with it.....

WateredTrees commented 9 years ago

Tracing the deleted file named facilities.json is beyond my present knowledge of GitHub. Please keep in mind I did not post it to GitHub. However, I still have a copy of facilities.json.

Yesterday, in government email, I received the following request for help from John F. Quinn. "I responded to your post on GitHub about the facilities.json file. I appreciate the feedback but it would be more helpful if you could enumerate data quality issues then making a general comment that no one can address."

While I give many examples below, the data quality issues are too numerous to enumerate completely. What exactly is the purpose of the file? I believe the absence of a readme file has been noted. Why list cemeteries, regional benefits offices, vet centers, and mobile clinics in the same data file? Has anyone got a facility definition that includes all these?

In facilities.json, 2 facilities have the same value of 540 for station identifier. They have the names Clarksburg - Louis A. Johnson VA Medical Center and Rural Mobile Unit (540). In the 13Nov15 VAST Snapshot 2, the station identifier given for the Clarksburg VA Mobile Clinic is 540HK, not 540. 540 is the station identifier given for the Clarksburg - Louis A. Johnson VA Medical Center.

The 5 healthcare facilities listed below and the Fort Harrison Regional Benefits Office all have the same station identifier, 436.

Cut Bank VA Community Based Outpatient Clinic
Hamilton Primary Care Telehealth Outreach Clinic
Lewistown VA Community Based Outpatient Clinic
Plentywood Primary Care Telehealth Outreach Clinic
VA Montana Health Care System

Now, in facilities.json, the VA Montana Health Care System is classified as type Hospital. But in today's VAST Snapshot 2, station 436 is named Fort Harrison VA Medical Center. In the VAST Snapshot 2, no VA facility is named VA Montana Health Care System.

Why does the Clarksburg VA Mobile Clinic have latitude and longitude values?

facilities.json lists two facilities at 2500 Overlook Terrace, Madison, WI. Both have identical latitude and longitude values. The one named Madison Central Clinic, with station identifier 607AA, is not listed in today's VAST Snapshot 2, not even as Permanently Deactivated. Without further explanation, 607AA looks like an invalid value.

Here's another almost duplicate record. Two facilities, one named VA Health Care Center at Harlingen and the other VA Texas Valley Coastal Bend Health Care System, have the same station identifier value, 740. Both have the same street address but their latitude and longitude values differ slightly. And both are classified as Hospitals even though the VAST Annual Classification Crosswalk of Services... does not list the Texas Valley Coastal Bend VA Medical Center-Harlingen as a Hospital.
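Near-duplicates like the Madison and Harlingen records could be surfaced by grouping records on a normalized street address. The following is a minimal sketch, not the method anyone in this thread used: the field names `address`, `city`, and `state` are assumptions about the file's schema, and the sample records are invented for illustration.

```python
from collections import defaultdict

def records_sharing_address(records):
    """Group records by a normalized (address, city, state) key; keep groups of 2+."""
    groups = defaultdict(list)
    for rec in records:
        # Lowercase and trim so cosmetic differences do not hide a shared address.
        key = tuple(
            str(rec.get(field, "")).strip().lower()
            for field in ("address", "city", "state")
        )
        groups[key].append(rec)
    return {key: group for key, group in groups.items() if len(group) > 1}

# Invented sample data; field names are assumed, not taken from facilities.json.
sample = [
    {"name": "Facility A", "station_id": "000", "address": "1 Example St",
     "city": "Springfield", "state": "XX"},
    {"name": "Facility B", "station_id": "000AA", "address": "1 example st ",
     "city": "Springfield", "state": "XX"},
    {"name": "Facility C", "station_id": "001", "address": "2 Example St",
     "city": "Springfield", "state": "XX"},
]

for key, group in records_sharing_address(sample).items():
    print(key, [rec["name"] for rec in group])
```

Each reported group still needs a human decision: two records at one address may be a true duplicate or two distinct facilities that share a building.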

The address for the Fort Richardson National Cemetery in Alaska is given as "P.O. Box 5-498, Bldg 58-512, Davis Hwy". It's neither a mailing address nor a street address. From Google Maps, I got the street address "58-512 State Hwy E".

Finally, the JSON file format itself is troublesome. In the O'Reilly title Bad Data Handbook, edited by Q. Ethan McCallum, Tim McNamara tells us "JSON is Not the World's Best File Format". Would more data users use the data if they were published in the CSV file format?

Adeline Wilcox, Health System Specialist, Veterans Health Administration Central Office.

jalbertbowden commented 9 years ago

well i certainly understand your frustrations....i'm not privy to va's internals, so i have no answers to your questions regarding data quality.

regarding the facilities.json file....since they deleted it i can't show you how to track its history, but i'll show you an example from a repo that i'm working on. the following link is for the "hearings.html" file that is in the "compare-congress" repository. https://github.com/sunlightlabs/compare-congress/blob/master/hearings.html

above "hearings.html"'s code is a panel that has options on the right side: "Raw Blame History". click on history, which gives you a display showing all of the different "histories" that file has had in its lifetime in that repository. https://github.com/sunlightlabs/compare-congress/commits/master/hearings.html

in this example, there are two options displayed, as i've edited this document 2x since its creation. clicking on either instance's title takes you through to that file's history https://github.com/sunlightlabs/compare-congress/commit/e4ae33ad26021e067d311c434761c8a89142ad1b

the history is denoted by colors.....white background means nothing has changed, red background shows where files have been altered (literally meaning that existing code has been changed), and yellow background shows new file content that has been added, but doesn't conflict with any of the existing code.
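For a file that has since been deleted, the same kind of history can be read from a local clone with plain git, provided the commits that touched the file still exist in that clone. A minimal sketch, assuming git is installed and the file path is known; the helper names are hypothetical and only wrap standard `git log` and `git show` invocations.

```python
import subprocess

def history_command(path):
    """Build the git command listing every commit, on any branch, that touched `path`."""
    # --full-history keeps commits that simplification would otherwise drop,
    # which matters when the file no longer exists at HEAD.
    return ["git", "log", "--all", "--full-history", "--oneline", "--", path]

def restore_command(commit, path):
    """Build the git command printing `path` as it was in the parent of `commit`."""
    # "<commit>^:<path>" names the file in the commit just before the given one,
    # so passing the deleting commit yields the last surviving version.
    return ["git", "show", f"{commit}^:{path}"]

def run(cmd, repo_dir="."):
    """Run a git command inside a local clone and return its standard output."""
    result = subprocess.run(
        cmd, cwd=repo_dir, capture_output=True, text=True, check=True
    )
    return result.stdout

# Usage, from inside a clone (not executed here):
#   print(run(history_command("facilities.json")))
#   text = run(restore_command("<deleting-commit>", "facilities.json"))
```

If the log comes back empty, the commits were never in that clone or the history was rewritten, and a saved copy of the file is the only source.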

File formats vary; what is important is that they are open, defined, and well documented. "JSON is troublesome" is not a fact, it is a feeling, which is relative to each individual using it. Converting from one format to another is relatively painless; most formats have tools/libraries built specifically for this so users don't have to think about it, they can just implement it.
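As an illustration of how small that conversion can be, here is a minimal sketch using only Python's standard library. It assumes the JSON is an array of flat objects (no nesting); the sample records and their field names are invented, not taken from facilities.json.

```python
import csv
import io
import json

def json_records_to_csv(json_text):
    """Convert a JSON array of flat objects to CSV text."""
    records = json.loads(json_text)
    # Collect the union of keys in first-seen order so no field is dropped
    # when some records lack a key that others have.
    fieldnames = []
    for rec in records:
        for key in rec:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)  # missing keys are written as empty cells
    return buffer.getvalue()

# Invented sample data for illustration only.
sample = '[{"name": "Facility A", "type": "Clinic"}, {"name": "Facility B", "phone": "555-0100"}]'
csv_text = json_records_to_csv(sample)
print(csv_text)
```

Nested objects or arrays would need a flattening rule first, which is a design decision the converter cannot make on its own.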

More web developers will use JSON simply because they should be more familiar with JSON than CSV (offhand), more number cruncher people will use CSV as they are familiar with the format and typically more comfortable using Excel/Open Office than an IDE and/or writing code to accomplish what they want.

Both CSV and JSON are now W3C specifications and have linked data possibilities baked in, arguing which is better is a waste of time.

"Would more data users use the data if they published in the CSV file format?"

More users would use the data if it were published in HTML, as it would be accessible on the web and viewable in the browser, thus lowering more barriers to entry than the other two formats.

That doesn't necessarily mean more users will interact/utilize/build stuff with the data. But they would be using it more.

kinlane commented 8 years ago

I am into this conversation late, as I've been busy with an event. I assume I was brought in as I was the person who set up the VA Github account, and originally created the repository.

It is hard to push this conversation forward as the contents of this repository have been removed, which also removed any history--as history is tracked on files, not repos. I have a copy of this repo elsewhere if the group would like me to put it back.

In response to @WateredTrees: the open data was / is put up here to put it in a collaborative environment, in an effort to make it better. I was including several external groups, as well as internal VA groups, in conversation around making the data better. I'm guessing by your statement you aren't familiar with how Github works, because rather than stating something is "garbage" and listing out what is wrong, Github empowers you to "be the change you want to see". You do this by forking the repository, making changes, and submitting a pull request, and then the data is no longer "garbage" -- this is the Github way.

Regarding JSON format: this is not worth commenting on, as JSON is the preferred open data format across the industry, dominating over CSV and XML (no discussion). With the wealth of open data tooling for converting between formats, complaining about data being in one format is rendered moot. Again, I think you just are not equipped to collaborate in this realm.

If this group would like me to revive the facilities.json file, and push forward with cleaning up to produce a quality dataset, I am happy to shepherd. Thanks everyone.

QJF3 commented 8 years ago

The VA is posting a new version of facilities.json. I don't know which specific repository it will be located in, but it will be under the master Department of Veterans Affairs account.

A couple of words about the source of the facilities.json file. The Web API draws on the same facility information that is available through http://www.va.gov/landing2_locations.htm. The VA uses a distributed process for entering data into this portal, with each organizational component assigned to update its own content. Unless the Web application is really good at enforcing data quality rules, it is easy for end users to introduce some of the issues that @WateredTrees has identified. I agree it is an opportunity to improve data quality.

I also wanted to note some tags that I changed, listed below. The code snippet includes only the tag values I am renaming. One of the biggest changes is dropping the internal database primary key and instead only making the VA facility ID value known. My thought process is that anyone importing the data is likely to build their own primary key, and yes, I understand some of the challenges, such as managing content change when all the attributes change for an existing primary key. If there is enough feedback to add the primary key from the VA data store, I will do so, but I will likely name it something like "EntryID."

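# Excerpt: only the renamed columns are shown. The function wrapper and the
# default assignment are assumptions added so the excerpt runs on its own.
def rename_column(colName):
    tmpColName = colName  # columns not listed here keep their original name
    if colName == 'div_name':
        tmpColName = 'division'
    elif colName == 'fac_internet':
        tmpColName = 'url'
    elif colName == 'fac_name':
        tmpColName = 'name'
    elif colName == 'phone_number':
        tmpColName = 'phone'
    elif colName == 'stationid':
        tmpColName = 'facility_id'
    elif colName == 'zip':
        tmpColName = 'postal_code'
    elif colName == 'type_desc':
        tmpColName = 'type'
    elif colName == 'reg_name':
        tmpColName = 'region'
    return tmpColName

WateredTrees commented 8 years ago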

At WateredTrees/unofficial_VHA_data, I've posted a file named divipast.json holding Veterans Health Administration data on facilities. It may not yet be valid JSON but at least I don't find any duplicate identifiers in it. If there are no objections, metadata and more information will follow later this week.

QJF3 commented 8 years ago

Here's the URL with the new facilities.json. https://raw.githubusercontent.com/department-of-veterans-affairs/VHA-Facilities/master/VAFacilityLocation.json

kinlane commented 8 years ago

@QJF3 great work. Thank you for sharing new link. Your work to create a unique identifier is important, and you obviously have put some good thought into it.

I also recommend (I know it is hard at VA) trying to establish a list of trusted users who might be able to submit valuable pull requests fixing data and helping it evolve. While not everyone should be accepted, I would hope that over time a list of trusted, vetted, and verified Github accounts could be deemed worthy.

While I'm not in government anymore, I'm actively updating government data, through grant projects, and other funded work. I'm happy to help as I can, and have other folks who are looking to be data stewards.

I will keep an eye on the VAFacilityLocation.json data and see where I can help.

WateredTrees commented 8 years ago

In VAFacilityLocation.json, I find hundreds of duplicate values of facility_id. Is that good enough for government work @kinlane ?

QJF3 commented 8 years ago

@WateredTrees, we expect government work to set the standard. Since you are internal to the VA by your own admission in this dialogue thread, why not take the opportunity to make a positive difference and try to improve the data quality instead of just complaining about it?

jalbertbowden commented 8 years ago

on that note, big +1000 to every va employee on here putting up with shit from users. its beyond ridiculous

kinlane commented 8 years ago

Hi again @WateredTrees - The quick answer is no. The work is never done, and government should always be improving. If you work in the open data field, you know that the work is NEVER done, and you realize that WE have to be the change we want to see.

Remember -- I do not work in government, I'm on the outside trying to improve, and build on the great work they are doing at the VA.

You chose to say: "In VAFacilityLocation.json, I find hundreds of duplicate values of facility_id."

I choose to say: Hey VA team, there are 586 records that contain duplicate facility_id's, resulting in 157 duped ids. I have forked the repo, parsed the JSON, and isolated the duplicated IDs here - https://raw.githubusercontent.com/kinlane/VHA-Facilities/master/facility-dupes.json.

You should be able to use it to identify the duplicates. Maybe consider sharing more detail about the schema behind facility_id in the README; then I could have even just fixed the JSON and submitted a pull request, and your team could accept it back.
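A duplicate report of this kind (how many ids repeat, and how many records they cover) can be produced with a few lines of Python. This is a sketch under assumptions, not the script used to build facility-dupes.json: it assumes the file parses to a list of objects carrying a `facility_id` key, and the sample records are invented.

```python
from collections import Counter

def duplicate_id_summary(records, id_field="facility_id"):
    """Count ids that appear more than once and the records that carry them."""
    counts = Counter(rec.get(id_field) for rec in records)
    duplicated = {fid: n for fid, n in counts.items() if n > 1}
    return {
        "duplicated_ids": len(duplicated),               # distinct ids seen 2+ times
        "records_involved": sum(duplicated.values()),    # records carrying those ids
        "ids": duplicated,
    }

# Invented sample data for illustration only.
sample = [
    {"facility_id": "100", "name": "Facility A"},
    {"facility_id": "100", "name": "Facility B"},
    {"facility_id": "200", "name": "Facility C"},
    {"facility_id": "300", "name": "Facility D"},
    {"facility_id": "300", "name": "Facility E"},
    {"facility_id": "300", "name": "Facility F"},
]

summary = duplicate_id_summary(sample)
print(summary["records_involved"], "records share", summary["duplicated_ids"], "ids")
```

For the real file, the list would come from `json.load` on a downloaded copy; whether a repeated id is an error or an intended parent/child relationship depends on the schema, which is why the README detail matters.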

You see @WateredTrees, our concern is the same, but our approaches are radically different in how we move the ball forward. Your approach just polarizes things, and makes folks who are working very hard feel stupid, where my goal is to help them understand mistakes, and using the "Github way", move the conversation forward, even if just a little bit--iterating towards the future we want.

WateredTrees commented 8 years ago

Documentation for the file named divipast.json in https://github.com/WateredTrees/unofficial_VHA_data is in a file named divipast.pdf in the metadata for divipast.csv at https://opendata.socrata.com/Government/divipast/r9vj-z4nf. Goodbye.

QJF3 commented 8 years ago

@kinlane, we will be updating the README file on the repository with an explanation of how the data is managed within VA. The source system, through a Web portal, allows VA staff across the country to update content about a set of VA facilities. I appreciate the data quality work you have done, and we will be using it to redirect to the specific VA staff who have content ownership rights so the information can be corrected. Because this is a very distributed process, it will take some time to improve the source data quality. We can't import an improved JSON file directly because we need the local staff to approve the changes, and there's no workflow engine currently implemented to manage that. Again, thanks for the report with duplicate values of facility_id.

russellwood72 commented 1 year ago

Just want to take a moment to thank @kinlane for his awesome work when serving as the Chief API Consultant (Presidential Innovation Fellow) at the White House in 2013: https://smartbear.com/blog/kin-lane-speaks-on-lack-of-transparency-in-healthc/

During the government shutdown in 2013, he was asked to shut down APIs but refused, and instead he quit his Presidential Innovation Fellowship. Without him, the government would have completely collapsed if the APIs were shut down: https://www.youtube.com/watch?v=RgsBilpTeiU&t=2075s

Thank you Kin Lane!

jalbertbowden commented 1 year ago

i had the pleasure of meeting @kinlane during my brief stint in dc. did not know about his refusal but does not surprise me at all. thanks for everything kin!

russellwood72 commented 1 year ago

He's awesome.

The best part is he didn't apply for the Presidential Innovation Fellowship; instead, the White House asked him to join directly: https://youtu.be/RgsBilpTeiU?t=1593. He's probably the only Presidential Innovation Fellow who didn't apply for the fellowship.

which shouldn't be a surprise to anyone given that he used to run IT at SAP (as a VP) and worked for Google for a while: https://youtu.be/Lev-RN-gGOY?t=683

Our country is so fortunate to have had someone like Kin working in the government.

timlane1972 commented 1 year ago

@kinlane is a godsend. Before he joined PIF class of 2013, he already coached some presidential innovation fellows (class of 2012)

Sad that the Trump administration made a terrible mistake rejecting his job application to rejoin the government in 2018 as it makes no sense he failed the background check given he was the Chief API Consultant (C-level executive) at the White House in 2013.