focusconsulting / housing-insights

Bringing open data to affordable housing decision makers in Washington DC. A D3/JavaScript-based website to visualize data related to affordable housing in Washington DC. Data processing with Python.
http://housinginsights.org
MIT License

Determine which of the requested data we have #188

Closed by joawu 7 years ago

joawu commented 7 years ago

I'm cross-referencing the data team's data source tracking with the design team's data request tally to determine which of the requested data points we have.

I wanted to ask for clarification on a couple of data sources mentioned in the "first flat files for loading practice" tab on the data source tracking sheet:

Which of the census data tables have been uploaded?

May I ask what the status of real_property (AMI data) is? I checked the GitHub issue but am not sure if it's up to date.

joawu commented 7 years ago

I'm logging the status of requested data here.

NealHumphrey commented 7 years ago

@emkap01, this is where the data logging status currently stands. Can you coordinate with @joawu to get this caught up with current status so you'll have what you need for the ingestion team?

emkap01 commented 7 years ago

Will do. @joawu are you going to be at the meeting tomorrow night? If so we can sync up then; if not, I will compare the spreadsheet you started against the tables that are currently in the Postgres db, as well as against whatever other "raw" datasets we have in the S3 bucket. I can take a first pass at updating the spreadsheet based on those findings and then we can go from there.

thischrisoliver commented 7 years ago

I updated the Requested Data Status spreadsheet, prioritizing each request.

emkap01 commented 7 years ago

As discussed at the last meeting, the next challenge is to make sure we can account for all of the data that is of 'High' interest to end users. To do that, we will have to reconcile the topics they are most interested in with the data we have already modeled into our production db, the data we have collected but not modeled, and the data we still need to collect.

Mapping topics to tables is not as straightforward as it may sound, because any given table may be relevant to more than one topic. Likewise, any given raw dataset may end up being modeled into more than one table. At the Tuesday night meeting we discussed a plan to use three separate Google Sheets to do the mapping from topic > table > raw dataset (as a quick and dirty substitute for an actual relational db).

Thanks very much to all who added details into those Google Sheets. Since they are still pretty cumbersome, however, I took a shot at consolidating all the information we need into a single Google Sheet, which I've labeled Manifest Tracker. It's basically a copy of manifest.csv, with some additional columns added to map each dataset to one or more topics, and to indicate the progress made on getting each dataset processed into the production db.

The goal is to use this as the basis for a dashboard that will tell us how much of the data pipeline is good to go, and how many datasets are still outstanding.
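As a rough illustration (the column names here are hypothetical, since the Tracker's exact layout may still change), even a few lines of pandas against an export of the sheet could produce that kind of summary:

    # Sketch only: assumes the Manifest Tracker is exported to CSV with a
    # hypothetical "status" column such as "loaded", "collected", or "needed".
    import pandas as pd

    tracker = pd.read_csv("manifest_tracker.csv")
    print(tracker["status"].value_counts())
    print(f"{(tracker['status'] == 'loaded').mean():.0%} of datasets are in the production db")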

I propose that going forward we just update the Manifest Tracker file instead of the three Google Sheets we discussed at the last meeting - I think it will make things much easier for everyone.

In the meantime, if anyone on the Browse team has spare cycles to help get more data loaded into the production database, please use the dashboard to find out which ones still need processing, and work on those. If you do start working on one, please claim it in the "Notes" column of Manifest Tracker.

If anyone has any questions, comments or suggestions, feel free to hit me up directly, or to post them in the new Slack channel for Browse team. Thanks everybody!

xee5ch commented 7 years ago

Re the Google Doc seeding that Tableau dashboard, @emkap01 and @NealHumphrey: that is different from the one previously discussed. Shall we change the documentation to refer to that? I was working on the old one, but I had been making slow progress. This was the subject of closed PR #258.

emkap01 commented 7 years ago

@xee5ch I propose we discuss the potential change briefly at the next Tuesday meeting, and if we end up with consensus then we can change the documentation afterward. In the meantime, if you want to just continue using the old one, that's fine; I can easily update the new one manually with whatever changes are necessary (I'm not expecting a ton of new datasets to get loaded in the next week or so).

emkap01 commented 7 years ago

A related task for this issue would be for someone to write a script that can generate a CSV that describes all the raw datasets in the S3 bucket. The CSV needs to have three fields:

  1. filepath
  2. file_created_date
  3. filename

Once we have those records created, we can just append them to the Manifest Tracker sheet; that way we will have a complete catalog of all the data available to us. The goal is to account for everything in that S3 bucket.
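As a rough sketch (the bucket name and output path below are placeholders), the script could be a simple boto3 loop over the bucket listing:

    # Sketch only: bucket name and output filename are placeholders.
    import csv
    import boto3

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    with open("s3_raw_datasets.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filepath", "file_created_date", "filename"])
        for page in paginator.paginate(Bucket="housing-insights-raw"):
            for obj in page.get("Contents", []):
                key = obj["Key"]                      # full path within the bucket
                created = obj["LastModified"].date()  # S3 only exposes LastModified
                writer.writerow([key, created, key.rsplit("/", 1)[-1]])

(One caveat: S3 only tracks LastModified, so that is the closest thing to a created date we can get from the bucket itself.)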

xee5ch commented 7 years ago

@emkap01 Ok. I just wanted to keep the docs updated since we seem to onboard people every meeting, and would like the site to serve as a good starting point so people read it before or after the meeting. I still refer to the old Google doc. If you are ok with that, I can hold off.

Re the CSV of filepath, file_created_date, and filename, have you seen the XML generated from the landing page of the S3 bucket? The XML by default includes a <ListBucketResult> with a <Contents> element for each file recursively listed in all the directories of the bucket. Below is one sample I selected, as the whole thing can be quite long.

<Contents>
  <Key>raw/geographic_data/census_tract_mapping/DC_census_tract_crosswalk.csv</Key>
  <LastModified>2017-04-03T17:30:58.000Z</LastModified>
  <ETag>"1cec5ca12263744d864f8b5280bba2bd"</ETag>
  <Size>6538</Size>
  <StorageClass>STANDARD</StorageClass>
</Contents>

So is this for Tableau, i.e., adding to that Google Spreadsheet manually? We can make a CSV from that if you are OK with it, or take it a step further and maybe load the XML in directly somehow (most tools will handle XML for sure, but it is about your comfort level and what works best for the team).
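If it helps, a minimal sketch of that conversion using only the standard library might look like this (the bucket URL is a placeholder, it assumes the listing uses the standard S3 namespace, and the default listing caps out at 1,000 keys per page):

    # Sketch only: the bucket URL is a placeholder.
    import csv
    import urllib.request
    import xml.etree.ElementTree as ET

    NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}  # default S3 listing namespace

    with urllib.request.urlopen("https://example-bucket.s3.amazonaws.com/") as resp:
        root = ET.parse(resp).getroot()

    with open("s3_raw_datasets.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filepath", "file_created_date", "filename"])
        for contents in root.findall("s3:Contents", NS):
            key = contents.find("s3:Key", NS).text
            modified = contents.find("s3:LastModified", NS).text[:10]  # keep just YYYY-MM-DD
            writer.writerow([key, modified, key.rsplit("/", 1)[-1]])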

emkap01 commented 7 years ago

@xee5ch yes right now we'd be using the data about what's in the S3 bucket to update the Manifest Tracker Google Sheet that feeds the Tableau dashboard.

Eventually I think it would be ideal to find a way to build a pipeline dashboard/tracker that just updates automatically every time a new dataset gets loaded in (or an existing dataset gets updated). But for now if all we do is manually grab the relevant data from that XML, that should do the trick.

NealHumphrey commented 7 years ago

@emkap01 and @xee5ch Hey all, so I wasn't in the discussion last week and don't know the full story of how you're planning to use this Tableau dashboard and other tracking, so definitely don't want to slow you all down.

The big challenge on data tracking is keeping things as DRY as we can.

A few observations I'd make from what I see in this issue conversation:

xee5ch commented 7 years ago

Well, I was not sure what to do, because I knew one of your PRs was #105 with the Jekyll auto-population, and I said they were probably something we should tackle at the same time. Given the issues on Slack and #249, a lot of this stuff might need to change to better suit the API, depending on whether it gets stuffed directly into the DB, or at all.

NealHumphrey commented 7 years ago

@xee5ch and @emkap01 we can discuss tonight hopefully! Due to timing and how much code would need to change, I'm currently not expecting big shifts to the meta.json or manifest.csv, though manifest will need to change to handle API sources. My main point about S3 was just that something being in S3 doesn't mean it should be used, and something being missing doesn't mean it shouldn't be.

ptatian commented 7 years ago

The data tracking sheet says that HMDA data was ingested to get market sales prices. HMDA is more valuable for getting information about home buyers, such as income or race. That info is not available elsewhere.

You can see an example of how we used these data in one of our old Housing Monitor reports.

emkap01 commented 7 years ago

@ptatian I corrected the sheet, and will make sure the correction is reflected in the new data pipeline status dashboard I'm working on.

xee5ch commented 7 years ago

Re HMDA, @ptatian and @NealHumphrey, this came up at the meeting and I clarified. I verified with the latest test builds that the data does import cleanly, as @emkap01 generously took care of that. It is currently set to be ignored only because it is large, and iteratively reloading the test DB in our test environments can add minutes to quick testing as we test our Python harness. @emkap01 might want to mark this as ready with a note in his dashboard. Sorry for the confusion.

emkap01 commented 7 years ago

I updated the dash per @xee5ch's suggestion.