benjaminarjun commented 6 years ago

74

devinaconley commented 6 years ago

Thanks @benarthur91 - this scraper looks good.

The output format we have been trying to conform to is a table with these columns:

year, month, day, metric, count

So far we have usually been dumping into a CSV for testing, and will then push directly to a Postgres database once the scraper is integrated.
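A minimal sketch of that output format, using only the standard library. The column order comes from the comment above; the function name `write_rows` and the sample row (metric name and count) are made up for illustration.

```python
import csv
import io

# Column order taken from the comment above.
COLUMNS = ["year", "month", "day", "metric", "count"]


def write_rows(rows, handle):
    """Write an iterable of dicts keyed by COLUMNS to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Hypothetical sample row: the metric name and count are placeholders.
sample = [
    {"year": 2017, "month": 10, "day": 2, "metric": "example_metric", "count": 123},
]

# An in-memory buffer stands in for a real CSV file here.
buffer = io.StringIO()
write_rows(sample, buffer)
csv_text = buffer.getvalue()
```

Swapping the `StringIO` buffer for `open(path, "w", newline="")` would write the same rows to disk.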

Do you want to try building the table parser for those raw text files as well?

benjaminarjun commented 6 years ago

Definitely! I'll get started.

benjaminarjun commented 6 years ago

I'm moving into the testing phase with this parser. I expect to update the pull request sometime over the weekend.

I've noticed the file contains a variety of metrics, and adequately describing them in the standard schema may be difficult. For example, a label that uses all available descriptions might look something like:

Deposits and Withdrawals of Operating Cash::Deposits::Federal Reserve Account: Deposits by States: Supplemental Security Income::This month to date

It may be helpful to either:

Define the database schema similarly to the file structure, so the actual labels in the data don't need to be fully qualified.
Home in on the metrics that are actually meaningful to this project, so they don't need to be described as precisely to distinguish them from the others.

I'll continue to build the parser to write fully qualified names in the CSV, but let me know if you have any input on this.
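As a rough illustration of what "fully qualified" means here, the sketch below joins the levels of the file's hierarchy into one label. The `::` separator mirrors the example label above; the helper name `qualify` and the exact set of levels are assumptions, not the parser's actual code.

```python
# Separator assumed from the example label above.
SEPARATOR = "::"


def qualify(*parts):
    """Join the non-empty hierarchy levels into one fully qualified metric name."""
    cleaned = [p.strip() for p in parts if p and p.strip()]
    return SEPARATOR.join(cleaned)


# Hypothetical levels: table title, section, account, time period.
name = qualify(
    "Deposits and Withdrawals of Operating Cash",
    "Deposits",
    "Federal Reserve Account",
    "This month to date",
)
```

Empty levels are skipped, so a row that lacks a sub-heading still gets a well-formed name.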

devinaconley commented 6 years ago

I think we should avoid making any changes to the database schema for this specific data source. We want to keep things as general as possible.

On the "this month to date" specifier, we actually only need to scrape the raw daily value. Something like "this month to date" would be calculated on the graphing and visualization side.
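To show why the raw daily value is enough, here is a sketch of how a month-to-date figure could be derived downstream as a running sum. The function name `month_to_date` and the sample rows are hypothetical; the tuple layout follows the year, month, day, metric, count columns.

```python
from collections import defaultdict


def month_to_date(rows):
    """Return rows with count replaced by the running total for that month.

    rows: iterable of (year, month, day, metric, count) tuples.
    """
    totals = defaultdict(int)
    result = []
    # Sorting the tuples orders them by year, month, day, then metric.
    for year, month, day, metric, count in sorted(rows):
        key = (year, month, metric)
        totals[key] += count
        result.append((year, month, day, metric, totals[key]))
    return result


# Hypothetical daily values for a single made-up metric.
daily = [
    (2017, 10, 3, "example_metric", 5),
    (2017, 10, 2, "example_metric", 10),
    (2017, 11, 1, "example_metric", 7),
]
running = month_to_date(daily)
```

The total restarts whenever the month changes, and each metric is accumulated separately.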

Also think about what makes sense to keep as a single table. For example, in this case it might make sense to separate deposits and withdrawals into their own tables.

benjaminarjun commented 6 years ago

Currently the parser grabs all the data for Today and writes it into a single file (one output file per source file). If it's possible to populate multiple DB tables per source file, I can look to separate the output into multiple files.

devinaconley commented 6 years ago

I think that's a good idea. Given how many different metrics are in each file, we will likely want a configurable way to filter specific metrics and map them to a specific database table.

Also, when we push a certain set of metrics to the database, data from different days will all be in the same table.

This is looking good!

benjaminarjun commented 6 years ago

I've added a config file for the parser. This allows the caller to specify the names of the files that should be output, and which data fields should go in which file. The metric name is still the "fully qualified" attribute name as mentioned above. Quick guide to configuration:

A collection of file targets is specified. Each file target has a default regex pattern; attributes whose fully qualified names match the default pattern will go to that file. A default set of file targets is already specified, but targets can be modified or deleted, and new ones added.
The caller can also specify mapping overrides. Each override consists of a pattern for an attribute name and a file ID; attributes matching the pattern will go to the specified file. An override takes precedence over a match on any file's default attribute pattern.
If any attributes are not matched by the file or override config, the parser will raise an exception and exit.
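The routing rules above could be sketched roughly as follows. This is not the parser's actual config format: the dictionary layout, the file IDs, the patterns, the exception name, and the use of `re.search` are all assumptions made for illustration.

```python
import re

# Hypothetical file targets: file ID -> default regex for attribute names.
FILE_TARGETS = {
    "deposits": r"^Deposits and Withdrawals of Operating Cash::Deposits::",
    "withdrawals": r"^Deposits and Withdrawals of Operating Cash::Withdrawals::",
}

# Hypothetical overrides: (pattern, file ID) pairs checked before the defaults.
OVERRIDES = [
    (r"Supplemental Security Income", "ssi"),
]


class UnmatchedAttributeError(Exception):
    """Raised when an attribute matches no override and no file target."""


def route(name, targets=FILE_TARGETS, overrides=OVERRIDES):
    """Return the file ID that a fully qualified attribute name belongs to."""
    # Overrides take precedence over the per-file default patterns.
    for pattern, file_id in overrides:
        if re.search(pattern, name):
            return file_id
    for file_id, pattern in targets.items():
        if re.search(pattern, name):
            return file_id
    raise UnmatchedAttributeError(name)
```

With this setup, an attribute under the deposits heading goes to `deposits` unless its name mentions Supplemental Security Income, in which case the override sends it to `ssi`; a name that matches nothing raises the exception.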

One thing I noticed is that some of the values in the source file are asterisks rather than integers, and at the bottom is a note explaining the asterisk: "Statutory debt limit is temporarily suspended through December 8, 2017". How should the parser handle these?

Data4Democracy / usa-dashboard

Add daily Federal Tax Revenue scraper and resulting data files #79