Start migration of old scripts over into new ETL

saylorsd commented 8 years ago

[x] Fatal OD
[ ] 311
[x] Blotter
[x] Jail Census
[x] Kane Census
[x] PLI
[x] Asbestos Permits

bsmithgall commented 8 years ago

I'm going to take a look at this as well.

bsmithgall commented 8 years ago

@saylorsd please take a look at 13d88ed, and let me know what you think, especially of the change in the CSVExtractor. I explained the logic in the commit message.

bsmithgall commented 8 years ago

Also, what is the Kane Census?

saylorsd commented 8 years ago

:+1:
Though in the job tests, the jail and police blotter pipelines are registered as fatal od. Could end up being a problem if we start doing some integration testing?

Re: the Kane census It's a census for what I believe are County Health Department nursing homes. They were working out providing data to me on a timely matter. I haven't gotten any official word on it, but It looks like they're publishing it weekly now. I can push a sample to this branch if it's up to date with your work.

saylorsd commented 8 years ago

@bsmithgall I finally got some time to get to the police blotter pipeline. I was able to get it to pretty much work, and in doing so, realized some issues that are worth discussing. You can see the current changes here: a2ed761dcfd6e9dbc2e249583b1610b33cac1444 (note: this is a WIP)

the current fields for the police blotter are all capitalized which is might require a little bit of reworking for our schema .
I'll need to add some more flexibility to the loader class. This mainly comprises of implementing more features of the CKAN datastore api, standardizing the methods for said calls, and passing back more response information.
That damn BOM is showing up on the blotter file and causing issues where the first header item pulled from the data starts with those 3 chars.
We'll need to work out how to deal with some higher-level issues with pushing the data.
- how to track when a resource(in CKAN it's essentially one file) already exists within a package (dataset). Currently, when running the police blotter, it just creates a new file, because it is run with resource_id=''. We'll need to store the resource id's somewhere.
- Some situations will involve creating a new resource within the package after a certain period of time (e.g the Jail Census data) However, this seems like just a waste of space as each month the same data is added to two different files - so maybe it won't be an issue.

Feel free to take a look at the commit and create any issues if you see fit, I'll do the same.

bsmithgall commented 8 years ago

Nice work! If you take a look at the port-other-pipelines branch, there's a little bit in there that lets you pass an encoding kwarg, which fixes the BOM issue so long as you know the encoding ahead of time (which you should be able to know when you create a pipeline).

For fields, I think that you should be able to get them out of extractor.headers and/or extractor.schema_headers. As for the types, it might make sense to write a method that does some sort of dumb conversion based on the schema field types. I think that would be a lot better than having to do all the hard coding.

saylorsd commented 8 years ago

Nice! I can't see an issue where we won't know ahead of time unless the publisher changes their process without giving us a heads up.

Totally agree on the fields thing. I just copy and pasted these for the time being so I could see it work. I think a simple conversion like that is all we'll need since, once again, we should know all this stuff ahead of time, it can all be part of the hard-coded schema.

bsmithgall commented 8 years ago

:+1: I think that for resource_ids it should be something similar -- can't you create the resource ahead of time and get the resource id? Then it could live in the pipeline's instantiation. It doesn't seem like something that you need to configure at runtime.

Thoughts?

saylorsd commented 8 years ago

Hmm... I'll have to think about it. I'm leaning towards not, since manually creating a resource requires actually uploading a file. I'd also like to leave open the option to automatically generate resources.

I think for the cases we have now and in the foreseeable future, we can call CKAN's 'package_show' and see if a resource with the same human-readable resource_name exists instead checking against the resource_id. And then just grab the resource_id attached to the resource_name. This is something we'll know from the start and it shouldn't change.

I'll have to think more about if we'll need to automatically generate new resources entirely as I mentioned about with the Jail Census. It might be a bridge we won't need to cross.

bsmithgall commented 8 years ago

:+1:

Isn't there an existing CKANUpsertLoader method that checks for an existing resource?

saylorsd commented 8 years ago

I think there's one that pulls the names of ones in a package, but I'll make one that takes a name as an argument and returns an id.

bsmithgall commented 8 years ago

Oh great that's good too. You can dump that into the generic method run at the beginning of all CKANLoaders as well. Looking up by name is probably the best bet for consistency going forward.

saylorsd commented 8 years ago

So I forgot about resource_exists() which returns a boolean but does exactly the same work I'm looking to do. Is it bad practice to modify it to return the resource_id or None, but leave alone any implementations that evaluate its truth value? As far as I understand, they should still behave the same way.

bsmithgall commented 8 years ago

I would rewrite the implementations to handle a NoneType instead because None and False behave differently and explicit is better than implicit.

bsmithgall commented 8 years ago

Any progress on this @saylorsd ? Did you get it to upload successfully? I'm very excited.

saylorsd commented 8 years ago

Only a little -- been a bit busy. I added a get_resource_id() method and changed resource_exists() to call resource_id = get_resource_id and return resource_id is not None.

I'm currently adding some more tests for loaders and then I'll make a commit so you can see what's going on.

bsmithgall commented 8 years ago

:+1: :+1:

Is there anything else that can be worked on in the meantime?

saylorsd commented 8 years ago

Hmm.. I don't know for sure. I'll try and get this out real soon so we can go from there. Does it look like it will be easy to merge your changes in port-other-pipeline with this branch if all I've really changed is loaders and test_loader?

bsmithgall commented 8 years ago

Nah it'll probably be ok to just ditch that branch and move the changes into a new branch. Don't delete it yet though because we'll be able to recycle some of the pipelines.

saylorsd commented 8 years ago

Sounds good. I think I'm just gonna get started on these tests and commit, since I can see this turning into quite a few new tests. That way I can better show where I'm looking to go with the loader methods.

saylorsd commented 8 years ago

@bsmithgall Check out 1643a784110ebe5dfcb88ab9a55d57683cc7423b.

I renamed the CKANUpsertLoader Class to CKANDatastoreLoader to reflect the CKAN extension it uses as well as remove any ambiguity as we implement both insert and upsert methods within that class.

My todo in the TestCKANDatastoreLoader() class reflects my biggest question. How should we go about testing these two different ways of using the datastore loader? Does it make sense to keep the same tests, but run one using insert and the other using uspert?

bsmithgall commented 8 years ago

You should be able to write unit tests for each method, mocking different HTTP responses as appropriate. I think the last two things we need to do are:

[ ] do something smart with the schema to autogenerate fields for the loader
[ ] do some sort of too-soon protection against re-running too aggressively in the pipeline's pre_run method.

bsmithgall commented 8 years ago

Are the all-caps names requirements?

bsmithgall commented 8 years ago

For CKAN I mean.

bsmithgall commented 8 years ago

Also, are there only the three types in the police blotter (text, timestamp, numeric) or are there more?

bsmithgall commented 8 years ago

Hey @saylorsd check out a64d337, I've implemented a little method on the BaseSchema class that converts to the CKAN-style fields. If there are more fields needed/types, I figure we can just extend the FIELD_TO_CKAN_TYPE_MAPPING dictionary. Let me know if that works for the staging environment.

saylorsd commented 8 years ago

The all caps thing is only for the police blotter. I just don't want to change the way it already is since it might break someones app or site if they're currently pulling that data. From a glance, it seems to be a very popular dataset.

Those are the only types for the blotter.

Re: fields: Sounds great! I'll take a look when I get back from class.

bsmithgall commented 8 years ago

I've also added a bit to handle the duplicate file checking, along with a test. It's in 49c3ca6 if you want to take a look. Let's get this stuff merged soon!

saylorsd commented 8 years ago

I added the Asbestos Permit dataset to the list. Since we currently don't' have a ETL process set up for this set, this should be the highest priority. It will use the SFTP connector to pull and update it. There will also be a series of pdfs (copies of the actual permits), each relating to a record in the xls. I'll have the SFTP connector ready to merge today and hopefully I'll have something (#29) for the pdfs annd xls (#34) ready as well.

bsmithgall commented 8 years ago

Are you storing both the PDF (binary file transfer I suppose) and the Excel on the CKAN? Or do the PDFs live on some other resource?

saylorsd commented 8 years ago

They will be elsewhere, so that will also involve another loader.

bsmithgall commented 8 years ago

No I mean like can you construct regular URLs to them? Or do you have to load them yourselves?

saylorsd commented 8 years ago

Well the plan was to do both, in a sense. I was thinking of having a @post_load method in the schema to build the urls since the pdfs will be named consistently. The urls would point to a location where I store the pdfs, unless I can get the county to store them online, then they'd be to that location.

The plan isn't to load them into CKAN either way since they might muck up the view with 10's of permits.

bsmithgall commented 8 years ago

You should try to get the county to do it :stuck_out_tongue: but that sounds like the right way to do it :+1:

bsmithgall commented 8 years ago

Maybe an S3 loader with boto?

bsmithgall commented 8 years ago

For example: https://boto3.readthedocs.org/en/latest/guide/migrations3.html#storing-data

WPRDC / wprdc-etl

Start migration of old scripts over into new ETL #2