WPRDC / wprdc-etl

MIT License
8 stars 3 forks source link

added SFTPConnector #42

Closed saylorsd closed 8 years ago

saylorsd commented 8 years ago

Fixes #17

@bsmithgall Can you take a look particularly at what I did in connector.checksum_contents()?

bsmithgall commented 8 years ago

You have to add paramiko to the requirements. I would also add it to the setup.py requirements.

bsmithgall commented 8 years ago

I think you'll also have to modify the settings.example.json to get the tests to pass on travis.

bsmithgall commented 8 years ago

I'm going to write some additional unittests as well.

saylorsd commented 8 years ago

About the settings: I'd like to have the different connections live on the same level as the CKAN servers in the settings e.g:

{
  "staging": {
    "api_key": "your ckan api key",
    "root_url": "example: https://stage.wprdc.org/",
    "organizations": {
      "Name": "ckan-generated hash"
    },
    "statusdb": ":memory:"
  },
  "county-ftp": {
    "stuff": "things"
  }
}

But pipeline.config is from settings[server] and I don't wan't to rework the config setup without asking your first. Would it make sense to refactor our .config to something like .ckan_config, and then have a general .config for connection names, status databse names, and possible other things that may end up in there?

bsmithgall commented 8 years ago

I feel like a pipeline's runtime should only be able to access one level of configuration, but perhaps that's not right. Maybe the right thing to do is to have the whole configuration object floating around and then different pieces of the pipeline can access different pieces of configuration.

bsmithgall commented 8 years ago

At the very least, doing that sort of thing opens that file as a side effect whenever something from that package is imported, so it probably shouldn't happen.

saylorsd commented 8 years ago

Would it make sense to have a separate file for global settings that isn't written to by pipelines? Currently, the settings.json file we have provides CKAN-instance-specific information. Things like SFTP settings, RESTful API settings (in the future), etc wouldn't be specific to a pipeline or downstream CKAN instance, so I feel like there would be a lot of copying and pasting these chunks of info into each CKAN set-up.

bsmithgall commented 8 years ago

I don't think the pipelines should be writing into the configuration. In my opinion for this case configuration should absolutely be read only, and if we need information about state it should be kept in place in some other place.

bsmithgall commented 8 years ago

In either case, I think a larger configuration workaround might be in order. I think that:

  1. Reading files in pipelines is not a good idea, because if you ever import anything from that file, the read will happen as a side effect which is not good. If you have to do it, see if you can offload the reads into some method that you could call inside Pipeline.run(). That way you can safely import things without side effects.
  2. Configuration should be easy to change but have strong conventions, and setting it up shouldn't block development, so it should be easy to get started.
saylorsd commented 8 years ago

Ok. So if we were to have CKAN settings and global settings (in separate files or however), we could just read both as part of Pipeline.set_config_from_file when it's instantiated without issue, right?

bsmithgall commented 8 years ago

Yes, though again I think that all configuration that might be needed should be in one file, and if it isn't present, an InvalidConfigException should be raised. So maybe the file looks like this:

{
    "staging": {
        "general": {
            "statusdb": ":memory:"
        },
        "connector": {
            "username": "username",
            "password": "password"
        },
        "loader": {
            "ckan_api_key": "my-api-key",
            "ckan_root_url": "my-root-url",
            "ckan_organizations": {
                "Name": "some-hash"
            }           
        }
    }
}

If we do it that way, then each piece of the pipeline could get isolated configuration.

saylorsd commented 8 years ago

I think I follow, but what exactly do you mean by "...each piece of the pipeline could get isolated configuration."

bsmithgall commented 8 years ago

You could also nest things n-levels deep and pass in the correct keypath on run.

So the Pipeline would get instantiated with the whole block, and then you would pass a configuration path to the Loader.

For example:

Pipeline('name', 'Print Name', '/path/to/config.json', 'settingsString') \
    .connect(FileConnector, config='sftp.county_sftp')
    ...
    .load(CKANDatastoreLoader, config='ckan')

Then there would be some parse_config method on pipeline that would take a dot-separated string and look up that piece of the overall configuration dictionary and pass it to the appropriate object. That way the loader won't have access to the connector config, etc.

bsmithgall commented 8 years ago

I opened a new issue for this (#43) and will push up a sample implementation of what I'm talking about and you can let me know what you think

saylorsd commented 8 years ago

@bsmithgall Should I wait to merge this? Do you plan on implementing the config change in this branch?

bsmithgall commented 8 years ago

New branch, but we might want to merge that branch into this branch so we can get the fatal_od running agian

saylorsd commented 8 years ago

Sounds good!

saylorsd commented 8 years ago

@bsmithgall while testing on the staging server I remembered we needed to change CKANLoader.__init__() to check for a current instance of the resource name on CKAN and then use its resource id for the loading.

Please take a look, and then I'll merge.

bsmithgall commented 8 years ago

+1 LGTM