Asked @kaogilvie to review because they are also a python person.
I'm down to talk best practices about repo structure for incorporating Python code. @emigre459 and I discussed a bit, but I'm open to ideas. Would be nice to keep the repo as language-agnostic as possible so it can be useful to the most people, imho.
@emigre459 I'm working my way through your notebook -- thank you for putting the whole flow together! There are a few things that aren't totally portable (e.g. pickling the secure keys was giving me some errors, so I just added a few lines in `python/generate_drive_auth.py` as a reference for anyone cold-starting this code). Do you want feedback on them, or would you like me to just make the usability edits directly and then put in a PR?
@kaogilvie Good catch! I knew I was forgetting something (it was a big pain getting the Google Drive functionality working, so no surprise there). Please, if you've got a good way to automate that (I guess the Google-approved Quickstart would be best), go ahead and add it in. It may actually make sense to simply add it as a function in `make_dataset.py`, but a separate module is fine too, especially if it's devoted to authentication tasks beyond Google Drive.
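For reference, the Quickstart-style flow I have in mind is roughly the sketch below. The function name, file paths, and scope are illustrative placeholders, not necessarily what ends up in the repo:

```python
# Rough sketch of the token-caching flow from the Google Drive Python
# Quickstart, wrapped as a reusable function. Paths, the scope, and the
# function name are placeholders -- adjust to match the repo layout.
import os
import pickle

from google.auth.transport.requests import Request
from google_auth_oauthlib.flow import InstalledAppFlow

SCOPES = ['https://www.googleapis.com/auth/drive.readonly']


def generate_drive_auth(token_path='secure_keys/token.pickle',
                        credentials_path='secure_keys/credentials.json'):
    """Return valid Drive credentials, running the OAuth flow on a cold start."""
    creds = None
    # Reuse a previously pickled token if one exists
    if os.path.exists(token_path):
        with open(token_path, 'rb') as token:
            creds = pickle.load(token)
    # If there are no (valid) credentials, refresh or run the full flow
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                credentials_path, SCOPES)
            creds = flow.run_local_server(port=0)
        # Cache the token for the next run
        with open(token_path, 'wb') as token:
            pickle.dump(creds, token)
    return creds
```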
@emigre459 Totally, makes sense as a function. I hesitate to ask this question, but I've been looking and can't find the `make_dataset.py` file (I saw it was referenced in the Makefile) -- it's not created yet, right?
@kaogilvie Ah yes, that's confusing. The `python/README.md` will need to be updated. In a compromise with @kbmorales on repo structure, I moved `src/`, `docker/`, and `setup.py` to the repo root. Everything else (such as the notebooks and the Python-specific README) is in the `python/` folder. All of the other stuff (e.g. the Makefile) comes from the DataKind cookiecutter template, but I haven't touched it as of yet. `make_dataset.py` is in the root `src/` directory.

This raises an interesting question: do you think this hybrid approach is better, or is it better to put everything under the `python/` directory?
@emigre459 I feel like I'm missing the obvious, but I can't find it in the `src/` folder of the repo either. Regardless, I think this setup is great! The structure allows for sensible compartmentalization while also not doing crazy things with nested folders (which is always a very annoying Python problem in my experience). My opinion is to keep the hybrid approach.

Totally different topic -- are you accessing the Drive API using a service account or a client ID/secret? I got the service account to work, but only after I shared the folder specifically with my service account's fake email address, which you can grab from the console. I'll document all of this as we go so that anyone else setting this notebook up doesn't have to reshave the yak.
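For the eventual docs, the service-account route is roughly the sketch below (the key file path and folder ID are placeholders, and it only works after sharing the Drive folder with the service account's email):

```python
# Minimal sketch of the service-account route using google-auth + the
# Drive v3 API. Key path and folder ID are placeholders; share the Drive
# folder with the service account's client_email first.
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ['https://www.googleapis.com/auth/drive.readonly']

creds = service_account.Credentials.from_service_account_file(
    'secure_keys/service_account.json', scopes=SCOPES)
drive = build('drive', 'v3', credentials=creds)

# List files inside the shared folder to confirm access
results = drive.files().list(
    q="'YOUR_FOLDER_ID' in parents",
    fields='files(id, name)').execute()
for f in results.get('files', []):
    print(f['id'], f['name'])
```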
@kaogilvie Ugh. Something weird is going on with the git exclude file(s) in this shared repo, I think. For some reason, it's ignoring all of `src/data/` when I do a push, and I have to force-add it to each commit. Do you have any idea why that might be? When I run `git check-ignore -v src/data/` on my local machine, it returns `.git/info/exclude:2:data/ src/data/`. Ignoring `data/` seems appropriate so I don't accidentally upload big datasets, but I have no idea why `src/data/` would be ignored. When I look at `.git/info/exclude` from the repo root, I get

    .ipynb_checkpoints/
    data/
    secure_keys/

so it still seems like `src/data/` should be safe...

I assume this repo is the root cause, since I've used the DataKind cookiecutter template that generates that directory before and never had issues with git tracking it. I'll keep looking into it, but please let me know if you have any ideas! I'll get the code in question pushed as soon as I correct whatever this ignoring issue is...
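In case it helps anyone else poke at this, a rough diagnostic sketch (illustrative only, not part of the pipeline) that asks git which ignore rule is catching each path:

```python
# Throwaway diagnostic: ask git which ignore rule (if any) matches a path.
import subprocess


def which_rule_ignores(path):
    """Return the matching ignore rule for `path`, or None if it isn't ignored."""
    result = subprocess.run(
        ['git', 'check-ignore', '-v', path],
        capture_output=True, text=True)
    # Exit code 0 means a rule matched; output is "source:line:pattern<TAB>path"
    return result.stdout.strip() if result.returncode == 0 else None


for p in ['data/', 'src/data/', 'src/data/make_dataset.py']:
    print(p, '->', which_rule_ignores(p))
```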
Holding off on merging this until @kaogilvie is able to verify full functionality on his system (now that `src/data/` is actually included in the repo commits!).
Thanks, y'all. If both of y'all agree it's ready for main, go ahead and merge. I'd love for one or both of y'all to talk about the Python pipeline at the call Tuesday if that's cool.
Includes some mods to the README to give instructions on how to use the Python code and the Docker image.