Closed ejhumphrey closed 7 years ago
@ffont + @alastair, I poked y'all as stakeholders in the uploader CLI ... this is equal parts RFC and PR, curious to hear what you think and if / how this will serve our collective needs.
couple things to note:
Reviewed 5 of 5 files at r1. Review status: all files reviewed at latest revision, all discussions resolved.
Comments from Reviewable
I'm not sure if I managed to do the review properly with this Reviewable thing. Why don't we simply use Github's code review tools?
Anyway, it all looks good to me so far. The only real concern I have is that if we upload large numbers of files, we will almost certainly need to pause/resume the process at some point, so it would be great to be able to do that somehow. One way is to ask the server whether a file has already been uploaded (before uploading), but that requires a request per resource. Another way is to parse the generated logs to figure out which resources have already been uploaded successfully. From what I understand, a different log file is created on each run of the tool (https://github.com/cosmir/open-mic/blob/b202844b207a6d367a5658f87717510569d1fab6/scripts/audio_uploader.py#L91), so we would need to parse all of them (still much faster than the extra request per resource).
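A sketch of what the log-parsing approach might look like, assuming (hypothetically) that each log line is a JSON record with "uri" and "status" fields; the actual log format may differ:

```python
import glob
import json

def successful_uploads(log_glob):
    """Collect URIs that have already uploaded successfully across all runs.

    Hypothetical sketch: assumes each log line is a JSON record such as
    {"uri": "audio:abc123", "status": "ok"}; adapt to the real format.
    """
    done = set()
    for path in glob.glob(log_glob):
        with open(path) as fh:
            for line in fh:
                try:
                    rec = json.loads(line)
                except ValueError:
                    continue  # skip any non-JSON lines
                if rec.get("status") == "ok":
                    done.add(rec["uri"])
    return done
```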
Also, why do we log everything at the end of execution? If we are to rely on logs, we should log after each job (or in small chunks of N jobs) so we don't lose information if something goes wrong.
What do you think?
thanks! I'll try to break my reply out into its constituent parts.
Reviewable vs github review: When we started the project, folks requested Reviewable for (what I assumed were) its advanced features over the built-in GH review. In the time since, GH has improved their native review machinery. And, for what it's worth, I am not a fan of Reviewable -- it might be worth a separate issue, but I'm happy to propose using the native GH review tools.
Pause / Resume: Yes, this is a good and necessary feature. One idea I'd considered is that an object's URI <kind>:<gid> is deterministic, seeded by a hash of the object. This functionality could be made more visible, allowing the pre-check you suggest. It would cost a request, as you point out, but I'm not too worried about that, because...
De-duplication: I think we might benefit from a distributed de-duplication strategy regardless of whatever local machinery we use for pausing / resuming uploads. This would (hopefully) help us accelerate the process (use multiple machines / instances), while helping avoid redundant ingestion. We might want to treat uploads as a request for ingestion, the process could act like "hey, can I (user) upload to this URI?", "yes, here's a token / no", "Alright, here's the token + data". But ... maybe this is overkill, if we can tightly control the upload process for v0.
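To make that handshake concrete, here's a toy sketch with the server side modeled as an in-memory object; the endpoints, token scheme, and storage are all assumptions, not anything implemented in this PR:

```python
class DedupServer:
    """Toy model of a server that hands out upload tokens per URI."""

    def __init__(self):
        self.ingested = set()   # URIs that already completed ingestion
        self.tokens = {}        # outstanding token -> URI

    def request_token(self, uri):
        """'Hey, can I upload to this URI?' -> a token, or None if a dupe."""
        if uri in self.ingested:
            return None
        token = "tok-{}".format(len(self.tokens))
        self.tokens[token] = uri
        return token

    def upload(self, token, data):
        """'Alright, here's the token + data.' Marks the URI as ingested."""
        uri = self.tokens.pop(token)
        self.ingested.add(uri)
        return uri
```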
Logging: Agreed, logging should happen incrementally rather than at the end -- this was my hacky "build-improve" solution. For the logging to be useful, we need a log format that can be easily parsed, and I wasn't sure whether Python's logging module would do the right thing under parallelization (i.e. does it use some kind of mutex to write to the log's buffer, so messages don't get mangled?). If we don't go the logging route, another solution I like is to use atexit, which allows callbacks to be run before the interpreter shuts down. I'd propose this as the easiest solution: it lets us use a structured data format (e.g. JSON) so logs are easy to parse, while guarding against data loss.
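A minimal sketch of the atexit idea (filenames and record fields are made up); note that it covers clean exits and unhandled exceptions, but not SIGKILL or a hard crash:

```python
import atexit
import json

RESULTS = []  # accumulated in memory during the run

def flush_results(path="upload_state.json"):
    """Dump everything recorded so far as structured JSON."""
    with open(path, "w") as fh:
        json.dump(RESULTS, fh, indent=2)

# Runs at interpreter shutdown, even after an uncaught exception.
atexit.register(flush_results)

def record(uri, status):
    RESULTS.append({"uri": uri, "status": status})
```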
okay, lot there 😄 ... thoughts?
I have no problem dropping reviewable for this project. I agree that its UI is terrible, but I still like it more than GH for its threading and partial status marking.
opened issue #41 about gh-review, brought this up to date with master (no conflicts). I'll take a look at failsafe logging after lunch.
update: my logging suggestion isn't going to work; I think we'll need to use the built-in logging functionality, writing out serialized JSON objects as text. ¯\_(ツ)_/¯
and in case anyone is following along at home, my threading fears were unfounded: https://docs.python.org/3/library/logging.html#thread-safety
@ffont I think I've mostly taken care of logging and pausing / resuming in d72f813 -- it's a two-step process, but I think it gets the job done. Needs some testing, but I'm not sure how best to do that just yet.
Reviewed 1 of 5 files at r1, 3 of 3 files at r2, 2 of 2 files at r3. Review status: all files reviewed at latest revision, 2 unresolved discussions.
Sorry for the mess with reviewable and github reviews :( I really don't know how to make the pending reviewable check pass...
Looks good. The idea is to use both commands chained, right? Something like:
python audio_uploader.py path/to/filelist.json http://example.com/upload/ --log-file out.log
--> pause
python filter_successful_uploads.py path/to/filelist.json out.log remaining_filelist.json
python audio_uploader.py path/to/remaining_filelist.json http://example.com/upload/ --log-file out.log
--> pause
python filter_successful_uploads.py path/to/filelist.json out.log remaining_filelist.json
python audio_uploader.py path/to/remaining_filelist.json http://example.com/upload/ --log-file out.log
...
Maybe we could simply call the filter_successful_uploads.py functions (parse_log and filter_successes) from audio_uploader to ease the workflow?
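For reference, the filtering step itself could look roughly like this (a hypothetical re-sketch; the actual parse_log / filter_successes signatures in filter_successful_uploads.py may differ):

```python
import json

def filter_remaining(filelist_path, log_path, out_path):
    """Drop filelist entries whose URI already logged a success, and
    write the remainder so the uploader can be re-run on what's left.

    Assumes (hypothetically) the filelist is a JSON array of objects
    with a "uri" field, and the log is JSON-lines with "uri"/"status".
    """
    done = set()
    with open(log_path) as fh:
        for line in fh:
            rec = json.loads(line)
            if rec.get("status") == "ok":
                done.add(rec["uri"])
    with open(filelist_path) as fh:
        entries = json.load(fh)
    remaining = [e for e in entries if e["uri"] not in done]
    with open(out_path, "w") as fh:
        json.dump(remaining, fh)
    return remaining
```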
stupid reviewable ... shipping this one through, will clean up any mess it causes
First pass at an audio uploading command line tool.
Next steps include user authentication (whitelisting / admin permissions) and a better grasp on the kinds of metadata we're getting from Jamendo.