Closed ejhumphrey closed 7 years ago
@ffont + @alastair, I poked y'all as stakeholders in the uploader CLI ... this is equal parts RFC and PR, curious to hear what you think and if / how this will serve our collective needs.
couple things to note:
Reviewed 5 of 5 files at r1. Review status: all files reviewed at latest revision, all discussions resolved.
Comments from Reviewable
I'm not sure if I managed to do the review properly with this Reviewable thing. Why don't we simply use Github's code review tools?
Anyway, it all looks good to me so far. The only real concern I have is that if we upload large numbers of files, we will almost certainly need to pause/resume the process at some point, so it would be great to be able to do that somehow. One way is to ask the server whether a file has already been uploaded (before uploading), but that requires a request per resource. Another way is to parse the generated logs to figure out which resources have already been uploaded successfully. From what I understand, a different log file is created on each run of the tool (https://github.com/cosmir/open-mic/blob/b202844b207a6d367a5658f87717510569d1fab6/scripts/audio_uploader.py#L91), so we would need to parse all of them (still much faster than the extra request per resource).
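A sketch of what the log-parsing approach might look like, assuming (hypothetically) that each log line is a JSON record with "uri" and "status" fields; the actual log format may differ:

```python
import glob
import json

def successful_uploads(log_glob):
    """Collect URIs that have already uploaded successfully across all runs.

    Hypothetical sketch: assumes each log line is a JSON record such as
    {"uri": "audio:abc123", "status": "ok"}; adapt to the real format.
    """
    done = set()
    for path in glob.glob(log_glob):
        with open(path) as fh:
            for line in fh:
                try:
                    rec = json.loads(line)
                except ValueError:
                    continue  # skip any non-JSON lines
                if rec.get("status") == "ok":
                    done.add(rec["uri"])
    return done
```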
Also, why do we log everything at the end of execution? If we are to rely on logs, we should log after each job (or in small chunks of N jobs) so we don't lose information if something goes wrong.
What do you think?
thanks! I'll try to break my reply out into its constituent parts.
Reviewable vs github review: When we started the project, folks requested Reviewable for (what I assumed were) its advanced features over the built-in GH review. In the time since, GH has improved their native review machinery. And, for what it's worth, I am not a fan of Reviewable -- it might be worth a separate issue, but I'm happy to propose using the native GH review tools.
Pause / Resume: Yes, this is a good and necessary feature. One idea I'd considered is that an object's URI <kind>:<gid> is deterministic, seeded by a hash of the object. This functionality could be made more visible, allowing the pre-check you suggest. It would cost a request, as you point out, but I'm not too worried about that, because...
De-duplication: I think we might benefit from a distributed de-duplication strategy regardless of whatever local machinery we use for pausing / resuming uploads. This would (hopefully) help us accelerate the process (use multiple machines / instances), while helping avoid redundant ingestion. We might want to treat uploads as a request for ingestion, the process could act like "hey, can I (user) upload to this URI?", "yes, here's a token / no", "Alright, here's the token + data". But ... maybe this is overkill, if we can tightly control the upload process for v0.
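To make that handshake concrete, here's a toy sketch with the server side modeled as an in-memory object; the endpoints, token scheme, and storage are all assumptions, not anything implemented in this PR:

```python
class DedupServer:
    """Toy model of a server that hands out upload tokens per URI."""

    def __init__(self):
        self.ingested = set()   # URIs that already completed ingestion
        self.tokens = {}        # outstanding token -> URI

    def request_token(self, uri):
        """'Hey, can I upload to this URI?' -> a token, or None if a dupe."""
        if uri in self.ingested:
            return None
        token = "tok-{}".format(len(self.tokens))
        self.tokens[token] = uri
        return token

    def upload(self, token, data):
        """'Alright, here's the token + data.' Marks the URI as ingested."""
        uri = self.tokens.pop(token)
        self.ingested.add(uri)
        return uri
```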
Logging: Agreed, logging should happen incrementally rather than at the end -- this was my hacky "build-improve" solution. For the logging to be useful, we need a log format that can be easily parsed, and I wasn't sure whether Python's logging module would do the right thing under parallelization (i.e. does it use some kind of mutex to write to the log's buffer, so messages don't get mangled?). If we don't go the logging route, another solution I like is to use atexit, which allows callbacks to be run before the interpreter shuts down. I'd propose this as the easiest solution: it lets us use a structured data format (e.g. JSON) so logs are easy to parse, while guarding against data loss.
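A minimal sketch of the atexit idea (filenames and record fields are made up); note that it covers clean exits and unhandled exceptions, but not SIGKILL or a hard crash:

```python
import atexit
import json

RESULTS = []  # accumulated in memory during the run

def flush_results(path="upload_state.json"):
    """Dump everything recorded so far as structured JSON."""
    with open(path, "w") as fh:
        json.dump(RESULTS, fh, indent=2)

# Runs at interpreter shutdown, even after an uncaught exception.
atexit.register(flush_results)

def record(uri, status):
    RESULTS.append({"uri": uri, "status": status})
```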
okay, lot there 😄 ... thoughts?
I have no problem dropping reviewable for this project. I agree that its UI is terrible, but I still like it more than GH for its threading and partial status marking.
opened issue #41 about gh-review, brought this up to date with master (no conflicts). I'll take a look at failsafe logging after lunch.
update: my logging suggestion isn't going to work; I think we'll need to use the built-in logging functionality, writing out serialized JSON objects as text. ¯\_(ツ)_/¯
and in case anyone is following along at home, my threading fears were unfounded: https://docs.python.org/3/library/logging.html#thread-safety
@ffont I think I've mostly taken care of logging and pausing / resuming in d72f813 -- it's a two-step process, but I think it gets the job done. Needs some testing, but I'm not sure how best to do that just yet.
Reviewed 1 of 5 files at r1, 3 of 3 files at r2, 2 of 2 files at r3. Review status: all files reviewed at latest revision, 2 unresolved discussions.
Sorry for the mess with reviewable and github reviews :( I really don't know how to make the pending reviewable check pass...
Looks good. The idea is to use both commands chained, right? Something like:
python audio_uploader.py path/to/filelist.json http://example.com/upload/ --log-file out.log
--> pause
python filter_successful_uploads.py path/to/filelist.json out.log remaining_filelist.json
python audio_uploader.py path/to/remaining_filelist.json http://example.com/upload/ --log-file out.log
--> pause
python filter_successful_uploads.py path/to/filelist.json out.log remaining_filelist.json
python audio_uploader.py path/to/remaining_filelist.json http://example.com/upload/ --log-file out.log
...
Maybe we could simply call the filter_successful_uploads.py functions (parse_log and filter_successes) from audio_uploader to ease the workflow?
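For reference, the filtering step itself could look roughly like this (a hypothetical re-sketch; the actual parse_log / filter_successes signatures in filter_successful_uploads.py may differ):

```python
import json

def filter_remaining(filelist_path, log_path, out_path):
    """Drop filelist entries whose URI already logged a success, and
    write the remainder so the uploader can be re-run on what's left.

    Assumes (hypothetically) the filelist is a JSON array of objects
    with a "uri" field, and the log is JSON-lines with "uri"/"status".
    """
    done = set()
    with open(log_path) as fh:
        for line in fh:
            rec = json.loads(line)
            if rec.get("status") == "ok":
                done.add(rec["uri"])
    with open(filelist_path) as fh:
        entries = json.load(fh)
    remaining = [e for e in entries if e["uri"] not in done]
    with open(out_path, "w") as fh:
        json.dump(remaining, fh)
    return remaining
```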
stupid reviewable ... shipping this one through, will clean up any mess it causes
First pass at an audio uploading command line tool.
Next steps include user authentication (whitelisting / admin permissions) and a better grasp on the kinds of metadata we're getting from Jamendo.