allenai / deep_qa

A deep NLP library, based on Keras / TensorFlow, focused on question answering (but useful for other NLP too)
Apache License 2.0

Adding ability to train models using docker on Aristo AWS GPU machines #286

Closed matt-gardner closed 7 years ago

matt-gardner commented 7 years ago

You need to have bintray credentials set up, have Docker working, have cloned allenai/aristo, be on the VPN, and whatnot, but if you've done all of these things, you can use ./scripts/run_on_aws.sh [name] [param_file] instead of python scripts/run_model.py [param_file] to run the training routine on a GPU machine in EC2, without tying up the machine you're working on. It's pretty amazing.

Note that all paths in the parameter file, both for loading data and for saving models, need to be under /net/efs/aristo/dlfa/ for this to work correctly. We're transitioning from the original S2 EFS drive (mounted as /efs) to the EFS drive provided by techops (mounted as /net/efs). This is what bin/aristo supports. I copied all of our data over from the S2 drive to the new place, so things should just work if you change the paths from /efs/data/dlfa/ to /net/efs/aristo/dlfa/. Note that you'll probably also need to change the path to aristo in the run_on_aws.sh script to where you have the aristo repo cloned.
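The path update can be done mechanically. A minimal sketch, assuming the parameter file is a plain-text file containing the old /efs/data/dlfa/ paths (the filename my_params.json is hypothetical):

```shell
# Rewrite old S2 EFS paths to the new techops EFS mount, in a copy
# of the parameter file so the original is left untouched.
cp my_params.json my_params.aws.json
sed -i 's|/efs/data/dlfa/|/net/efs/aristo/dlfa/|g' my_params.aws.json
```

Using | as the sed delimiter avoids having to escape the slashes in the paths.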

FYI @ColinArenz @pdasigi @nelson-liu @DeNeutoy @bsharpataz

DeNeutoy commented 7 years ago

Oh, one thing @matt-gardner: I'm not sure how file permissions work with the mounted drive, but I've found I have to use a bash script as the entrypoint, which modifies the file permissions first. I'm guessing you didn't?
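An entrypoint along these lines would do what's described above; this is a sketch, not the actual script from the thread, and the OUTPUT_DIR variable and default path are assumptions:

```shell
#!/bin/bash
# Hypothetical Docker entrypoint: make the mounted EFS output
# directory group writable before handing off to the real command.
OUTPUT_DIR=${OUTPUT_DIR:-/net/efs/aristo/dlfa}
chmod -R g+w "$OUTPUT_DIR" 2>/dev/null || true
# Replace this shell with whatever command the container was given.
exec "$@"
```

With exec "$@", the container's CMD (e.g. the training command) becomes PID 1 and receives signals directly, which is the usual pattern for permission-fixing entrypoints.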

matt-gardner commented 7 years ago

No, I didn't need to modify permissions. Generally, I've tried to make everything under /net/efs/aristo/dlfa/ group writable, which should avoid the permissions issues. I think there's some flag you can set to make subdirectories inherit the group-writable setup, but I haven't bothered with that yet...
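The flag alluded to is presumably the setgid bit on directories: it makes new files and subdirectories inherit the directory's group (the group-write bit on new files still depends on each user's umask). A sketch of the setup, using the shared path from the thread:

```shell
# Make the shared tree group writable, and set the setgid bit on
# every directory so new files and subdirectories inherit the
# directory's group rather than the creating user's primary group.
DLFA_DIR=/net/efs/aristo/dlfa
chmod -R g+w "$DLFA_DIR"
find "$DLFA_DIR" -type d -exec chmod g+s {} +
```

For true inheritance of the write bit itself, everyone writing to the tree would also need a group-friendly umask (e.g. umask 002), or the filesystem would need default ACLs via setfacl.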