carpentries / undergrad-education-conversations

Conversations about teaching computational skills to undergraduates

Computation Resources for High-Memory Operations #14

Open wrightaprilm opened 6 years ago

wrightaprilm commented 6 years ago

Hi all-

For those of you teaching computationally resource-intensive courses (like genomics [high mem], or phylogenetics [long run times]), what types of resources are you using? If you're using resources external to your institution, are you willing to share successful applications?

Edit: see also this Twitter thread, which has good insights: https://twitter.com/WrightingApril/status/920025444161814528

naupaka commented 6 years ago

I am teaching undergrad and master's-level bioinformatics courses this semester. For compute- or memory-intensive homework projects, I am having the students log into a rack-mount System76 Linux server I negotiated for as part of my startup. For each weekly assignment, I have a script that generates a constrained (via cgroups) Docker container for each student to log into (different port for each student), and then they share the large raw datasets via a mounted read-only volume in the container. I ran the numbers at one point and calculated that even for a medium-memory instance to be kept running for weeks to months on AWS EC2 for each of 30+ students, it was cheaper to just buy the hardware.
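For anyone wanting to try the per-student container setup described above, here is a minimal sketch in Python of how such a script might plan its `docker run` commands. It is not the actual script used in the course. The image name, data directory, port range, resource limits, and student names are all hypothetical placeholders, and it assumes the image runs an SSH server on port 22. It only builds and prints the commands, so nothing is started until you run them yourself.

```python
import shlex


def build_run_command(student, host_port,
                      image="course/assignment:week1",  # hypothetical image name
                      data_dir="/srv/course/raw",       # hypothetical host path to raw data
                      memory="4g", cpus="2"):           # hypothetical per-student limits
    """Return the argument list for one student's resource-limited container."""
    return [
        "docker", "run", "--detach",
        "--name", f"hw-{student}",
        "--memory", memory,                  # memory cap, enforced by Docker via cgroups
        "--cpus", cpus,                      # CPU cap, enforced by Docker via cgroups
        "--publish", f"{host_port}:22",      # one host port per student (assumes sshd in the image)
        "--volume", f"{data_dir}:/data:ro",  # shared datasets mounted read-only
        image,
    ]


def plan_containers(students, base_port=2200):
    """Map each student to a docker command, assigning consecutive host ports."""
    ordered = sorted(students)
    return {
        student: build_run_command(student, base_port + offset)
        for offset, student in enumerate(ordered, start=1)
    }


if __name__ == "__main__":
    for student, command in plan_containers(["alice", "bob"]).items():
        print(shlex.join(command))
```

Each printed line can be pasted into a shell on the server, or passed to `subprocess.run` once you are happy with the limits. Sorting the roster keeps the port assignments stable from week to week as long as the roster itself does not change.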

The other option is applying to XSEDE for JetStream compute time. I applied to JetStream under their startup program (I think it was a paragraph to apply?) and got 50,000 hours of CPU time to work with (which unfortunately just expired). There are still some glitches with the Atmosphere interface, but you can get by without doing too much fiddling by using a Docker-based approach that you prototype elsewhere. I know JetStream is what @cboettig is using for his course at Cal.

Also not specific to high-memory or high-compute, but I've been really pleased so far with using GitHub and PRs for class code submission and review. It's a bit of wrangling on the back end to get it set up, but with GitHub's teachers_pet tool or their Classroom interface, plus free Travis CI, it's tractable to teach 30+ students and give them useful feedback on their code, even without a TA.

wrightaprilm commented 6 years ago

Thanks for this, @naupaka. It's really great to hear what other people are doing. There's a huge expansion of data-intensive computational biology at non-R1 institutions, where infrastructure is really variable. Collecting this info somewhere will be really helpful, I think.

I'm leaning towards using a state-level cluster [LONI, here in Louisiana; free] for this first run of a new lab. I also plan to look into JetStream, since phylogenetics student projects can run into runtime/wall clock limits.

naupaka commented 6 years ago

Also there's AWS Educate, which could be a viable solution if you don't need to rent really big hardware.

rachelss commented 6 years ago

We run a university-based RStudio server for Intro Bio. We don't need much in the way of resources, but we do need it set up and ready to go. I've hosted Shiny apps for playing with data by spinning up DigitalOcean servers just for in-class time. For genomic work we have a student cluster. I agree with @naupaka - it's cheaper to buy a system if you're using it a lot. RStudio provides RStudio Server Pro and Shiny Server Pro for free for teaching.

juefish commented 6 years ago

I teach two courses that use remote computing resources. For both of them, I use university servers we purchased for research and teaching.

The first is a general course in practical computing skills for science majors (mostly bio, marine science, and environmental science at my institution), in which I teach them UNIX, Python, and SQL. We use lab computers for all coding and platform-specific teaching, and use remote resources only for the final data science projects, as an introduction. I'm considering migrating the course to Raspberry Pis to avoid institutional IT, which can be difficult to coordinate with.

The second is a bioinformatics class. I use a different, high-memory machine for this one. We do genome assembly, assessment, and annotation with undergrads and grads here. I manage all software installation, and they run analyses and manage project folders. It works, but I'm not sure it's really the best system yet, and it's kind of a bear to manage. I tried using Docker instances on student computers before, with mixed success.

I just got free time from Google for educational stuff, but haven't tested it out yet. Might use it next spring.