Novartis / peax

Peax is a tool for interactive visual pattern search and exploration in epigenomic data, based on unsupervised representation learning with autoencoders.
http://peax.lekschas.de

How would I use this on HPC? #2

Open vsoch opened 5 years ago

vsoch commented 5 years ago

I stumbled on your software, and would be interested to use it on an HPC cluster (e.g., with job management via SLURM). Do you have documentation / protocol to do this? Is running the server (with the GUI) absolutely required?

flekschas commented 5 years ago

Thanks for reaching out!

May I ask what you'd like to use Peax for? Because the answer to "Is running the server (with the GUI) absolutely required?" heavily depends on that :)

I stumbled on your software, and would be interested to use it on an HPC cluster

It depends on what you want to do with Peax. Peax essentially consists of 3 parts:

  1. The code for training a convolutional autoencoder.

  2. The backend server (using Flask) for active learning, which covers everything from data preprocessing and data management to sampling and training a random forest classifier (see the sketch after this list).

  3. The frontend application (using React) for the user interface.
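
To make part 2 concrete, here's a rough sketch of the kind of active-learning round the server runs. The feature matrix, the binary labels, and the uncertainty-sampling strategy are illustrative stand-ins, not Peax's actual implementation:

```python
# Rough sketch of one active-learning round, not Peax's actual code.
# `features` is an (n_windows, n_features) matrix, `labels` holds 0/1
# labels, and `labeled_idx` indexes the windows the user has labeled.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def active_learning_round(features, labels, labeled_idx, n_queries=10):
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(features[labeled_idx], labels[labeled_idx])

    # Uncertainty sampling: windows with predicted probability closest
    # to 0.5 are the most informative ones to ask the user about next.
    proba = clf.predict_proba(features)[:, 1]
    uncertainty = np.abs(proba - 0.5)
    unlabeled = np.setdiff1d(np.arange(len(features)), labeled_idx)
    query_idx = unlabeled[np.argsort(uncertainty[unlabeled])[:n_queries]]
    return clf, query_idx
```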

The autoencoder part can most certainly be run on any HPC cluster. I actually submitted all jobs myself via Slurm. For example, if you take a look at experiments/jobs.py you'll see that several scripts will create these Slurm files for you. (That just makes me realize I need to replace the Slurm template 😅 with something local.)
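
For orientation, the generated files get submitted roughly like this; the partition, resources, and training command below are made-up placeholders, not what experiments/jobs.py actually emits:

```python
# Illustrative only: write a Slurm batch file and submit it with sbatch.
import subprocess
from pathlib import Path

def submit_training_job(dataset, epochs=25):
    script = f"""#!/bin/bash
#SBATCH --job-name=peax-ae-{dataset}
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --mem=32G
#SBATCH --time=12:00:00

python train.py --dataset {dataset} --epochs {epochs}
"""
    path = Path(f"peax-ae-{dataset}.slurm")
    path.write_text(script)
    subprocess.run(["sbatch", str(path)], check=True)
```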

If you want to interactively find patterns, you have to run the server. Having said that, I've realized that preparing the genome-wide data for multiple tracks tends to be time-consuming (depending on your machine), so I am thinking of outsourcing the preprocessing so that it can be done on the cluster. If that's what you're asking for, then I can try to get it working in the near future.

If you just want to use the autoencoders to find similar regions in DNase or histone mark ChIP-seq data, then this can be run on a cluster pretty easily. I don't have a script that does just that, but it's super simple to implement as all the functions are essentially in place.
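
It boils down to something like the following sketch, where `encoder` stands for the encoder half of a trained autoencoder and `windows` for your binned signal; the names and shapes are assumptions, not an existing script:

```python
# Sketch only: embed all windows with the encoder half of a trained
# autoencoder and look up the nearest neighbors of a query region.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def find_similar(encoder, windows, query_window, k=10):
    latent = encoder.predict(windows)            # (n_windows, latent_dim)
    query = encoder.predict(query_window[None])  # add a batch dimension
    nn = NearestNeighbors(n_neighbors=k).fit(latent)
    _, idx = nn.kneighbors(query)
    return idx[0]  # indices of the k most similar windows
```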

Let me know what you're aiming at and I can provide some more detailed information.

vsoch commented 5 years ago

I'm a general RSE, so I don't have specific analysis goals, but I'm a big fan of open source software. I saw Peax and know it would be hugely valuable for the user base I help, and it's something we could also share on AskCI (http://ask.cyberinfrastructure.org). I think if we could train the autoencoder and do any data preprocessing with HPC, that would be ideal, and I'd absolutely appreciate it if you put that in the queue for the near future. I'd also like to offer to help put it all together at the end, test it, and write the walkthrough (basically however I can :O) )

flekschas commented 5 years ago

Got it!

I'll clean up the code in experiments to provide a working walkthrough example. You can get an idea of what is needed if you take a look at some of the notebooks, e.g., DNase-seq 3KB.
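
To give a flavor of what the notebook covers, here's a bare-bones 1D convolutional autoencoder in Keras; the input length and all layer sizes are made-up examples, not the exact architecture we use:

```python
# Bare-bones 1D convolutional autoencoder in the spirit of Peax's models.
# The input length (120 bins) and all layer sizes are illustrative only.
from tensorflow import keras
from tensorflow.keras import layers

def build_autoencoder(n_bins=120, latent_dim=10):
    inputs = keras.Input(shape=(n_bins, 1))
    x = layers.Conv1D(32, 9, activation="relu", padding="same")(inputs)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Conv1D(16, 9, activation="relu", padding="same")(x)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Flatten()(x)
    latent = layers.Dense(latent_dim, name="latent")(x)

    x = layers.Dense((n_bins // 4) * 16, activation="relu")(latent)
    x = layers.Reshape((n_bins // 4, 16))(x)
    x = layers.UpSampling1D(2)(x)
    x = layers.Conv1D(32, 9, activation="relu", padding="same")(x)
    x = layers.UpSampling1D(2)(x)
    outputs = layers.Conv1D(1, 9, activation="sigmoid", padding="same")(x)

    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```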

The server application is designed to be used locally so that all the search data (meaning the classifiers you have built and the regions you have labeled) stays on the local machine. This is not a hard constraint, but right now we don't have any sophisticated database system to support multiple users (we use SQLite to store results permanently, which works fine for a single user). Implementing multi-user support wouldn't be too complicated, but we have no plans to do that right now.
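
To illustrate how lightweight that persistence is, it's roughly this kind of thing (the schema below is made up for illustration, not our actual one):

```python
# Made-up schema for illustration; not Peax's actual table layout.
import sqlite3

con = sqlite3.connect("search.db")
con.execute(
    "CREATE TABLE IF NOT EXISTS labels "
    "(region_id INTEGER PRIMARY KEY, chrom TEXT, "
    "start INTEGER, stop INTEGER, label INTEGER)"
)
# Store one labeled window (a positive match on chr1:100000-103000)
con.execute(
    "INSERT OR REPLACE INTO labels VALUES (?, ?, ?, ?, ?)",
    (1, "chr1", 100000, 103000, 1),
)
con.commit()
```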

Having said that, I think the best setup is:

  1. Run autoencoder training on the cluster
  2. Run the data preprocessing either on the cluster or the client machine
  3. Run the Peax server on the client machine, after having downloaded or loaded the preprocessed data.
  4. Run the front end on the client machine as well. This is not strictly necessary, but it keeps the setup simple, and the overhead is so tiny it doesn't really matter.

For 1, I only need to provide an example and do some cleanup, as mentioned above, so that's easy to add. For 2, I need to refactor the code a little bit and think about how to make the importing/loading of preprocessed data simple.

What do you think about this setup?

vsoch commented 5 years ago

hey @flekschas this sounds great! I know notebooks are popular, but for a reproducible set of steps it would be good to convert them to actual scripts with functions that can be reused in user scripts, containers, etc. Given some data to test with, I can offer to help if you wind up essentially converting a notebook to scripts.
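
For example, a thin CLI entry point along these lines would make the steps scriptable and container-friendly; `train_autoencoder` here is a hypothetical stand-in for whatever currently lives in the notebook cells:

```python
# Sketch of a notebook-to-script conversion: a CLI entry point that can
# be called from a job script or a container.
import argparse

def main():
    parser = argparse.ArgumentParser(description="Train a Peax autoencoder")
    parser.add_argument("--data", required=True, help="path to binned signal")
    parser.add_argument("--epochs", type=int, default=25)
    parser.add_argument("--out", default="autoencoder.h5")
    args = parser.parse_args()

    from train import train_autoencoder  # hypothetical module
    model = train_autoencoder(args.data, epochs=args.epochs)
    model.save(args.out)

if __name__ == "__main__":
    main()
```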

SQLite can (surprisingly) handle more than one user, but of course it's not meant to scale the way Postgres or MySQL do. I just glanced at the code, and you're using Flask! This should be very easy to move over to a container deployment that then allows the user to run with Postgres. How about, after the steps you outlined above are finished, I do a PR to containerize it and update the database backend options?
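
Concretely, I'm imagining the app reading the database URL from the environment, so the same code runs on SQLite locally and Postgres in a container. This sketch assumes a SQLAlchemy-style setup, which may not match how Peax talks to SQLite today:

```python
# Sketch: choose the database via an environment variable so the same
# app runs on SQLite locally and Postgres in a container.
import os
from flask import Flask
from sqlalchemy import create_engine

app = Flask(__name__)
db_url = os.environ.get("DATABASE_URL", "sqlite:///peax.db")
engine = create_engine(db_url)
# e.g. DATABASE_URL=postgresql://peax:secret@db:5432/peax via docker-compose
```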

When you mention running the Peax server and front end separately: aren't they a package deal (via Flask)? Ideally, you would finish up with the data and then run a docker or docker-compose set of containers that brings up the database, server, and frontend. And for the first two points, it should still work to run them off the cluster (albeit a lot slower!), but it's important that users can choose based on the resources available to them.

So to give feedback on your plan: I think it's good! Steps 1 and 2 cover the cluster side, and once that's tested, the server/application can be updated. It sounds like you already have the refactoring bit in your head, and I can again offer to help with some of the Flask/database work; I've done it a few times before.