ImperialCollegeLondon / champ

BSD 3-Clause "New" or "Revised" License
4 stars 5 forks source link

JOSS Review Feedback #70

Open thurber opened 3 years ago

thurber commented 3 years ago

Hi CHAMP team, I'm a reviewer for your JOSS submission tracked here. I support my team in their use of HPC resources but don't manage any clusters myself, so I'll approach my review from that perspective. For brevity, I'll aggregate my feedback into this single issue, but feel free to open new issues from this ticket as appropriate if that helps your process.

First of all, nice work! I appreciate the clean simplicity and responsiveness of the interface, and believe CHAMP could be a very useful tool in supporting computational research efforts.

Documentation

Testing

Other thoughts

cc-a commented 3 years ago

Hi @thurber thanks for your review. I've opened a couple of issues where code changes were required and have started to pull things together in the joss-review branch. In response to your other comments...

* [ ]  The links to the example configuration are broken.

Thanks for pointing this out. Incorrect visibility on the target repository have now been updated.

Testing

* [ ]  Having a test instance of CHAMP running somewhere that potential users or curious onlookers could try out the job submission and results viewing functionality would be helpful. Alternatively, providing a docker-compose that included an OOD container with a simple job preconfigured could accomplish the same goal. I am currently attempting to set up an OOD container, so I can let you know how it goes.

This is an interesting suggestion but quite challenging I think. To take the second suggestion first, I'm not sure how much one would gain by having an OOD instance running in a container. Without an actual HPC service backing OOD testing e.g. job submission, would still not be possible. I'd be very interested to hear how you get on with setting it up in a container, I suspect it'll be quite tricky.

I do like the idea of having a test instance that people could submit against. Running this against an actual HPC service is probably out of the question, at least at Imperial there would be concerns about this.

I wonder if there is anything that could be built from ubcccr/hpc-toolset-tutorial? It's a model HPC system built from containerised components (including an OOD instance). As a starting point it would provide a mock cluster to deploy CHAMP into and actually allow submitting jobs etc. It could form either the basis for local test instances or a demo server (or maybe run on Binder?). It's pretty heavy weight though so not sure how practical it would be.

This would be quite fun to explore but in my opinion would exceed reasonable expectations for the amount of work this review should generate.

* [ ]  This is more of a brainstorm than a request, but it would be really neat if I could run CHAMP on a system I have control of, but it could interact with and submit jobs to an HPC cluster that I do NOT have control of. In other words, I as the admin could define not only the job scripts, but also the scripts necessary to communicate with the actual HPC cluster on behalf of the user. This would be particularly helpful in the case of a team without much scripting experience that makes heavy use of a third-party computing resource.

An interesting thought and makes me wonder if there is any milleage in a refactor to decouple the web front-end from the job submission system. In theory anybody could then develop their own job submission backend that does what they want e.g. submit remotely to a HPC, run workloads in the cloud or with their favourite workflow system etc. Doing remote submission to HPC clusters as a non-privileged user is very difficult to do reliably in my experience so not something I'd want to do within CHAMP itself. Would have to think a bit about how interacting with remote file systems/fetching results might work.

thurber commented 3 years ago

thanks @cc-a!

I haven't had time to explore the local HPC cluster setup yet, but I do think it may be necessary to provide some testing capabilities of the HPC integration available in some form though.

acrlakshman commented 3 years ago

Hi @cc-a while doing initial testing by following the readme without going deep into the code, I get the following error, can you provide your input

Creating test database for alias 'default'...
System check identified no issues (0 silenced).
.............Exception during job submission
Traceback (most recent call last):
  File "/home/app/main/views.py", line 75, in create_job
    job = Job.objects.create_job(
  File "/home/app/main/models/__init__.py", line 88, in create_job
    job_id = scheduler.submit(script_path, job.work_dir)
  File "/home/app/main/tests/scheduler_mock.py", line 87, in raise_scheduler_error
    raise SchedulerError(SCHEDULER_ERROR_MESSAGE)
main.scheduler.SchedulerError: error_message
..............................................................
----------------------------------------------------------------------
Ran 75 tests in 2.463s

OK
Destroying test database for alias 'default'...

I launched a container and installed all requirements, followed by running python manage.py test, if I need to start server to resolve this, please let me know, I shall take a look at this tomorrow.

cc-a commented 3 years ago

Hi @acrlakshman. All tests are actually passing there and seem to be working as expected. That output is from a logged but handled exception. The log handler prints it to the terminal in addition to the testing output. The logger output could be disabled when the tests are run but the mechanism for doing this is a bit cumbersome.

acrlakshman commented 3 years ago

Hi @acrlakshman. All tests are actually passing there and seem to be working as expected. That output is from a logged but handled exception. The log handler prints it to the terminal in addition to the testing output. The logger output could be disabled when the tests are run but the mechanism for doing this is a bit cumbersome.

Thanks for the reply @cc-a. Can you please make note of this in a new subsection in README or somewhere of your choice, either as FAQ or NOTES.

I also faced an issue while using the Dockerfile/docker-compose, I will provide those points later today.

cc-a commented 3 years ago

Thanks for the reply @cc-a. Can you please make note of this in a new subsection in README or somewhere of your choice, either as FAQ or NOTES.

On balance I think I prefer the slightly cumbersome silencing of the logging to muddying the documentation so I'll do that.

cc-a commented 3 years ago

I haven't had time to explore the local HPC cluster setup yet, but I do think it may be necessary to provide some testing capabilities of the HPC integration available in some form though.

I actually started looking into this as well and think I have something close to finished that should allow testing against a demo slurm cluster.

thurber commented 3 years ago

hi @cc-a, any luck with testing against a demo slurm cluster?

cc-a commented 2 years ago

I've (finally) managed to get a working setup for the demo cluster that also plays nicely with GitHub actions. Available for preview in the joss-review branch. Instructions for use have been added to the README under the development guide section.

cc-a commented 2 years ago

@acrlakshman Logging messages are now suppressed when running tests in the joss-review branch.

thurber commented 2 years ago

@cc-a nice work getting a docker-compose environment set up for testing CHAMP! This is cool enough that I recommend adding a paragraph about it in the paper. One minor issue in the documentation for launching the demo cluster: for the final command to work for me I had to use: docker-compose -f demo_cluster/docker-compose.demo.yaml run --service-ports ondemand in order to actually expose the ports to my host machine. Otherwise, I'm satisfied with my portion of the review!