carlofazioli / cardiathena

A project to study strategies in the game of hearts, using distributed computing, AI, and data analytics.
GNU General Public License v3.0

Too Many Jobs on Argo #95

Closed irobert4 closed 4 years ago

irobert4 commented 4 years ago

Description

David recently got a cease and desist email from Argo support staff about too many jobs being run on the cluster. We need to create a more Argo-friendly version of play_hearts.

Tasks

My suggestion is that we:

davidjha commented 4 years ago

I created another branch: multiprocess-hearts-#95. I created a multi-process version of play_hearts.py called play_hearts_multi.py. This version runs each game in its own process, and if my understanding of how Python's multiprocessing.Process() works is correct, each game should get its own CPU, or at least run in parallel in the background. However, each game runs so quickly (the games finish almost in sequential order) that a simple for loop may be more efficient, given the overhead of creating new processes.
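For reference, a minimal sketch of the one-process-per-game idea, assuming a hypothetical play_game() entry point (the actual function in play_hearts_multi.py may differ):

```python
# Minimal sketch: run each game in its own process via multiprocessing.Process().
# play_game() is a placeholder, not this repo's API.
from multiprocessing import Process

def play_game(game_id):
    # Placeholder: play one full game of Hearts and record the result.
    print(f"finished game {game_id}")

if __name__ == "__main__":
    procs = [Process(target=play_game, args=(i,)) for i in range(8)]
    for p in procs:
        p.start()   # each game starts in its own OS process
    for p in procs:
        p.join()    # wait for all games to finish
```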

Also adjusted start_game.sh to include a for loop.

carlofazioli commented 4 years ago

It could be worthwhile asking the Argo staff how long they think a "long running" job is. University clusters routinely run simulations or data analytics for faculty members that can take 1, 10, or maybe 100 hours.

My intuition guides me to a solution where you have a single-threaded, single-process job that runs a loop. This loop plays the game over and over, maybe just inserting into SQL after every game, or alternatively keeping a local buffer of python game results and then inserting them periodically as a chunk. DBs usually have efficient bulk-write operations. The loop upper bound would be like N = 100,000 or even N = 1,000,000.
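As a rough sketch of that loop (play_game() and insert_many() below are stand-in stubs, not existing project code):

```python
# Rough sketch of the single-process loop: play N games, buffer results
# locally, and flush them to the DB in chunks with a bulk write.
def play_game(i):
    return (i, "winner", 0)        # stub: pretend result row for game i

def insert_many(rows):
    pass                           # stub: would issue one bulk INSERT/executemany

N = 100_000
CHUNK = 1_000
buffer = []

for i in range(N):
    buffer.append(play_game(i))    # one game per loop iteration
    if len(buffer) >= CHUNK:
        insert_many(buffer)        # one bulk write instead of CHUNK single inserts
        buffer.clear()

if buffer:
    insert_many(buffer)            # flush any remainder
```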

Multi-threading and multi-processing are powerful techniques, but also not as off-the-shelf as one might want.

Some general guidelines are that

davidjha commented 4 years ago

Long-running jobs aren't the issue we are facing. I believe Carlo is correct that the typical usage on university clusters can span many hours.

I chose not to have the data inserted into the DB after every game because the server could not handle the throughput that approach caused. I elected to go the bulk-write route, as it is more efficient and can be done in batches. However, instead of keeping the results in an in-memory cache, I simply write them to files and then load those files into the DB. MySQL can load data from CSV files faster than it can process individual INSERT statements. Having these files around also provides some redundancy in case the DB decides to go belly up.
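A minimal sketch of that batch-to-CSV flow (play_game() and the table/column names here are placeholders, not this repo's schema):

```python
# Sketch: write one CSV of game results, then bulk-load it into MySQL.
# play_game() and the table/column names are placeholders, not this repo's schema.
import csv

def play_game(game_id):
    return (game_id, "agent_random", 42)   # stub result row for one game

BATCH_SIZE = 10_000

with open("hearts_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["game_id", "agent", "score"])   # header row
    for game_id in range(BATCH_SIZE):
        writer.writerow(play_game(game_id))

# The file can then be bulk-loaded, which is typically much faster than
# issuing one INSERT per game, e.g.:
#   LOAD DATA LOCAL INFILE 'hearts_results.csv'
#   INTO TABLE hearts_results
#   FIELDS TERMINATED BY ',' IGNORE 1 LINES;
```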

Threads are part of one process (a running program) and share its data (global variables, for instance) and address space (the block of memory the process lives in). Creating and switching between threads is less expensive because less data has to be fetched and copied from memory.

Processes don't share data or address space with each other; they are separate running programs. Creating and switching between processes (which requires the OS to step in and perform a context switch) is more taxing, as more data needs to be fetched and copied.
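A small illustration of that sharing difference (not project code): a thread sees the parent's globals, while a separate process works on its own copy.

```python
# Illustration only: threads share the parent's memory; processes do not.
import threading
import multiprocessing

counter = 0

def bump():
    global counter
    counter += 1

if __name__ == "__main__":
    t = threading.Thread(target=bump)
    t.start(); t.join()
    print("after thread:", counter)    # 1 -- the thread updated the shared global

    p = multiprocessing.Process(target=bump)
    p.start(); p.join()
    print("after process:", counter)   # still 1 -- the child only changed its own copy
```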

Our intent is to save time with multi-processing. I think threading would complicate things more than necessary for Hearts, whereas creating multiple instances of Hearts to run in parallel is much simpler, and since Hearts is lightweight enough, it shouldn't be taxing. I'm trying a few things to see how much work can get done in less time.
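If the goal is just to run many independent games across the available CPUs, a multiprocessing.Pool is one simple option (sketch only; play_game() is again a placeholder):

```python
# Sketch: run many independent Hearts games across CPUs with a process pool.
# play_game() is a placeholder, not this repo's API.
from multiprocessing import Pool, cpu_count

def play_game(game_id):
    return (game_id, "done")   # placeholder for playing one full game

if __name__ == "__main__":
    with Pool(processes=cpu_count()) as pool:
        # One worker per CPU; the pool reuses workers, so process-creation
        # overhead is paid once rather than once per game.
        results = pool.map(play_game, range(1_000))
    print(len(results), "games played")
```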

I haven't seen Black Mirror, but I have heard good things!

davidjha commented 4 years ago

Multi-processing isn't working correctly (incorrect implementation). We're going to fall back to a loop, and also try submitting multiple play_hearts jobs and have Slurm run each of them on its own CPU.