ptrm closed this issue 4 years ago.
From https://boinc.berkeley.edu/wiki/Client_configuration:
- `--exit_when_idle`: Exit when there are no more tasks, and report completed tasks immediately. Typically used in combination with `--fetch_minimal_work`.
- `--fetch_minimal_work`: Fetch only enough jobs to use all device instances (CPU, GPU). Used with `--exit_when_idle`, the client will use all devices (possibly with a single multicore job), then exit when this initial set of jobs is completed.
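For reference, combining the two flags on the command line would look something like this (a sketch only; the exact binary name and data directory vary by install):

```shell
# One-shot run: fetch just enough jobs to fill all devices,
# then exit once that initial set of jobs completes.
boinc --fetch_minimal_work --exit_when_idle
```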
"possibly with a single multicore job" is a bit unclear, and sounds to me as if there were a possibility of fitting many single-core tasks onto each CPU core.
@ptrm could this be because the tasks were fetched without --fetch_minimal_work in place, and were then preserved when the client relaunched? This issue is also linked with #22
> could this be because the tasks were fetched without --fetch_minimal_work in place, then they are preserved and it's relaunched? This issue is also linked with #22
@chrisys I assume not, as I reflashed the SD image to the Pi 3's card, resulting in another machine being added to balena. The downloaded image already contained the --fetch_minimal_work param (EDIT: s/image/image of the version pushed to balena/). However, I will try aborting the current tasks and restarting the service.
And pardon the stream of consciousness here, but I've now discovered that setting CPU usage to 1/core_count maintains the one-task limit.
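The 1/core_count arithmetic for the override value can be sketched like this (the core count is hard-coded to 4 here for illustration, matching a Pi 3/4; in practice you'd use `$(nproc)`):

```shell
# Cap BOINC at a single core: max_ncpus_pct = 100 / core_count.
# A Raspberry Pi 3/4 exposes 4 cores, so the cap works out to 25%.
cores=4
pct=$(awk -v n="$cores" 'BEGIN { printf "%.6f", 100 / n }')
echo "$pct"
```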
```
root@9828088a588e:/usr/app/boinc# grep ncpus global_prefs_override.xml
<max_ncpus_pct>25.000000</max_ncpus_pct>
```
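For context, that line sits inside the standard BOINC override file, which would look roughly like this (a minimal sketch; only the max_ncpus_pct value is confirmed by the grep output above, the wrapper element is the usual global_prefs_override.xml root):

```xml
<global_preferences>
   <!-- Use at most 25% of the CPUs, i.e. one core of four. -->
   <max_ncpus_pct>25.000000</max_ncpus_pct>
</global_preferences>
```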
But then multicore tasks might get throttled.
Ok, so first I reset the ncpus setting (use at most this % of the CPUs) to the default 100%. Note the "Number of usable CPUs has changed from 1 to 4" message in the log. Then I aborted the three waiting tasks and restarted the boinc-client service to check whether more tasks would be requested, but no.
Then I aborted the last task, which resulted in boinc exiting due to the end of the job queue. After the boinc-client service restarted automatically, four tasks were eventually assigned to the client.
EDIT: However, the machine id is still the same on the Rosetta server, so the server might just be reassigning the existing remaining tasks. I will check that too.
Ok, so I reflashed the Pi 3's SD card; the device is recognised as new on Rosetta (previously I had merged the new and old machine entries) and on balena, and it still fetches four tasks initially.
@ptrm great work on the testing efforts here! I wonder, in light of what you've found, whether we should instead implement a setup that restricts the tasks to 25% CPU (1 core) for devices under 2.5GB of RAM? This would solve the issue of the container cycling and the web UI being inaccessible as a result.
@chrisys setting the percentage to that of a single core relative to all cores, in this case 25%, was the only thing that prevented reboot loops for me. So yes.
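The threshold idea discussed above could be sketched roughly like this (purely hypothetical shell, not the actual start.sh; the real script would read MemTotal from /proc/meminfo, a fixed 1GB example value is used here for illustration):

```shell
# Devices below a RAM threshold get capped to one core (25%).
THRESHOLD_KB=$((2560 * 1024))   # ~2.5 GB threshold, in kB
mem_kb=$((1024 * 1024))         # example value: a 1 GB Pi 3
                                # (real script: grep MemTotal /proc/meminfo)
if [ "$mem_kb" -lt "$THRESHOLD_KB" ]; then
  ncpus_pct="25.000000"
else
  ncpus_pct="100.000000"
fi
echo "$ncpus_pct"
```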
On the other hand, here is my Raspberry Pi 4 with 2GB RAM, which has been running since the morning, as the log indicates:
I personally changed the threshold in start.sh to 1.5GB.
I also still wonder about multithreaded tasks, but I have no knowledge of their RAM requirements and the like, or whether they ever reach devices with benchmarks akin to the Raspberry Pis'.
I'd be interested to get @xginn8 view on this thread as he had done a lot of the testing before release.
@chrisys My Jetson Nano doesn't seem to mind the default CPU workload (granted, it has more horsepower than a Pi), but it's not clear to me what "done" means in the console. Two days ago I had some "done", but those disappeared. Today I also saw some test.zip packages from Rosetta. There's definitely activity, but I'm finding it hard to track over time, and I wonder if the device is resetting itself.
@jtonello I believe 'Done' is the state before the results are uploaded, but yeah in the tests I'm aware of at least, we have not seen the rebooting issue exhibited on the Jetson Nano (presumably due to 4GB RAM).
@chrisys can we start a conversation with the upstream folks about the nature of these jobs generally? I think it'll go a long way towards understanding the most effective workarounds. I agree about the ambiguity in the wording surrounding multicore jobs. I had not seen the ncpus config option; that does seem like a better approach :facepalm:
@xginn8 @chrisys

> can we start a conversation with upstream folks about the nature of these jobs generally? I think it'll go a long way to understanding the most effective workarounds.
Thumbs up, especially as I seem to have found another case of a resource-overuse-induced hangup / reboot. Though this might as well be a case for a separate issue.
After some hours of running the boinc balena image on a Raspberry Pi 4 with 4GB RAM, the UI service hangs and refuses to start. Communication from the balena dashboard is limited to ssh access; other commands like reboot or restart return errors. Despite that, the boinc client itself runs smoothly, is accessible via remote connection from boinc manager and boinctui, and also makes regular contact with the Rosetta server, requesting tasks and sending back finished ones. I assume this might have to do with the excessive RAM usage of the tasks, which maybe has some way of being limited too, to leave e.g. 128MB for the UI.
Screenshots follow (I can't remember which one is the 4GB one ;) )
Side note: Is it in any way risky to post screenshots containing image hashes and short device id hashes?
@ptrm take a look at this PR; I took on board your comments about the 4GB device as well. I think reducing the memory usage to 95% should ensure enough remains free while still meeting the minimum requirements for the tasks on each device (1GB/2GB/4GB).
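If the PR applies that via the preferences override, the relevant fragment might look something like this (a sketch only; the tag names are the standard BOINC global_prefs_override.xml memory settings, and the 95% figure comes from the comment above):

```xml
<global_preferences>
   <!-- Let tasks use at most 95% of RAM, leaving headroom for the UI. -->
   <ram_max_used_busy_pct>95.000000</ram_max_used_busy_pct>
   <ram_max_used_idle_pct>95.000000</ram_max_used_idle_pct>
</global_preferences>
```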
@chrisys #27 looks great :)
In any case, I'll try for myself lowering the 1GB Pi 3's RAM usage to 85% and see what comes of it. I feel I'm wasting at least 1-2 cores which could run smaller tasks.
which results in reboots.
I tried to investigate whether any client / server config overrides might have happened, but found none. I assume the param should override the server settings at least; below is the global override, which contains only CPU usage settings. The file list is included in case you need any other content; I've got it tarballed on my laptop.