ptrm closed this issue 4 years ago.
From https://boinc.berkeley.edu/wiki/Client_configuration:
- `--exit_when_idle`: Exit when there are no more tasks, and report completed tasks immediately. Typically used in combination with `--fetch_minimal_work`.
- `--fetch_minimal_work`: Fetch only enough jobs to use all device instances (CPU, GPU). Used with `--exit_when_idle`, the client will use all devices (possibly with a single multicore job), then exit when this initial set of jobs is completed.
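For reference, combining the two flags on the command line would look something like this (a sketch only; the exact binary name and data directory vary by install):

```shell
# One-shot run: fetch just enough jobs to fill all devices,
# then exit once that initial set of jobs completes.
boinc --fetch_minimal_work --exit_when_idle
```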
"possibly with a single multicore job" is a bit unclear, and sounds to me as if there were a possibility of fitting many single-core tasks onto each CPU core.
@ptrm could this be because the tasks were fetched without --fetch_minimal_work in place, and were then preserved when the client relaunched? This issue is also linked with #22
> could this be because the tasks were fetched without --fetch_minimal_work in place, then they are preserved and it's relaunched? This issue is also linked with #22
@chrisys I assume not, as I reflashed the SD image to the Pi 3's card, resulting in another machine being added to balena. The downloaded image already contained the --fetch_minimal_work param (EDIT: s/image/image of the version pushed to balena/). However, I will try aborting the current tasks and restarting the service.
And pardon the stream of consciousness here, but I've now discovered that setting CPU usage to 1/core_count maintains the one-task limit.
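The 1/core_count arithmetic for the override value can be sketched like this (the core count is hard-coded to 4 here for illustration, matching a Pi 3/4; in practice you'd use `$(nproc)`):

```shell
# Cap BOINC at a single core: max_ncpus_pct = 100 / core_count.
# A Raspberry Pi 3/4 exposes 4 cores, so the cap works out to 25%.
cores=4
pct=$(awk -v n="$cores" 'BEGIN { printf "%.6f", 100 / n }')
echo "$pct"
```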
```
root@9828088a588e:/usr/app/boinc# grep ncpus global_prefs_override.xml
<max_ncpus_pct>25.000000</max_ncpus_pct>
```
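For context, that line sits inside the standard BOINC override file, which would look roughly like this (a minimal sketch; only the max_ncpus_pct value is confirmed by the grep output above, the wrapper element is the usual global_prefs_override.xml root):

```xml
<global_preferences>
   <!-- Use at most 25% of the CPUs, i.e. one core of four. -->
   <max_ncpus_pct>25.000000</max_ncpus_pct>
</global_preferences>
```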
But then multicore tasks might get throttled.
Ok, so first I reset the ncpus setting (use at most this % of the CPUs) to the default 100%. Note the "Number of usable CPUs has changed from 1 to 4" message in the log. Then I aborted the three waiting tasks and restarted the boinc-client service to check whether more tasks would be requested, but no.
Then I aborted the last task, which resulted in boinc exiting due to the end of the job queue. After the boinc-client service restarted automatically, four tasks were eventually assigned to the client.
EDIT: However, the machine id is still the same on the Rosetta server, so the server might just be reassigning the existing remaining tasks. I will check that too.
Ok, so I reflashed the Pi 3's SD card; the device is recognised as new on Rosetta (previously I had merged the new and old machine entries) and on balena, and it still fetches four tasks initially.
@ptrm great work on the testing efforts here! I wonder, in light of what you've found, whether we should instead implement a setup that restricts the tasks to 25% CPU (1 core) for devices under 2.5GB of RAM? This would solve the issue of the container cycling and the web UI being inaccessible as a result.
@chrisys setting the percentage to that of a single core relative to all cores, in this case 25%, was the only thing that prevented reboot loops for me. So yes.
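The threshold idea discussed above could be sketched roughly like this (purely hypothetical shell, not the actual start.sh; the real script would read MemTotal from /proc/meminfo, a fixed 1GB example value is used here for illustration):

```shell
# Devices below a RAM threshold get capped to one core (25%).
THRESHOLD_KB=$((2560 * 1024))   # ~2.5 GB threshold, in kB
mem_kb=$((1024 * 1024))         # example value: a 1 GB Pi 3
                                # (real script: grep MemTotal /proc/meminfo)
if [ "$mem_kb" -lt "$THRESHOLD_KB" ]; then
  ncpus_pct="25.000000"
else
  ncpus_pct="100.000000"
fi
echo "$ncpus_pct"
```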
On the other hand, here is my Raspberry Pi 4 with 2GB RAM, which has been running since the morning, as the log indicates:
I personally changed the threshold in start.sh to 1.5GB.
I also still wonder about multithreaded tasks, but I have no knowledge of their RAM requirements and the like, or whether they ever reach devices with benchmarks akin to the Raspberry Pis'.
I'd be interested to get @xginn8 view on this thread as he had done a lot of the testing before release.
@chrisys My Jetson Nano doesn't seem to mind the default CPU workload (granted, it has more horsepower than a Pi), but it's not clear to me what "done" means in the console. Two days ago I had some "done", but those disappeared. Today I also saw some test.zip packages from Rosetta. There's definitely activity, but I'm finding it hard to track over time, and I wonder if the device is resetting itself.
@jtonello I believe 'Done' is the state before the results are uploaded, but yeah in the tests I'm aware of at least, we have not seen the rebooting issue exhibited on the Jetson Nano (presumably due to 4GB RAM).
@chrisys can we start a conversation with the upstream folks about the nature of these jobs generally? I think it'll go a long way towards understanding the most effective workarounds. I agree about the ambiguity in the wording surrounding multicore jobs. I had not seen the ncpus config option; that does seem like a better approach :facepalm:
@xginn8 @chrisys

> can we start a conversation with upstream folks about the nature of these jobs generally? I think it'll go a long way to understanding the most effective workarounds.
Thumbs up, especially as I seem to have found another case of a resource-overuse-induced hangup / reboot. Though this might as well be a case for a separate issue.
After some hours of running the boinc balena image on a Raspberry Pi 4 with 4GB RAM, the UI service hangs and refuses to start. Communication from the balena dashboard is limited to ssh access; other commands like reboot or restart return errors. Despite that, the boinc client itself runs smoothly, is accessible via remote connection from boinc manager and boinctui, and also makes regular contact with the Rosetta server, requesting tasks and sending back finished ones. I assume this might have to do with the excessive RAM usage of the tasks, which maybe has some way of being limited too, to leave e.g. 128MB for the UI.
Screenshots follow (I can't remember which one is the 4GB one ;) )
Side note: Is it in any way risky to post screenshots containing image hashes and short device id hashes?
@ptrm take a look at this PR; I took on board your comments about the 4GB device as well. I think reducing the memory usage to 95% should ensure enough remains free while still meeting the minimum requirements for the tasks on each device (1GB/2GB/4GB).
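If the PR applies that via the preferences override, the relevant fragment might look something like this (a sketch only; the tag names are the standard BOINC global_prefs_override.xml memory settings, and the 95% figure comes from the comment above):

```xml
<global_preferences>
   <!-- Let tasks use at most 95% of RAM, leaving headroom for the UI. -->
   <ram_max_used_busy_pct>95.000000</ram_max_used_busy_pct>
   <ram_max_used_idle_pct>95.000000</ram_max_used_idle_pct>
</global_preferences>
```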
@chrisys #27 looks great :)
In any case, I'll try for myself lowering the 1GB Pi 3's RAM usage to 85% and see what comes of it. I feel I'm wasting at least 1-2 cores which could run smaller tasks.
which results in reboots.
I tried to investigate whether any client / server config overrides might have happened, but found none. I assume the param should override the server settings at least; below is the global override, which contains only CPU usage settings. The file list is included in case you need any other content; I've got it tarballed on my laptop.