balena-labs-projects / rosetta-at-home

On low memory devices new tasks are not fetched after completing first job #22

Closed chrisys closed 4 years ago

chrisys commented 4 years ago

Because --fetch_minimal_work is used on lower-memory devices, the client does not request a new task after finishing the first one until the service or the device is rebooted.

We can resolve this with --exit_when_idle, which should cause the container to restart and grab a new job.
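
As a rough illustration of the idea, the flag selection could look something like the sketch below. The function names and the `LOW_MEM_THRESHOLD_KB` variable are hypothetical, and the 2.5 GB default is only borrowed from a figure mentioned later in this thread; the project's real start.sh may do this differently.

```shell
#!/bin/sh
# Sketch only: LOW_MEM_THRESHOLD_KB is an assumed cut-off (about 2.5 GB),
# not a value confirmed by the project.
LOW_MEM_THRESHOLD_KB="${LOW_MEM_THRESHOLD_KB:-2500000}"

# Print the extra client flags for a device with the given total memory (kB).
boinc_flags_for_mem() {
    mem_kb="$1"
    if [ "$mem_kb" -lt "$LOW_MEM_THRESHOLD_KB" ]; then
        # Low memory: fetch only minimal work, and exit once idle so that
        # the container restart triggers a fresh work request.
        echo "--fetch_minimal_work --exit_when_idle"
    else
        # Enough memory: no extra restrictions.
        echo ""
    fi
}

# Read total memory in kB from /proc/meminfo (Linux only).
total_mem_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}

# Example call, commented out so the sketch has no side effects.
# The client binary name "boinc" is an assumption:
# boinc $(boinc_flags_for_mem "$(total_mem_kb)")
```

With this shape, a device below the threshold gets both flags together, so the "one task only" restriction never applies without the matching exit-and-restart behaviour.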

ptrm commented 4 years ago

On my Raspberry Pi 4 (2GB RAM) I successfully ran without the --fetch_minimal_work param, and sometimes two tasks would run just fine, depending on their memory requirements. The remaining tasks of the four were in the WMem state and resumed right after the previous ones finished. Maybe the way to go is to let BOINC decide how many tasks can be run?

ptrm commented 4 years ago

Below are screenshots of how BOINC behaves without --fetch_minimal_work on a 2GB Raspberry Pi 4:

[Screenshot from 2020-04-21 15-47-37]
[Screenshot from 2020-04-21 15-47-12]

chrisys commented 4 years ago

@ptrm in our pre-release testing we found that some tasks would work fine as you're seeing, but sometimes on these lower memory devices they would get into a situation where they would continuously reboot and be unable to progress which is why this feature was implemented. Keep us posted on how your device performs over the next couple of days! It would be great to remove this restriction or lower the memory value at which it applies if devices are now stable.

ptrm commented 4 years ago

in our pre-release testing we found that some tasks would work fine as you're seeing, but sometimes on these lower memory devices they would get into a situation where they would continuously reboot and be unable to progress which is why this feature was implemented

Indeed. On the Pi 3 with 1GB, with two tasks running and one waiting, I am now getting reboots every several minutes during the first hours of the tasks (started around 13:00 UTC). The Pi 4 still keeps running, though.

ptrm commented 4 years ago

--exit_when_idle results in querying the server around twice a minute, as <1GB tasks are rare. It's OK for me: I managed to get tasks after 10 minutes, but changing settings via a remote manager was a pain due to the restarts. Also, the Rosetta servers might not be happy about the extra traffic, if the impact of Pi 3s is significant at all.

(below is filtered by "tasks" and "exit" keywords)

21.04.20 22:28:08 (+0200)  boinc-client  21-Apr-2020 20:28:08 [---] exiting because no more results
21.04.20 22:28:08 (+0200)  boinc-client  21-Apr-2020 20:28:08 [---] Time to exit
21.04.20 22:28:16 (+0200) Service exited 'boinc-client sha256:foobar'
21.04.20 22:28:23 (+0200)  ui  2020/04/21 20:28:23 Command exited for: 192.168.0.109:33418
21.04.20 22:28:08 (+0200)  boinc-client  21-Apr-2020 20:28:08 [Rosetta@home] Scheduler request completed: got 0 new tasks
21.04.20 22:28:08 (+0200)  boinc-client  21-Apr-2020 20:28:08 [Rosetta@home] No tasks sent
21.04.20 22:28:08 (+0200)  boinc-client  21-Apr-2020 20:28:08 [---] exiting because no more results
21.04.20 22:28:08 (+0200)  boinc-client  21-Apr-2020 20:28:08 [---] Time to exit
21.04.20 22:28:40 (+0200)  boinc-client  21-Apr-2020 20:28:40 [---] Config: exit when idle
21.04.20 22:28:40 (+0200)  boinc-client  21-Apr-2020 20:28:40 [---] Config: report completed tasks immediately
21.04.20 22:28:40 (+0200)  boinc-client  21-Apr-2020 20:28:40 [---] Checking active tasks
21.04.20 22:28:40 (+0200)  boinc-client  21-Apr-2020 20:28:40 [Rosetta@home] Requesting new tasks for CPU
21.04.20 22:28:43 (+0200)  boinc-client  21-Apr-2020 20:28:43 [Rosetta@home] Scheduler request completed: got 0 new tasks
21.04.20 22:28:43 (+0200)  boinc-client  21-Apr-2020 20:28:43 [Rosetta@home] No tasks sent
21.04.20 22:28:43 (+0200)  boinc-client  21-Apr-2020 20:28:43 [---] exiting because no more results
21.04.20 22:28:43 (+0200)  boinc-client  21-Apr-2020 20:28:43 [---] Time to exit
21.04.20 22:28:50 (+0200) Service exited 'boinc-client sha256:foobar'
21.04.20 22:28:43 (+0200)  boinc-client  21-Apr-2020 20:28:43 [Rosetta@home] Scheduler request completed: got 0 new tasks
21.04.20 22:28:43 (+0200)  boinc-client  21-Apr-2020 20:28:43 [Rosetta@home] No tasks sent
21.04.20 22:28:43 (+0200)  boinc-client  21-Apr-2020 20:28:43 [---] exiting because no more results
21.04.20 22:28:43 (+0200)  boinc-client  21-Apr-2020 20:28:43 [---] Time to exit
21.04.20 22:29:39 (+0200)  boinc-client  21-Apr-2020 20:29:39 [---] Config: exit when idle
21.04.20 22:29:39 (+0200)  boinc-client  21-Apr-2020 20:29:39 [---] Config: report completed tasks immediately
21.04.20 22:29:39 (+0200)  boinc-client  21-Apr-2020 20:29:39 [---] Checking active tasks
21.04.20 22:29:39 (+0200)  boinc-client  21-Apr-2020 20:29:39 [Rosetta@home] Requesting new tasks for CPU
21.04.20 22:29:41 (+0200)  boinc-client  21-Apr-2020 20:29:41 [Rosetta@home] Scheduler request completed: got 0 new tasks
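
To put a number on the request rate, a saved copy of a log like the one above could be counted with something along these lines. This is a minimal sketch: the function name and the idea of saving the log to a file are assumptions, and `sort -u` is there only because filtering by two keywords makes some lines appear twice.

```shell
#!/bin/sh
# Count distinct scheduler requests in a saved log file.
# sort -u drops lines that are repeated verbatim before counting.
count_scheduler_requests() {
    sort -u "$1" | grep -c 'Scheduler request completed'
}

# Example call (hypothetical file name), commented out:
# count_scheduler_requests boinc-client.log
```

Dividing the count by the number of minutes the log covers gives an approximate requests-per-minute figure.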

chrisys commented 4 years ago

It appeared to depend on the type of task sent, it was definitely observed that some devices with <2.5GB memory would be fine, and some would get into the reboot cycle state (including Pi 4s).

We could try a sleep period in start.sh which would prevent it cycling so quickly, enabling it to poll for tasks say every 10 minutes, but in that sleep period the web ui would not function since the client isn't running.
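
A minimal sketch of that sleep idea, assuming a hypothetical `run_then_pause` helper and a `POLL_INTERVAL` variable (neither exists in the project as far as this thread shows):

```shell
#!/bin/sh
# Hypothetical helper for start.sh: run the client once, then pause before
# returning, so the supervisor cannot restart the container in a tight loop.
POLL_INTERVAL="${POLL_INTERVAL:-600}"   # seconds; 600 = the 10 minutes suggested above

run_then_pause() {
    "$@"                # run the client command in the foreground
    status=$?
    echo "client exited with status $status; sleeping ${POLL_INTERVAL}s before restart"
    sleep "$POLL_INTERVAL"
    return "$status"
}

# Example call, commented out so the sketch has no side effects.
# The client binary name "boinc" is an assumption:
# run_then_pause boinc --exit_when_idle --fetch_minimal_work
```

The trade-off is the one already noted: while the script sleeps the client is not running, so the web UI and remote managers cannot reach it.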

ptrm commented 4 years ago

It appeared to depend on the type of task sent, it was definitely observed that some devices with <2.5GB memory would be fine, and some would get into the reboot cycle state (including Pi 4s).

Completely understood. How about a system/service variable to override this?
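
Such an override might be sketched as below. `ALLOW_FULL_WORK_FETCH` is an invented variable name used purely for illustration, and the 2.5 GB default threshold is an assumption taken from the figure quoted above.

```shell
#!/bin/sh
# Decide whether the low-memory restrictions should apply.
# Returns 0 (true) when the restrictions should be used.
# ALLOW_FULL_WORK_FETCH is a hypothetical opt-out variable.
use_minimal_work() {
    mem_kb="$1"
    threshold_kb="${2:-2500000}"   # assumed cut-off in kB
    if [ "${ALLOW_FULL_WORK_FETCH:-0}" = "1" ]; then
        return 1                   # user explicitly opted out
    fi
    [ "$mem_kb" -lt "$threshold_kb" ]
}

# Example use, commented out:
# if use_minimal_work "$mem_kb"; then flags="--fetch_minimal_work --exit_when_idle"; fi
```

Keeping the restriction as the default and making the override opt-in would leave devices that hit the reboot cycle protected unless their owner chooses otherwise.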

We could try a sleep period in start.sh which would prevent it cycling so quickly, enabling it to poll for tasks say every 10 minutes, but in that sleep period the web ui would not function since the client isn't running.

In the case of continuous boinc-client restarts, I also had trouble finding a long enough window to set anything in the client before boinctui or my PC's BOINC Manager reported it was offline, so it was virtually the same ;)