389ds / 389-ds-base

The enterprise-class Open Source LDAP server for Linux
https://www.port389.org/

Issue 6229 - After an initial failure, subsequent online backups fail #6230

Closed progier389 closed 6 days ago

progier389 commented 2 weeks ago

Several issues related to backup task error handling:

- Backends stay busy after the failure
- Exit code is 0 in some cases
- Crash if failing to open the backup directory

And a more general one:

- lib389 Task DN collision

Solutions:

- Always reset the busy flags that have been set
- Ensure that 0 is not returned in error case
- Avoid closing NULL directory descriptor
- Use a timestamp having milliseconds precision to create the task DN
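The first two points can be illustrated with a minimal Python sketch. It is not the server code (which is written in C); `Backend`, `backup_all` and `do_backup` are hypothetical names used only to show the pattern: remember which busy flags this task set, clear exactly those on every exit path, and never return 0 when something failed.

```python
class Backend:
    """Hypothetical stand-in for a server backend carrying a busy flag."""

    def __init__(self, name, busy=False):
        self.name = name
        self.busy = busy


def backup_all(backends, do_backup):
    """Return 0 on success and -1 on any failure, never leaving our flags set."""
    marked = []  # only the backends that this task marked busy
    try:
        for be in backends:
            if be.busy:
                # Owned by another task: fail, but do not touch its flag.
                return -1
            be.busy = True
            marked.append(be)
        do_backup(backends)  # may raise, e.g. if the directory cannot be created
        return 0
    except OSError:
        return -1
    finally:
        # Runs on every exit path, including the early return above.
        for be in marked:
            be.busy = False
```

With this shape, a second backup attempted after a failed one finds no stale busy flag, which is the symptom described in the issue title.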

Issue: #6229

Reviewed by: @droideck (Thanks!)

progier389 commented 2 weeks ago

Increased the precision of the timestamp used to generate the task CN: with one-second precision, the CI test fails randomly because of task DN collisions.

progier389 commented 2 weeks ago

It looks like there is another race condition: the second backup task still sometimes fails:

           exitCode = tasks.db2bak(backup_dir=archive_dir2, args={TASK_WAIT: True})
>           assert exitCode == 0
E           assert -1 == 0

progier389 commented 2 weeks ago

It looks like I did not fix the right place: it is the same task name conflict (and I do not see sub-second precision in the task CN):

[18/Jun/2024:11:33:06.073635310 +0000] conn=1 op=6 ADD dn="cn=backup_06182024_113306,cn=backup,cn=tasks,cn=config"
[18/Jun/2024:11:33:06.078141131 +0000] conn=1 op=6 RESULT err=0 tag=105 nentries=0 wtime=0.000214625 optime=0.004512012 etime=0.004725204
...
[18/Jun/2024:11:33:06.282371662 +0000] conn=1 op=10 ADD dn="cn=backup_06182024_113306,cn=backup,cn=tasks,cn=config"
[18/Jun/2024:11:33:06.283060970 +0000] conn=1 op=10 RESULT err=68 tag=105 nentries=0 wtime=0.000159735 optime=0.000696302 etime=0.000854243
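The collision in this log (two ADD operations about 200 ms apart, the second rejected with err=68, the LDAP "entry already exists" result code) can be reproduced with a small sketch. The `task_cn` helper below is hypothetical and is not the lib389 implementation; the seconds-only format is inferred from the CN in the log, and the millisecond suffix is just one possible way to add sub-second precision.

```python
from datetime import datetime


def task_cn(prefix, now=None, millis=True):
    """Build a task CN from a timestamp (hypothetical helper, not lib389 code)."""
    if now is None:
        now = datetime.now()
    stamp = now.strftime("%m%d%Y_%H%M%S")  # format inferred from the log above
    if millis:
        # Assumed suffix: zero-padded milliseconds.
        stamp += "_%03d" % (now.microsecond // 1000)
    return "%s_%s" % (prefix, stamp)


# The two ADD timestamps from the access log, truncated to microseconds.
first = datetime(2024, 6, 18, 11, 33, 6, 73635)
second = datetime(2024, 6, 18, 11, 33, 6, 282371)

print(task_cn("backup", first, millis=False))   # backup_06182024_113306
print(task_cn("backup", second, millis=False))  # backup_06182024_113306 (collision)
print(task_cn("backup", first))                 # backup_06182024_113306_073
print(task_cn("backup", second))                # backup_06182024_113306_282
```

Millisecond precision narrows the collision window but does not remove it: two tasks created within the same millisecond would still get the same CN.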

progier389 commented 2 weeks ago

The test is still failing. Now I think it is related to private tmp:

[20/Jun/2024:13:17:36.826947195 +0000] - ERR - ldbm_back_ldbm2archive - mkdir(/tmp/tmpoyb8i9aj/bak2) failed; errno 2 (Unexpected dbimpl error code)
[20/Jun/2024:13:17:36.827410826 +0000] - ERR - ldbm_back_ldbm2archive - Failed removing /tmp/tmpoyb8i9aj/bak2
[20/Jun/2024:13:17:36.828107407 +0000] - ERR - task_backup_thread - Backup failed (error -1)

I will change the temporary directory.

progier389 commented 1 week ago

The first backup has to be retried in a loop until it fails (sometimes it does not).
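A minimal sketch of such a loop, assuming the backup is wrapped in a callable that returns the task exit code. `first_failure`, `run_backup` and `max_attempts` are hypothetical names, not part of lib389 or the actual test.

```python
def first_failure(run_backup, max_attempts=20):
    """Call run_backup() until it returns a non-zero exit code.

    Returns (attempt_number, exit_code) for the first failure, or None if
    every attempt succeeded within max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        rc = run_backup()
        if rc != 0:
            return attempt, rc
    return None


# Usage with a fake backup that succeeds twice and then fails.
results = iter([0, 0, -1])
print(first_failure(lambda: next(results)))  # (3, -1)
```

Bounding the number of attempts keeps the test from hanging if the expected failure never occurs; the caller can then skip or fail the test explicitly when None is returned.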

progier389 commented 1 week ago

Fixed @droideck's remarks.