fcorbelli / zpaqfranz

Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
MIT License

Ctrl+C Error on Synology #121

Closed: sheckandar closed this 2 months ago

sheckandar commented 2 months ago

During the file copy stage, if Ctrl+C is pressed, zpaqfranz prints an error and on some occasions remains in memory and has to be killed with the killall -9 zpaqfranz command.

./zpaqfranz a "/volume1/backups/Information Technology-???" "/volume1/snapshots/Information Technology" -not "*~$*" -not "*._DAV*" -not "*._DS_Store" -nosynology -chunk 30G

Error: free(): double free detected in tcache 2

fcorbelli commented 2 months ago

Please try with -debug3 (and post the last lines of the output)

sheckandar commented 2 months ago
10/08/2024 18:18:03 38348: -pre fp 15.676.560
10/08/2024 18:18:03 37896: write @   14.614.528 on fp   15.676.560 (15676560) size 65.536
10/08/2024 18:18:03 38349: -post
10/08/2024 18:18:03 38343: flush k4bis
10/08/2024 18:18:03 38068: !!!!!!!!!!!!!!!!!!!!!! unecrypted on [ENCRYPT @ 288.625.339.468] n
10/08/2024 18:18:03 38343: flush k5
10/08/2024 18:18:03 38095: MMMMMMMMMMMMMMMMMMMMMMMMMMM off 288.610.724.940 writtensofar 14.614.528 onlast 0|
10/08/2024 18:18:03 38343: flush k6
10/08/2024 18:18:03 38343: flush k7
10/08/2024 18:18:03 38343: flush k8
^C20%)   0.66% 01:42:13  ( 141.23 MB)->(  14.06 MB)=>(   2.07 GB)    3.44 MB/s

54760: CONTROL-C detected, try some housekeeping...
10/08/2024 18:18:03 54470: Closing file handle 15.676.560
10/08/2024 18:18:03 55687: Closing file handle first 15.676.560
free(): double free detected in tcache 2
Aborted (core dumped)
fcorbelli commented 2 months ago

I think I understand what the problem is, but I can't think of an easy way to solve it. I will think about it.

fcorbelli commented 2 months ago

60_6k.zip

This is quick-and-dirty (not really tested).

sheckandar commented 2 months ago

No, it didn't work. The behavior is exactly the same.

sheckandar commented 2 months ago

I'm running it on DSM 7.2 now. So I'm not using a statically compiled binary.

Not sure if that makes a difference.

fcorbelli commented 2 months ago

I need to check in a VM. The problem is releasing every file descriptor so that all the "broken" chunks can be deleted (otherwise the garbage is left behind). I'll try with a Linux-specific parsing of /proc
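
For illustration, a minimal sketch of the Linux-only approach described here (the function name and details are assumptions, not taken from the zpaqfranz sources): the process enumerates its own descriptors via /proc/self/fd and closes everything above stderr, so the partially written chunks can then be deleted.

#include <dirent.h>
#include <unistd.h>
#include <cstdlib>
#include <cstring>

// Close every descriptor above stdin/stdout/stderr by scanning /proc.
// Works only where /proc/self/fd exists (Linux).
static void close_open_files_linux()
{
    DIR* d = opendir("/proc/self/fd");
    if (d == nullptr)
        return;                          // /proc not available: nothing to do
    const int self_fd = dirfd(d);        // keep the directory stream itself usable
    struct dirent* e;
    while ((e = readdir(d)) != nullptr)
    {
        if (std::strcmp(e->d_name, ".") == 0 || std::strcmp(e->d_name, "..") == 0)
            continue;                    // skip the pseudo entries
        const int fd = std::atoi(e->d_name);
        if (fd > 2 && fd != self_fd)     // never touch 0, 1, 2 or our own DIR
            close(fd);
    }
    closedir(d);
}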

fcorbelli commented 2 months ago

60_6k.zip

Please try this pre-release. Use -verbose to get more info on closing.

sheckandar commented 2 months ago

Now it doesn't hang and there is no core dump.

12/08/2024 13:38:24 38343: flush k4bis
12/08/2024 13:38:24 38068: !!!!!!!!!!!!!!!!!!!!!! unecrypted on [ENCRYPT @ 157.548.544] 8f
12/08/2024 13:38:24 38343: flush k5
12/08/2024 13:38:24 38095: MMMMMMMMMMMMMMMMMMMMMMMMMMM off 44.564.480 writtensofar 113.508.352 onlast 0|
12/08/2024 13:38:24 38343: flush k6
12/08/2024 13:38:24 38343: flush k7
12/08/2024 13:38:24 20277: expected bytes != written (media full?)  65.536 vs  4.096
12/08/2024 13:38:24 38343: flush k8
12/08/2024 13:38:24 54766: Erased chunk <</voluem1/backups/Information Technology-003.zpaq>>
12/08/2024 13:38:24 55699: Fixed-size chunk: (try) deleting all of them (up to 1)
12/08/2024 13:38:24 61754: Expected file 0 deleted 0

If that output is what was expected then the issue can be closed.

sheckandar commented 2 months ago
54760: CONTROL-C detected, try some housekeeping...
do close_open_files()   ||| 30.000.000.000
Seems Linux, scanning proc
mypid is 13664 path is |/proc/13664/fd|
File to be closed (. and .. skipped) .
File to be closed (. and .. skipped) ..
File to be closed (. and .. skipped) 0
fd 0  fd_path /proc/13664/fd/0
File to be closed (. and .. skipped) 1
fd 1  fd_path /proc/13664/fd/1
File to be closed (. and .. skipped) 2
fd 2  fd_path /proc/13664/fd/2
File to be closed (. and .. skipped) 3
fd 3  fd_path /proc/13664/fd/3
Closing fd (>2) 3
File to be closed (. and .. skipped) 4
fd 4  fd_path /proc/13664/fd/4
Closing fd (>2) 4
File to be closed (. and .. skipped) 5
fd 5  fd_path /proc/13664/fd/5
Closing fd (>2) 5
done close_open_files()
Erased chunk <</voluem1/backups/Information Technology-003.zpaq>>
Fixed-size chunk: (try) deleting all of them (up to 1)
Expected file 0 deleted 0
fcorbelli commented 2 months ago

The method used is not portable, since it applies only to Linux. I am rewriting large parts (!), obviously introducing many bugs (!!), to have something (maybe) portable. The problem is quite complex given the multithreaded compression (and thus file creation) mechanism of zpaq. I expect to finish it over the weekend.

fcorbelli commented 2 months ago

60_6o.zip

This untested pre-release should operate in a portable way.

sheckandar commented 2 months ago

Still exits as expected.

CONTROL-C detected, try some housekeeping...
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
OK we get a chunk size limit of 30.000.000.000 on 2 files
g_archivefp_first handle            27.542.752
g_archivefp       handle            27.542.752
write pool before emergency close (should be 2 max 3)
write pool 00000000            27.542.752 /volume1/backups/Information Technology-001.zpaq
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Closing 00000001 handles
Closing (pool size 00000001) FP            27.542.752
Closing done, seems OK
Fixed-size chunk: (try) deleting all of them (should be 1)
Expected file 1 deleted 1

Goodbye after 10.251 seconds (00:00:10)
15/08/2024 14:36:17 54760: CONTROL-C detected, try some housekeeping...
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
15/08/2024 14:36:17 61810: OK we get a chunk size limit of 30.000.000.000 on 2 files
15/08/2024 14:36:17 61892: g_archivefp_first handle            24.601.824
15/08/2024 14:36:17 61892: g_archivefp       handle            24.601.824
15/08/2024 14:36:17 61808: write pool before emergency close (should be 2 max 3)
15/08/2024 14:36:17 61810: write pool 00000000            24.601.824 /volume1/backups/Information Technology-001.zpaq
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
15/08/2024 14:36:17 61814: Closing 00000001 handles
15/08/2024 14:36:17 61817: Closing (pool size 00000001) FP            24.601.824
15/08/2024 14:36:17 39410: ***********=============== [3] 0 EUREKONAAA [v
15/08/2024 14:36:17 39414: EXIT da eliminazione fp 24.601.824
15/08/2024 14:36:17 06706: pre closing fp 24.601.824
15/08/2024 14:36:17 06710: post closing fp 24.601.824 result 0
15/08/2024 14:36:17 36253: fp now 0
15/08/2024 14:36:17 61824: Now pool size is   00000000
15/08/2024 14:36:17 61822: Closing done, seems OK
15/08/2024 14:36:17 55699: Fixed-size chunk: (try) deleting all of them (should be 1)
15/08/2024 14:36:17 55708: Erased chunk 00000000 <<15/08/2024 14:36:17 61916: file not found to be killed /volume1/backups/Information Technology-001.zpaq
15/08/2024 14:36:17 61754: Expected file 1 deleted 1

Goodbye after 19.589 seconds (00:00:19)
sheckandar commented 2 months ago

The method used is not portable, since it applies only to Linux. I am rewriting large parts (!), obviously introducing many bugs (!!), to have something (maybe) portable. The problem is quite complex given the multithreaded compression (and thus file creation) mechanism of zpaq. I expect to finish it over the weekend.

This is just my 2 cents.

Is there the same problem on other OSes as well? I thought this problem only applied to Linux.

Why rewrite a lot of code just for one OS? Just create an IF block for Linux. If there is potential for a lot of bugs on all OSes, it would be more efficient to just add the code for that one OS.

fcorbelli commented 2 months ago

The method used for Linux cannot be duplicated on other systems. It enumerates open files by reading them directly from the operating system, which is not feasible, for example, on Windows (high privileges are needed). The previous version only gets the first and the last FP (file handles).

The current version maintains a list of files open for writing, gradually deleting (from the vector) those that are closed (which makes debugging much easier). It then closes the remaining handles, and finally it deletes the leftover files from the filesystem.
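
A minimal sketch of that "write pool" idea, with hypothetical names (this is not the actual zpaqfranz code): every file opened for writing is recorded, removed again when it is closed normally, and on Ctrl-C whatever is still in the pool gets its handle closed and its file deleted.

#include <cstdio>
#include <map>
#include <mutex>
#include <string>

static std::mutex                    g_pool_mutex;
static std::map<FILE*, std::string>  g_write_pool;    // open handle -> path

FILE* pool_fopen_write(const std::string& path)
{
    FILE* fp = std::fopen(path.c_str(), "wb");
    if (fp)
    {
        std::lock_guard<std::mutex> lk(g_pool_mutex);
        g_write_pool[fp] = path;                      // remember the open file
    }
    return fp;
}

void pool_fclose(FILE* fp)
{
    if (fp == nullptr)
        return;
    std::fclose(fp);
    std::lock_guard<std::mutex> lk(g_pool_mutex);
    g_write_pool.erase(fp);                           // closed normally: drop it
}

void pool_emergency_cleanup()                         // called on Ctrl-C
{
    std::lock_guard<std::mutex> lk(g_pool_mutex);
    for (const auto& kv : g_write_pool)
    {
        std::fclose(kv.first);                        // release the handle
        std::remove(kv.second.c_str());               // delete the leftover chunk
    }
    g_write_pool.clear();
}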

But there is a but.

Since compression is performed by multiple threads, it may happen that a file is opened for writing AFTER Ctrl-C is pressed. This "screws up" the system of maintaining an orderly list of open files. Conventional techniques, i.e., checking a global state variable within the compression-thread loop, come at no small cost (i.e., they slow things down). In fact there are N compression threads (typically one per core) and ONE write thread, so placing the overhead inside the N threads would have had a considerable impact.

So I studied a very dirty technique: making the function that normally opens files for writing hand back an unusable handle instead. Basically, if Ctrl-C has been pressed, myfopen() returns a bad handle. At that point the (write) thread goes into error (a C++ exception) and finally exits (!)
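
A hedged illustration of that trick (names and structure are assumptions, not the real zpaqfranz internals): a global flag is set on Ctrl-C, the fopen wrapper then refuses to hand out new handles, and the single write thread fails, throws, and unwinds on its own, while the N compression threads pay no per-block overhead.

#include <atomic>
#include <csignal>
#include <cstdio>
#include <stdexcept>

static std::atomic<bool> g_control_c{false};

extern "C" void on_sigint(int)
{
    g_control_c = true;               // just raise the flag, nothing else
}

FILE* myfopen(const char* path, const char* mode)
{
    if (g_control_c)
        return nullptr;               // after Ctrl-C: hand back a "bad" handle
    return std::fopen(path, mode);
}

void write_next_chunk(const char* path)
{
    FILE* fp = myfopen(path, "wb");
    if (fp == nullptr)
        throw std::runtime_error("Ctrl-C: write thread aborting");
    // ... write the chunk data ...
    std::fclose(fp);
}

int main()
{
    std::signal(SIGINT, on_sigint);   // install the Ctrl-C handler
    // ... start compression threads and the single write thread ...
}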

In short, as of today (it should be tested under all circumstances) Ctrl-C during compression should be well handled. There is no handling at all during extraction, because the number of files concurrently open for writing can be large (the performance issue comes back). Maybe in the future I will think of some workaround.

The question remains whether or not to truncate individual archive files to make a rollback (aka: Ctrl-C on a single multi-versioned .zpaq).

This is a very complicated issue and, for now, I have left it behind (i.e., the code is there, but it is commented out).

60_6p.zip

sheckandar commented 2 months ago

OK. Makes sense.

The latest version is working fine. No errors so far.