estraier / tkrzw

a set of implementations of DBM
Apache License 2.0
164 stars 20 forks source link

"too small file" error when database was not closed #19

Closed onderweg closed 3 years ago

onderweg commented 3 years ago

I'm using the C bridge, in combination with HashDBM, version 0.9.38.

If tkrzw_dbm_close was not called for some reason, and program is terminated, next time the database is opened with tkrzw_dbm_open, the result is NULL, and tkrzw_last_status_message returns "too small file".

The only way I found to restore the database is with

$ tkrzw_dbm_util rebuild <file>

($ tkrzw_dbm_util restore <file> does not work for that case)

I know that tkrzw_dbm_close always should be called. But in practice, it can always happen that it was not called. For example when an unexpected system reboot or program fault occurs.

Is there any way to prevent issues while opening a previously unclosed database? For example an auto restore in tkrzw_dbm_open?

estraier commented 3 years ago

Hi Gerwert,

By design, auto restore should happen via the C bridge too. If not, it seems a bug. I'll investigate related logic. I would be happy if you give me a small code to reproduce the issue.

Calling tkrzw_dbm_synchronize sometimes is a good practice to prevent/mitigate data corruption.

On Wed, Jul 7, 2021 at 3:35 AM Gerwert @.***> wrote:

I'm using the C bridge, in combination with HashDBM.

If tkrzw_dbm_close was not called for some reason, next time the database opened with tkrzw_dbm_open, the result is NULL, and tkrzw_last_status_message returns "too small file".

The only way I found to restore database is with $ tkrzw_dbm_util rebuild

. I know that tkrzw_dbm_close always *should* be called. But in practice, it can always happen that it was not called. For example when an unexpected system reboot or program fault occurs. Is there any way to prevent issues while opening a previously unclosed database? For example an auto restore in tkrzw_dbm_open? ā€” You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub , or unsubscribe .
estraier commented 3 years ago

I found that the restore feature fails if the file size is less than 4096. An empty database file with the smallest setting must be 4096 bytes or more. It means that your process must have died while creating the database file. I'm interested in why this can happen. Could you tell me the parameter given to the tkrzw_dbm_open method?

Anyway, rebuilding the too-small file means making an empty database file. I should consider whether the restore feature should do so silently or not.

onderweg commented 3 years ago

Thanks for the quick investigation!

It is reproducable by simply using the example C source code in the docs, and commenting out tkrzw_dbm_close, and with the open call changed to:

  TkrzwDBM *dbm = tkrzw_dbm_open(
      "casket.tkh", true, "truncate=false,num_buckets=100");

(different from the example is truncate=false)

After running the program for a second time, opening the database fails, because it was not closed the first time.

I was just testing (not a production system), so indeed in the case that it happend to me, the DB size is small (only a few items).

I will try to see what happens if tkrzw_dbm_synchronize is called after records are inserted.

estraier commented 3 years ago

Thanks for the reply. I tried running the code with truncate=false and without calling tkrzw_dbm_close. On my environment, an 1MB file is left. It's by design. The initial data size is 1MB and it is truncated into the actual used size when we call the close method. And, auto restoration is done when we open the 1MB file for the second time.

I'm not sure why the broken file becomes so small on your environment. What's your OS and file system? Is there any quota for data size?

On Wed, Jul 7, 2021 at 2:16 PM Gerwert @.***> wrote:

Thanks for the quick investigation!

It is reproducable by simply using the example C source code in the docs https://dbmx.net/tkrzw/api/tkrzw__langc_8h.html, and commenting out tkrzw_dbm_close, with the open call changed to:

TkrzwDBM *dbm = tkrzw_dbm_open( "casket.tkh", true, "truncate=false,num_buckets=100");

(different from the example is truncate=false)

After running the program for a second time, opening fails.

I was just testing (not a production system), so indeed in the case that it happend to me, the DB size is small (only a few items).

I will try to see what happens if tkrzw_dbm_synchronize after records are inserted.

ā€” You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/estraier/tkrzw/issues/19#issuecomment-875287987, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQGJVRBMJBLHJMDK4IRKQKLTWPPJXANCNFSM475DQ2RA .

onderweg commented 3 years ago

Ah, I see my initial report might not be entirely accurate.

So it seems like the 3rd time, the file is shrunken by the tkrzw_dbm_open call, and can't be opened anymore.

Platform is MacOS Big Sur.

estraier commented 3 years ago

On my environment, the third time also succeeds. Could you tell me the version of Tkrzw?

onderweg commented 3 years ago

I tried on two systems today:

Side note (but that's a separate issue): building on Mac M1 only works, if I remove -march=native manually from the g++ command in the MakeFile. Otherwise, the error "the clang compiler does not support -march-native is shown".

onderweg commented 3 years ago

Upgraded to Library version: 1.28.0, package version: 0.9.39. Problem solved šŸ˜Š

I was on 0.9.38, because that was last version available here. But I now used the source in this git repo to build the library. Apparently that's version 0.9.39. db too small issue not reproducable with this version šŸŽ‰

estraier commented 3 years ago

I updated the restore logic just now. By the update, the restore operation is omitted if validation check passes. However, failure of the restoration operation is another issue. I cannot understand why the file shrank on your environment. I'm guessing your M1 mac hits disk-full or disk quota or it has a very old version of the library installed in the past.

onderweg commented 3 years ago

There is plenty of space left on the disk. Also, there was no older version of the library installed in the past. Can removing march=native (see https://github.com/estraier/tkrzw/issues/19#issuecomment-875594499) have anything to do with it?

estraier commented 3 years ago

They say that the clang (LLVM) compiler used behind GCC doesn't support -march=native. https://stackoverflow.com/questions/65966969/why-does-march-native-not-work-on-apple-m1 Could you use -mcpu=native instead?

Anyway, removing -march=native doesn't seem related to the shrinking issue. Could you try 0.9.37 on your M1 mac?

onderweg commented 3 years ago

Sure. Tried 0.9.37 on M1 Mac:

So maybe something related with the M1 architecture?

estraier commented 3 years ago

Thanks. By design, the behavior of the library shouldn't be affected by the architecture. As the M1 processor is little endian, issues related to the byte order shouldn't happen. Intel Mac also is little endian. Could you try 0.9.38 on your Intel Mac? If it works, I have to suspect M1 as a cause of the error.

onderweg commented 3 years ago

Yes, works. Version 0.9.38 on Intel Mac does not have that error.

estraier commented 3 years ago

Thanks for checking it. I'll investigate the reason why M1 behaves differently.

BTW, I commited a change by which -mcpu=native is used instead of -march=native on Mac and M1.

onderweg commented 3 years ago

Thanks for the compiler switch change!

estraier commented 3 years ago

I noticed that the byte order of the M1 processor is big endian. So, maybe the difference comes from the byte order. To confirm it, could you tell me the result of the following command on your M1 mac?

tkrzw_build_util config

estraier commented 3 years ago

Ah, I was wrong. M1 and other ARM64 processors are bi-endian and Apple uses the little endian mode. https://developer.apple.com/documentation/apple-silicon/porting-your-macos-apps-to-apple-silicon

Hmm, I have to investigate more. I wish I had M1 Macbook.

kawanet commented 3 years ago

reproduced.

$ git log | head -1
commit 39c0a6dcc538a0f949f0770a235520bd9d9ee512

$ ./configure # without any options
(snip)

$ make all check 
(snip)
DYLD_LIBRARY_PATH=.:/usr/local/lib:  ./tkrzw_dbm_util rebuild --dbm hash casket.tkh
DYLD_LIBRARY_PATH=.:/usr/local/lib:  ./tkrzw_dbm_util inspect --dbm hash --validate casket.tkh
Inspection:
  class=HashDBM
  healthy=true
  auto_restored=false
  path=casket.tkh
  cyclic_magic=5
  pkg_major_version=0
  pkg_minor_version=9
  static_flags=5
  offset_width=4
  align_pow=3
  closure_flags=1
  num_buckets=9
  num_records=4
  eff_data_size=36
  file_size=4192
  mod_time=1625752148.825058
  db_type=0
  max_file_size=34359738368
  record_base=4096
  update_mode=in-place
  record_crc_mode=crc-8
  record_comp_mode=none
Actual File Size: 4192
Number of Records: 4
Should be Rebuilt: false
Validating records: ... done (elapsed=0.000009)
DYLD_LIBRARY_PATH=.:/usr/local/lib:  ./tkrzw_dbm_util restore --dbm hash casket.tkh casket-new.tkh
RestoreDatabase failed: BROKEN_DATA_ERROR: too small file
make[1]: *** [check-hashdbm-util] Error 1
make: *** [check] Error 2

$ tkrzw_build_util config

PACKAGE_VERSION: 0.9.21
LIBRARY_VERSION: 1.11.0
OS_NAME: Mac OS X
IS_BIG_ENDIAN: 0
PAGE_SIZE: 16384
TYPES: void*=8 short=2 int=4 long=8 long_long=8 size_t=8 float=4 double=8 long_double=8
prefix: /usr/local
includedir: /usr/local/include
libdir: /usr/local/lib
bindir: /usr/local/bin
libexecdir: /usr/local/libexec
appinc: -I/usr/local/include
applibs: -L/usr/local/lib -ltkrzw -lstdc++ -lpthread -lm -lc 

$ sysctl machdep.cpu.brand_string
machdep.cpu.brand_string: Apple M1

$ xcodebuild -version
Xcode 12.5
Build version 12E262

$ sw_vers
ProductName:    macOS
ProductVersion: 11.3
BuildVersion:   20E232
onderweg commented 3 years ago

I noticed that the byte order of the M1 processor is big endian. So, maybe the difference comes from the byte order. To confirm it, could you tell me the result of the following command on your M1 mac?

tkrzw_build_util config

OS_NAME: Mac OS X
IS_BIG_ENDIAN: 0
PAGE_SIZE: 16384
TYPES: void*=8 short=2 int=4 long=8 long_long=8 size_t=8 float=4 double=8 long_double=8
COMPRESSORS: zlib
PROCESS_ID: 38716
prefix: /usr/local
includedir: /usr/local/include
libdir: /usr/local/lib
bindir: /usr/local/bin
libexecdir: /usr/local/libexec
appinc: -I/usr/local/include
applibs: -L/usr/local/lib -ltkrzw -lz -lstdc++ -lpthread -lm -lc
estraier commented 3 years ago

I released 0.9.41 where this issue is HIDDEN (not fixed) by another logic to omit the restore operation. To reproduce this issue, the following patch is useful.

diff --git a/example/langc_ex1.c b/example/langc_ex1.c
index b9ceb6d..0b01544 100644
--- a/example/langc_ex1.c
+++ b/example/langc_ex1.c
@@ -18,7 +18,7 @@
 int main(int argc, char** argv) {
   // Opens the database file.
   TkrzwDBM* dbm = tkrzw_dbm_open(
-      "casket.tkh", true, "truncate=true,num_buckets=100");
+      "casket.tkh", true, "truncate=false,num_buckets=100");

   // Stores records.
   tkrzw_dbm_set(dbm, "foo", -1, "hop", -1, true);
@@ -48,7 +48,7 @@ int main(int argc, char** argv) {
   tkrzw_dbm_iter_free(iter);

   // Closes the database file.
-  tkrzw_dbm_close(dbm);
+  // tkrzw_dbm_close(dbm);

   return 0;
 }
diff --git a/tkrzw_dbm_hash.cc b/tkrzw_dbm_hash.cc
index 64ddef5..46671c0 100644
--- a/tkrzw_dbm_hash.cc
+++ b/tkrzw_dbm_hash.cc
@@ -276,7 +276,7 @@ Status HashDBMImpl::Open(const std::string& path, bool writable,
                file_size_ == act_file_size) {
       healthy_ = true;
       closure_flags_ |= CLOSURE_FLAG_CLOSE;
-    } else if (tuning_params.restore_mode == HashDBM::RESTORE_DEFAULT &&
+    } else if (false && tuning_params.restore_mode == HashDBM::RESTORE_DEFAULT &&
                (static_flags_ & STATIC_FLAG_UPDATE_IN_PLACE) &&
                file_size_ <= act_file_size &&
                ValidateRecordsImpl(record_base_, act_file_size,

On my Intel Macbook, this issue is not reproduced even if I applied the patch. I'll investigate this issue further when I obtained M1 Macbook.

kawanet commented 3 years ago

make check failure on M1 is not fixed yet at c00eaf68b šŸ„ŗ

DYLD_LIBRARY_PATH=.:/usr/local/lib:  ./tkrzw_dbm_util restore --dbm hash casket.tkh casket-new.tkh
RestoreDatabase failed: BROKEN_DATA_ERROR: too small file
make[1]: *** [check-hashdbm-util] Error 1
make: *** [check] Error 2
estraier commented 3 years ago

I released 0.9.46 which solves this issue. The reason was that the page size got by sysconf(_SC_PAGESIZE) is 16384 on M1 Mac only. (Intel Mac has 4096). With the old version, files whose size is less than the page size were considered broken. I fixed that logic and modified related code in order not to depend on the page size unnecessarily.

onderweg commented 3 years ago

Good to hear you found the issue. Thanks! Iā€™m not in the position to test this week, but will afterwards.

kawanet commented 3 years ago

confirmed āœ… all green results with make check on my M1 MBP! šŸ˜ƒ

onderweg commented 3 years ago

Mmm, tried to test with current master branch, package version 0.9.49.

Result: make, make check for tkrzw is working fine. However after I compile any program with -ltkrzw now, even if it does not contain any calls to tkrzw, it is instantly killed:

zsh: killed 

(runningn same empty program without -ltkrzw works fine)

Crash report in Console App says:

Exception Type:        EXC_BAD_ACCESS (Code Signature Invalid)
Exception Codes:       0x0000000000000032, 0x00000001044c0000
Exception Note:        EXC_CORPSE_NOTIFY

Termination Reason:    Namespace CODESIGNING, Code 0x2
onderweg commented 3 years ago

Tested now with 0.9.46 release. That works fine. So initial issue solved!

(the above mentioned issue might still be interesting to look at though)