Closed tieugene closed 2 years ago
Thanks for reporting this. It seems that the checksum algorithm has overflow or endian issue, though I'm not sure. Does "ppc64le" represents Power PC, 64-bit, little endian? Can I see the result of "tkrzw_build_util config" on the system?
Does "ppc64le" represents Power PC, 64-bit, little endian?
Exactly.
Can I see the result of "tkrzw_build_util config" on the system?
I have no this system in my hands. Packages are building at remote virtual hosts. So I just can insert some command in build sequence and see log. Now part of .spec file is:
%prep
%autosetup
%build
autoreconf -vif
%configure
%make_build apidoc all
for bin in \
tkrzw_build_util tkrzw_str_perf tkrzw_file_perf tkrzw_dbm_perf tkrzw_dbm_util
do
LD_LIBRARY_PATH=$PWD help2man --no-info --no-discard-stderr \
--version-string='%{version}' --output="${bin}.1" \
"./${bin}"
done
%install
%make_install
# Remove static .a file
rm -f %{buildroot}%{_libdir}/lib%{name}.a
# mans
install -d %{buildroot}%{_mandir}/man1
install -t %{buildroot}%{_mandir}/man1 -m 0644 -p tkrzw_*.1
%check
%make_build check-light
Where I have to insert tkrzw_build_util config
?
I released 1.0.14. Now, "tkrzw_build_util config" is run at the beginning of "make check" and "make check-light" so we can confirm the result in the log.
https://kojipkgs.fedoraproject.org//work/tasks/5013/76265013/build.log PS. you can notify me if/when bug will be fixed and before flushing new tag - I will test it on Fedora build farm.
Thanks. Could you check the following version, where the checksum returns 0 always. https://dbmx.net/tkrzw/pkg/tkrzw-1.0.14-nochecksum.tar.gz
If it works, the checksum is the culprit. If not, I have to investigate other parts.
On Sat, Sep 25, 2021 at 8:02 PM Eugene @.***> wrote:
https://kojipkgs.fedoraproject.org//work/tasks/5013/76265013/build.log
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/estraier/tkrzw/issues/24#issuecomment-927104713, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQGJVRA4KU7G6MW6L6XDU4TUDWT4DANCNFSM5EXKICIQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
Thanks. The problem doesn't seem so simple. I have to obtain a PowerPC environment to investigate it more.
Another notable difference between 1.0.10 and 1.0.11 is this change. https://github.com/estraier/tkrzw/commit/073bdc9d1a745766874d0516e9017dde46eaf37d
So, could you check this package where the change is reverted? https://dbmx.net/tkrzw/pkg/tkrzw-1.0.14-nommapfix.tar.gz
As I wrote "I proposed that this is because of @073bdc9 that fixed i686 bug. But reverting this commit's changes back has no effect". But I'll try
Oops, I was confused. Perhaps, the bug had already been in 1.0.10 but not found because check-light was not there.
I tried to rebuild original 1.0.13 with the patch:
diff -dur tkrzw-1.0.13.orig/tkrzw_dbm_hash.cc tkrzw-1.0.13/tkrzw_dbm_hash.cc
--- tkrzw-1.0.13.orig/tkrzw_dbm_hash.cc 2021-09-22 21:08:26.000000000 +0300
+++ tkrzw-1.0.13/tkrzw_dbm_hash.cc 2021-09-25 10:14:30.466455004 +0300
@@ -1508,8 +1508,7 @@
}
} else {
if (old_is_set) {
- eff_data_size_.fetch_add(
- static_cast<int32_t>(new_value.size()) - static_cast<int32_t>(old_value.size()));
+ eff_data_size_.fetch_add(new_value.size() - old_value.size());
} else {
num_records_.fetch_add(1);
eff_data_size_.fetch_add(key.size() + new_value.size());
diff -dur tkrzw-1.0.13.orig/tkrzw_dbm_tree.cc tkrzw-1.0.13/tkrzw_dbm_tree.cc
--- tkrzw-1.0.13.orig/tkrzw_dbm_tree.cc 2021-09-22 21:08:26.000000000 +0300
+++ tkrzw-1.0.13/tkrzw_dbm_tree.cc 2021-09-25 10:17:19.167476222 +0300
@@ -1904,8 +1904,7 @@
node->page_size +=
static_cast<int32_t>(new_rec_size) - static_cast<int32_t>(old_rec_size);
node->dirty = true;
- eff_data_size_.fetch_add(
- static_cast<int32_t>(new_value.size()) - static_cast<int32_t>(old_value_size));
+ eff_data_size_.fetch_add(new_value.size() - old_value_size);
if (static_cast<int32_t>(new_value.size()) > old_value_size) {
if (CheckLeafNodeToDivide(node)) {
reorg_ids_.Insert(std::make_pair(node->id, std::string(records.front()->GetKey())));
Oops... Your last source ('nommapfix') works: https://koji.fedoraproject.org/koji/taskinfo?taskID=76273881 If this is the solution it is not required to make new tag. Now Fedora has 0.9.16 in stable repo and 1.0.13 in testing. I can just add required patch for 1.0.14 on next build.
@estraier , if you need access to a ppc64le hw, please send me an email to sharkcz at fedoraproject dot org, I can arrange you an account on our public machine.
I'm glad if I can do some experiments on the ppc64le environment. We shouldn't be complacent without knowing the cause of the failure.
On Sun, Sep 26, 2021 at 2:45 AM Dan Horák @.***> wrote:
@estraier https://github.com/estraier , if you need access to a ppc64le hw, please send me an email to sharkcz at fedoraproject dot org, I can arrange you an account on our public machine.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/estraier/tkrzw/issues/24#issuecomment-927157931, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQGJVRHC3JBCTIRNOBZS3PDUDYDFJANCNFSM5EXKICIQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
I'm glad if I can do some experiments on the ppc64le environment.
Seems sharkcz propose you direct ssh access to virtual ppc64le host. But this requires some moves from your side (e.g. you have to get Fedoraproject's signed certificate)
Right, I am offering a remote (ssh) access to a ppc64le virtual machine. Fedora contributors can access the Fedora public machines (https://fedoraproject.org/wiki/Test_Machine_Resources_For_Package_Maintainers), but we have also a solution for upstream projects/authors, who are not Fedora contributors. All I need is an email with your public ssh key.
Thanks Dan and Eugene. My public key is the following.
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAtWY5XD4t3uYIOTZ0WUH2BGnFAdIdkVUZplUQB34CW/YOpWoeOJy/CKfXibZ+xubH7cSPnwLgg26ZMJWaxxzF1R3jXzJCTKe/LoqdNkqjSCqSiqyXLFOf2K4PdQ5MKA80VIMO+PDmtd3noOeEN/Mu667S2IYMYu5BD9XW+haiOwvwiQd8+9y4FKKBnPjZle9u5XjgKs3QVG1bDa0phWQu95q7/7QrVQp4AFGsBNDNfOltPp/zjG+tU9MSPa5jscX0YcWo4QD+TnQaJNv9wCXdZAJvy/6Bs+k/g9CpAPbmNYuBDXwrVUXNGQaPYEEg4isMCgposDHP91oo7ZkXnU8dkQ== mikio@kyoto
Please send me an email (sharkcz at fedoraproject.org), I will need to send back you the login details :-) Thanks.
Please send me an email (sharkcz at fedoraproject.org), I will need to send back you the login details :-) Thanks.
As the author published his pubkey - you can publish login detail here. Why not? PS. Another problem is in private key, but let's wait.
It's a private system, thus I won't share details here. It's outside the official Fedora infra.
Login info sent, please check your spam box too. Gmail doesn't like me sometimes :-)
I can login on it. Thanks!
"make check" and "make check-light" of 1.0.15 work well on the ppc64le system. So, the platform doesn't seem the cause of the error. Now I'm suspecting that some resource limitation on the build system might be the cause.
@tieugene could you run "ulimit -a" on the build system? I'd like to simulate it. And, is the GCC version same as "gcc version 11.2.1 20210728 (Red Hat 11.2.1-1) (GCC)"
@tieugene could you run "ulimit -a" on the build system? I'd like to simulate it. And, is the GCC version same as "gcc version 11.2.1 20210728 (Red Hat 11.2.1-1) (GCC)"
According to log (search ==== ULIMIT/ ====
):
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 97324
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 97324
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
PS. You can compare with other archs, where builds ok. Click required build and search build.log
link.
I'm still struggling to reproduce the bug.
@tieugene Could you check the build of this package on the ppc64le system? It adds "madvice" system call after "mremap" system call though it is not necessary on the POSIX specification. https://dbmx.net/tkrzw/pkg/tkrzw-1.0.16-madvise.tar.gz
@tieugene Could you check the build of this package on the ppc64le system? It adds "madvice" system call after "mremap" system call though it is not necessary on the POSIX specification. https://dbmx.net/tkrzw/pkg/tkrzw-1.0.16-madvise.tar.gz
Failed If you have an access to vhost with PPC64le (as @sharkcz promised) you can try to build the package in-place. Just:
sudo dnf install rpm-build
(if not installed)rpmbuild --rebuild tkrzw-...src.rpm
PS. I asked vhost chief to give me same access but no response yet
Thanks. The problem seems related to the behavior of mmap so I'll investigate more. Reproduction is the key for debugging so I appreciate that I can use the build system myself.
rpmbuild doesn't fail on my account on the ppc64le. So, maybe I should consider another way to reproduce it.
I did rebuilds using mock on my local machine and on the public machine. It fails locally, but builds OK on the public one. Both used F-34 buildroot, so the software build environment was (exactly) the same. Isn't there multi-threading active in the software, for example? The public machine has only 2 vCPUs.
The command which dies is the following. It uses two active threads for test tasks and one therad waiting for them. tkrzw_dbm_perf sequence --dbm poly --path casket.tkt --iter 1000 --threads 2 --params "dbm=tree,num_buckets=500"
What happens with the following command which uses 1 active thread? tkrzw_dbm_perf sequence --dbm poly --path casket.tkt --iter 2000 --threads 1 --params "dbm=tree,num_buckets=500"
still fails
...
LD_LIBRARY_PATH=.:/usr/local/lib::. ./tkrzw_dbm_perf sequence --dbm poly --path casket.tkt \
--iter 2000 --threads 1 --params "dbm=tree,num_buckets=500"
Setting: impl=poly num_iterations=2000 value_size=8 num_threads=1
.................................................. (00000100)
.................................................. (00000200)
.................................................. (00000300)
.................................................. (00000400)
.................................................. (00000500)
.................................................. (00000600)
.................................................. (00000700)
.................................................. (00000800)
.................................................. (00000900)
.................................................. (00001000)
.................................................. (00001100)
.................................................. (00001200)
.................................................. (00001300)
.................................................. (00001400)
.................................................. (00001500)
.................................................. (00001600)
.................................................. (00001700)
.................................................. (00001800)
.................................................. (00001900)
.................................................. (00002000)
Synchronizing: ... done (elapsed=0.000696)
Setting done: elapsed_time=0.010143 num_records=2000 qps=197180 mem=4160000
file_size=41984
Getting: impl=poly num_iterations=2000 value_size=8 num_threads=1
.................................................. (00000100)
.................................................. (00000200)
.................................................. (00000300)
.................................................. (00000400)
.................................................. (00000500)
.................................................. (00000600)
.................................................. (00000700)
.................................................. (00000800)
.................................................. (00000900)
.................................................. (00001000)
.................................................. (00001100)
.................................................. (00001200)
.................................................. (00001300)
.................................................. (00001400)
.................................................. (00001500)
.........................................Get failed: BROKEN_DATA_ERROR: invalid magic checksum
(00002000)
Getting done: elapsed_time=0.007697 num_records=2000 qps=259846 mem=4224000
file_size=41984
Iterating: impl=poly num_iterations=2000 value_size=8 num_threads=1
Iterating done: elapsed_time=0.000082 num_records=2000 qps=24385488 mem=4224000
file_size=41984
Removing: impl=poly num_iterations=2000 value_size=8 num_threads=1
Synchronizing: ... done (elapsed=0.000263)
Removing done: elapsed_time=0.000350 num_records=2000 qps=5714311 mem=4288000
file_size=41984
make[1]: *** [Makefile:774: check-perf-sampler] Error 1
I will look closer what the difference between the machines could be. It's weird ...
Thanks. Reproducing the problem with one thread is a good news because the deterministic behavior is easy to analyze.
Thanks. Reproducing the problem with one thread is a good news because the deterministic behavior is easy to analyze.
By the way - one from my home host has 1 CPU with 1 (one) core (P4 3.0GHz).
@estraier, I think I have a public environment where I can reproduce the failure using @tieugene 's rpm, expect a new email :-)
Thanks Dan, Now I can reproduce it. I think that the bug is related to the mmap or mremap system call. When the error happens, the region after the offset where mremap was called for the first time has been filled with zero code. I'll investigate further to know why this happens.
On Fri, Oct 8, 2021 at 6:31 PM Dan Horák @.***> wrote:
@estraier https://github.com/estraier, I think I have a public environment where I can reproduce the failure using @tieugene https://github.com/tieugene 's rpm, expect a new email :-)
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/estraier/tkrzw/issues/24#issuecomment-938494763, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQGJVRGDLAOUGZCNT45XYADUF23A5ANCNFSM5EXKICIQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
I found the cause of the error. On the PowerPC system, memory-mapped region managed by "mmap" beyond the file size metadata managed by "ftruncate" is filled with zero. On other Linux systems, data in such regin is retained as long as the file size metadata is set to the size of mapped region before calling "msync" or "munmap". Let me illustrate this by pseudo-code.
// #1: Initialize mmap(size=10) ftruncate(size=10)
// #2: write something write(offset=0, size=5, data="01234")
// #3 truncate mmap(size=10) ftruncate(size=5) // This is smaller than the map size
// #4 write something // This is beyond the file size metadata set by fturncate write(offset=5, data="56789")
// #5 synchronize ftruncate(size=10) munmap(size=10)
The problem is, at the #4 step, the data is written beyond the size 5 set by ftruncte. On many system, the data writtin in the map is retained and finally synchronized to the file at #5 step because the metadata is set 10 before calling munmap.
Though I'm not sure, the difference of the behavior seems related to Linux version. Anyway, now that I know the cause, I can fix it.
// #3 truncate mmap(size=10) ftruncate(size=5) // This is smaller than the map size // #4 write something write(offset=5, data="56789") // #5 synchronize ftruncate(size=10) munmap(size=10) The problem is, at the #4 step, the data is written beyond the size 5 set by ftruncte. On many system, the data writtin in the map is retained and finally synchronized to the file at #5 step because the metadata is set 10 before calling munmap.
I'm not familiar in system programming at all but as for me if you try to write data beyond reserved array - nobody can warranty nothing. Is not it?
Basically, yes. The region is actually allocaed by the mmap call. However, some data is beyond the file size metadata. What happens in that case is not specified by the specification AFAIK. I've been assuming that data beyond the metadata is retained and written back to the file as long as the metadata is set before calling "munmap". It is the case in all systems where I've tested my code. But, it's not the case on the PPC system at least.
Anyway, the practical solution is to set the file size metadata aligned to the map size. I'll modify the code so.
I released 1.0.17. make check and make check-light passes on the both PPC systems. I also tested it on Ubuntu 20.10 (x64), MacOS 11.3 (M1) and Windows 11. Thanks Eugene and Dan. This fix couldn't be done without your help.
oops... Issue autoclosing not works. PR only.
As the latest version of Tkrzw (C++) abolishes the memory locking feature, test cases of olf packages of language bindings fail. Thus, I released new versions for them. As for Python, 0.1.24 is the new version.
Up to 1.0.10 everything was ok, but starting from 1.0.11 build ok for all architectures excepting ppc64le (PowerPC). I proposed that this is because of 073bdc9d1a745766874d0516e9017dde46eaf37d that fixed i686 bug. But reverting this commit's changes back has no effect - ppc64le builds (more precisely
make check
) still fails: https://koji.fedoraproject.org/koji/taskinfo?taskID=76258864 (By the way i686 build ok without 073bdc9d1a745766874d0516e9017dde46eaf37d). Tail of build (make check-light) log (https://kojipkgs.fedoraproject.org//work/tasks/8959/76258959/build.log):Something wrong starting from 1.0.11