dmik opened this issue 8 years ago
Ok, that's what I feared: TDB tries to map a much bigger region than the file size (e.g. the file is 24576 bytes and it tries to mmap() 36864 bytes). So the problem mentioned in #20 doesn't look so artificial now...
I need to check the TDB code to see why it does that. I recall some fancy logic regarding file growth there. IIRC the file should be as big as the mmap attempt. Maybe this logic breaks somewhere due to some other OS/2 limitation.
Strictly speaking, allocating more than the file size doesn't make much sense, since an attempt to access the mapping far beyond EOF will cause a SIGBUS/SIGSEGV unless the file has grown that much by that time.
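A minimal standalone sketch of what that mismatch means in practice (not TDB or LIBCx code; the file name is made up and the sizes match the numbers above):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("test.tdb", O_RDWR | O_CREAT, 0644);
    ftruncate(fd, 24576);                       /* real file size */

    /* ask for more than the file contains, as TDB does */
    char *p = mmap(NULL, 36864, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    p[1000]  = 1;   /* within the file: fine */
    p[30000] = 1;   /* beyond EOF: SIGBUS/SIGSEGV unless the file grew first */

    munmap(p, 36864);
    close(fd);
    return 0;
}
```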
I disabled the mmap length limit for now (so that any value works, even if it's significantly larger than the file size). Given that the application should not write there anyway, it looks like an acceptable solution (at least for this testing).
And now it fails with this:
tdb_transaction_commit: write failed during commit
tdb_transaction_recover: recovered 12288 byte database
tdb_transaction_commit: write failed
tdb_transaction_commit failed: Not enough memory
Logging shows that this happens right after an mmap bigger than the file is created. It must be some race condition (most likely between resizing the file and creating a new mmap of the new size). The debug version of LIBCx works most of the time; the release version fails often.
Another observation is that this course of action also shows a munmap call that unmaps an area matching the real file size but smaller than the created mmap. I don't know yet if it's a deliberate attempt to release only part of the mmap (which we don't support now, see #22) or just another consequence of the same race. I tend toward the latter.
I know why it fails now. There is normal I/O in TDB that expands the underlying file with ftruncate (via the tdb_expand_file method) each time more space is needed. The underlying file is unmapped with munmap before expansion and then mapped again with mmap so that its new size is picked up. In this code path everything works smoothly.
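Roughly, the non-transaction expansion path boils down to the following sequence (a simplified sketch based on the description above, not the actual TDB source):

```c
#include <sys/mman.h>
#include <unistd.h>

/* simplified sketch: unmap, grow the file, map again with the new size */
static int expand_mapped_file(int fd, void **map_ptr,
                              size_t old_size, size_t new_size)
{
    if (*map_ptr != NULL)
        munmap(*map_ptr, old_size);         /* drop the old mapping */

    if (ftruncate(fd, new_size) != 0)       /* really grow the file */
        return -1;

    *map_ptr = mmap(NULL, new_size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);     /* map with the matching size */
    return *map_ptr == MAP_FAILED ? -1 : 0;
}
```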
Besides normal I/O there is transaction-based I/O. The transaction implementation overrides some I/O methods, in particular tdb_expand_file, which is responsible for file expansion. The overridden implementation doesn't actually expand the underlying file to the requested size (it just manipulates internal in-memory transaction structures). However, mmap() is still called with the new size after such "expansion", and this is where we get a mismatch between the actual file size and the requested region size. Since I disabled this limit check for now, the code goes on and this mmap call succeeds (note that the actually allocated memory object matches the file size, i.e. it is smaller than requested by mmap; the assumption here, I repeat, is that the area beyond EOF won't be used by a properly written application as it's a guaranteed crash).
Then the transaction code restores the recorded size of the new memory region (which is greater than the real file) back to the value it had before this virtual expansion (which is the real file's size). Then, when a subsequent attempt to munmap is made, this attempt is ignored by LIBCx because the size of the mapped region doesn't match the size passed to munmap (see #22 again).
At some point later they expand the real file to the new value (while, as they think, it's unmapped) and then try to mmap the file again, this time with the right size value that matches the file size. But since the old mapping is still there (remember that LIBCx ignored the munmap call), the underlying smaller memory object is used instead of creating a new one. The code then thinks that since it actually expanded the file to the proper size (and it in fact did), it can access the new area beyond the old EOF. However, since our real memory object still has the old, smaller size, such access leads to all sorts of side effects, including crashes and data corruption.
On Linux the above scenario works because:

- munmap there can partially unmap mapped regions (by creating a split); see the sketch below.
- mmap there will remap previously mapped regions that it intersects (by creating a join).

This functionality is not yet implemented on OS/2, mostly because you cannot move, shrink or expand already allocated memory regions (at least not at the user-space level). It can still be implemented to some extent, though. In this particular case, for example, the partial munmap actually covers the whole memory object used for the underlying file, so we could release it and then allocate a new region of a bigger size to match the new file size. Something tells me there may be more cases like that, so we should perhaps implement it.
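For reference, this is the kind of partial unmap Linux accepts and OS/2 has to emulate (a standalone sketch assuming fd refers to a file of at least four pages):

```c
#include <sys/mman.h>
#include <unistd.h>

void partial_unmap_example(int fd)
{
    size_t pg = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, 4 * pg, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return;

    /* unmap only the first page: on Linux the kernel splits the region
     * and the remaining three pages stay valid */
    munmap(p, pg);

    p[pg] = 42;                 /* still points into the surviving part */

    munmap(p + pg, 3 * pg);     /* release the rest */
}
```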
But still, strictly speaking, the original TDB code is not correct. If they override file expansion (to make it virtual) they should override the unmap/map calls as well to make them virtual too (as they don't actually need a bigger memory region at that time). However, overriding the map and unmap calls isn't provided by the implementation, hence the dirtiness in the code.
I actually made the above scenario work without much effort, and tdbtorture now works much, much better. But there are still some other odd cases. In particular, sometimes the transaction code still tries to unmap the region using the old, smaller size, even after it has already unmapped that smaller mapping before and successfully created a bigger mapping later (i.e. just the next step after what I talk about in the previous comment). Since this unmap request doesn't even match the size of the file region, my quick solution doesn't work.
The transaction code's constant attempts to unmap wrongly sized regions clearly look like a TDB design flaw to me, but again, given that it works on other platforms, we have to support it too to be on the safe side...
I implemented partial/multiple munmap support; however, I still see some oddities in tdbtorture, but this time not in the debug build. I need to check if the right DLLs are used etc.
Ok, got some logs. From what I see, it sometimes fails because msync fails with ENOMEM. That's because of #27. Need to fix that, it seems.
Another problem is something else. Consider this:
If there is another process that has the same file mapped between steps 4 and 5, the file map object capable of holding 180 bytes of file data remains in use and is not freed. As a result, the mapping created at step 6 will have to use the old 180-byte file map object even though the file is really 8192 bytes. Of course this is bound to cause trouble, as this time the application is correct in thinking that it can successfully access the file beyond 180 bytes.
As I already said, the TDB library tries to re-map the TDB file each time it expands it. And it generally works well (according to the code) until the transaction-driven overrides for file access kick in. Then it gets really messy and it seems the process of expanding the file size and re-mapping it goes out of control.
My only guess as to why it works on other platforms is that there mmap reserves real address space for the whole requested length even if it's far beyond EOF. So as long as the file has been properly expanded by the time memory beyond the old EOF is actually accessed, no crash occurs. Unfortunately, we can't implement it this way on OS/2 because we went with file map memory objects (limited to the file size at the time of their creation) shared between different memory mappings. And of course we can't change the memory object to match the new file size (even when the file is really expanded) while there are still uses of it in other processes (as in the tight tdbtorture race).
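This is the behaviour those platforms provide, sketched as a standalone example (path and sizes are made up):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int map_then_grow(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    ftruncate(fd, 4096);                        /* small file */

    char *p = mmap(NULL, 65536, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);          /* map far past EOF */
    if (p == MAP_FAILED)
        return -1;

    /* p[8192] would fault here, but... */
    ftruncate(fd, 65536);                       /* ...grow the file first */
    p[8192] = 1;                                /* ...and now this is legal */

    munmap(p, 65536);
    close(fd);
    return 0;
}
```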
Of course, the improper TDB transaction code can be fixed to avoid this mess, and this is what I will have to do if I don't come up with some other idea.
I have one idea. We may allocate new file maps for bigger file sizes and link them into a list in the main file map struct. Then we may use the thread that already does file sync to also sync data between the different memory objects. This is rather stupid, as it clearly should be done with page descriptor tables, but since we don't have access to these tables from user land without bypassing the internal kernel structures, there is no other option.
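A very rough sketch of that idea with invented names (the real LIBCx structures look different):

```c
#include <stddef.h>

/* one memory object per file size seen; all hang off the main file map */
struct filemap
{
    void           *mem;        /* memory object backing this mapping */
    size_t          size;       /* file size at the time it was created */
    struct filemap *next;       /* older/smaller maps of the same file */
};

struct filedesc
{
    int             fd;         /* the underlying file */
    struct filemap *maps;       /* head: the largest/most recent map */
    /* the existing flush thread would walk 'maps', write dirty pages to
     * the file and mirror them into the other memory objects in the list */
};
```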
I implemented the above idea and it seems to work very well with the Samba test case (which I changed to have two mappings of different sizes to trigger the new code path, see above) and with other test cases as well. Now I need to thoroughly test it with tdbtorture (and also perform some other slight adjustments for edge cases I came across while doing that).
I committed the new code but it still gives some rare problems with tdbtorture when it's used in multi-process mode. Reading the logs gave me an idea that this might be related to file expansion. In particular, I see cases where the TDB file is expanded by one process via ftruncate after another process has changed some bytes but before these bytes got flushed out to the file. The first process may rely on the fact that doing munmap + ftruncate + mmap must flush all pending writes and bring it the latest copy of the data, but that's apparently not the case. And this may be the reason for various "bad magic" and other similar errors later.
I couldn't find any evidence that ftruncate must force a flush, but I found this in the TDB sources (io.c):
/* expand a file. we prefer to use ftruncate, as that is what posix
says to use for mmap expansion */
and this in POSIX (http://pubs.opengroup.org/onlinepubs/9699919799/functions/ftruncate.html):
If the effect of ftruncate() is to increase the size of a memory object, it is unspecified whether the contents of any mapped pages between the old end-of-file and the new are flushed to the underlying object.
However, we already saw that Samba relies on something not guaranteed by POSIX. This may be the case here as well. I will try to intercept ftruncate and force a flush in all involved processes.
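Conceptually, the interception would look something like this (a sketch only; struct mapping and mappings_for_fd are hypothetical stand-ins, not the real LIBCx bookkeeping):

```c
#include <sys/mman.h>
#include <unistd.h>

/* hypothetical bookkeeping; the real LIBCx structures are different */
struct mapping { void *start; size_t len; struct mapping *next; };
extern struct mapping *mappings_for_fd(int fd);   /* assumed lookup helper */

/* flush every known mapping of the file before it is resized, so pending
 * mapped writes reach the file (and the other processes) first */
int ftruncate_forcing_flush(int fd, off_t length)
{
    for (struct mapping *m = mappings_for_fd(fd); m != NULL; m = m->next)
        msync(m->start, m->len, MS_SYNC);
    return ftruncate(fd, length);
}
```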
I think I know why it doesn't work properly. Consider this: process A maps the file and changes some bytes in its memory object, but its flush thread hasn't kicked in yet (because process B is too fast), so it hasn't had any chance to sync its changes. Process B, which uses its own memory object for the same file, then makes its own changes and calls msync, and this forces both writing its dirty pages to the underlying file and copying them to the linked memory object of process A, overwriting whatever A could possibly have changed there. The problem is kind of fundamental. Using page permissions and exceptions, we can only detect page changes, not individual byte changes. So it's really a big question how to sync these different in-memory representations of the same file object at the right time.
Of course, two processes must somehow synchronise with each other using some other means in order not to overwrite each other's bytes, and TDB does use advisory fcntl locks for that. However, in the case of regular files, all I/O operations are serialised in terms of cache usage, which means that a read that comes after a write (due to some sync primitive use) gets the actual data even if the write's cache hasn't been flushed to the real media yet. This is guaranteed by the kernel. There is no such guarantee for our memory regions because, well, they are two separate memory regions which are not linked at all from the kernel's perspective. So even if process A does its writes under the lock, it doesn't help. When process A releases the lock, process B acquires it and legally performs its own operation, but the data from A might still not be flushed to the file or to the memory object used by process B by that time.
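For context, this is the kind of advisory byte-range locking TDB relies on (a generic F_SETLKW sketch, not TDB's own locking code); it serialises the processes, but it says nothing about when one process's mapped pages become visible to the other:

```c
#include <fcntl.h>
#include <unistd.h>

/* take (F_WRLCK) or release (F_UNLCK) an advisory lock on a byte range */
int lock_region(int fd, off_t off, off_t len, short type)
{
    struct flock fl = {
        .l_type   = type,
        .l_whence = SEEK_SET,
        .l_start  = off,
        .l_len    = len,
    };
    return fcntl(fd, F_SETLKW, &fl);    /* blocks until the lock is granted */
}
```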
On other platforms it works of course because such synchronisation for memory mapped files is guaranteed by the kernel the same way it's guaranteed for regular files.
I need to think a bit if anything like this is possible on OS/2 in user land per se.
Of course, if we could access page permission flags of another process, we could invalidate the pages we change to cause a re-sync upon the next access, but again, we are not the kernel so I don't know how this can be achieved from the user land.
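For reference, page-level write detection via permissions and exceptions looks roughly like this on a POSIX-style system (a hypothetical sketch; LIBCx would use an OS/2 exception handler instead, and, as noted above, this only works for pages of the current process):

```c
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE 4096

static char *region;                 /* the tracked mapping */
static unsigned char dirty[1024];    /* one flag per page (demo size) */

static void fault_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    uintptr_t page = (uintptr_t)si->si_addr & ~(uintptr_t)(PAGE - 1);
    dirty[(page - (uintptr_t)region) / PAGE] = 1;          /* remember it */
    mprotect((void *)page, PAGE, PROT_READ | PROT_WRITE);  /* allow the write */
}

void track_writes(char *mem, size_t len)
{
    struct sigaction sa = { .sa_sigaction = fault_handler,
                            .sa_flags = SA_SIGINFO };
    region = mem;
    sigaction(SIGSEGV, &sa, NULL);
    mprotect(mem, len, PROT_READ);   /* first write to each page now faults */
}
```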
It turns out that tdbtorture from the Samba test suite still doesn't like our mmap. One issue was a forgotten DosClose which eventually led to "Too many open files", but that's fixed in 8f6dff4fb8c06ffc8bf765b13df8939ef2956296. Now tdb_mmap barfs with EOVERFLOW for some reason. Needs investigation.