damus-io / damus

iOS nostr client
GNU General Public License v3.0
1.99k stars 288 forks source link

Reposts occasionally show unformatted #2323

Closed jb55 closed 2 weeks ago

jb55 commented 3 weeks ago

Looks like this is still an issue

danieldaquino commented 3 weeks ago

I will start from the premise that most lack of formatting on push notifications have the same root cause across all types of notifications (not just repost notifications)

danieldaquino commented 3 weeks ago

I am seeing two patterns of Ndb initialization error, usually when lmdb loads and memory maps the db file

I will experiment with setting a lower memory map size

danieldaquino commented 3 weeks ago

Now I am seeing some instances where it succeeds, and the mdb_env_open failed, error 12 error log appears 3 times before succeeding, which indicates it could allocate about 4GB at that time 🤔

danieldaquino commented 3 weeks ago

I can't repro the memory error any longer, I will investigate the other one

jb55 commented 3 weeks ago

we can try reducing the number of ingest threads to 1 or 0 to see if that fixes memory alloc issues, these are the only things that really allocate memory (for ingest queues).

jb55 commented 3 weeks ago

oh wait 12 is an lmdb message, so maybe extensions can't do large virtual memory mapping? hmm.

on damus we have a special large virtual memory entitlement, should make sure that we have that on the extension as well.

jb55 commented 3 weeks ago

I just read this in the LMDB docs:

int  mdb_env_create(MDB_env **env);

    /** @brief Open an environment handle.
     *
     * If this function fails, #mdb_env_close() must be called to discard the #MDB_env handle.

not sure if this is related, but we don't actually do that! So maybe subsequent smaller mapsize calls are failing when they don't need to (we keep reducing mapsize in a loop until it succeeds).

our default mapsize is 32GiB, which may be causing issues in the extension? We probably need it for larger databases though, so hopefully we can just turn on the same entitlement we use on the main app.

danieldaquino commented 3 weeks ago

our default mapsize is 32GiB, which may be causing issues in the extension? We probably need it for larger databases though, so hopefully we can just turn on the same entitlement we use on the main app.

It tries to reduce mapsize. In some of my experiments, lmdb succeeded at mapsize ~= 4GB. That is larger than the minimum 700MB mapsize specified in the current code. In those experiments, the notification succeeded formatting. (This is when I stopped being able to repro formatting issues with the app closed).

danieldaquino commented 3 weeks ago

on damus we have a special large virtual memory entitlement, should make sure that we have that on the extension as well.

Good call, however it does not seem to be available on the extension target. I see that capability option is available on the main app target, but it isn't there on the extension

danieldaquino commented 3 weeks ago

I tried to closely debug what happens when the notification extension tries to initialize lmdb while the app is running (i.e. In this comment I am talking about the other source of formatting issues, the "error 1")

Here is what seems to happen (to the best of knowledge) during lmdb initialization (Most interesting things happen at mdb_env_setup_locks(MDB_env *env, MDB_name *fname, int mode, int *excl)):

  1. Open the ndb lockfile successfully
  2. Tries to acquire some file lock (syscall fcntl with command F_SETLKW) on ndb lockfile
  3. Gets a shared (not exclusive) lock on ndb lockfile successfully
  4. Tries to memory-map ndb lockfile with a NULL pointer (maybe it's just trying to do a locking test, not really trying to access that memory)
  5. Memory-map syscall seems to fail with EAGAIN (Resource temporarily unavailable)
  6. Tries to initialize and open a named semaphore (sem_open) with name at env->me_txns->mti_rmname, appears to be a read semaphore
  7. This syscall fails with "EPERM Operation not permitted."
  8. lmdb cancels init because of syscall error

@jb55, you mentioned (and I also read in some places) that lmdb allows a single-writer but multiple readers, and that you have seen notedeck functioning with multiple processes accessing NostrDB without some mutex or concurrency coordinator, however I have seen some evidence that seems to indicate that lmdb uses different concurrency coordination mechanisms on Darwin/XNU (i.e. macOS + iOS), FreeBSD, and Linux.

This file mdb.c:

#if defined(__FreeBSD__) && defined(__FreeBSD_version) && __FreeBSD_version >= 1100110
# define MDB_USE_POSIX_MUTEX    1
# define MDB_USE_ROBUST 1
#elif defined(__APPLE__) || defined (BSD) || defined(__FreeBSD_kernel__)
# define MDB_USE_POSIX_SEM  1
# define MDB_FDATASYNC      fsync
#elif defined(ANDROID)
# define MDB_FDATASYNC      fsync
#endif

#ifndef _WIN32
#include <pthread.h>
#include <signal.h>
#ifdef MDB_USE_POSIX_SEM
# define MDB_USE_HASH       1
#include <semaphore.h>
#else
#define MDB_USE_POSIX_MUTEX 1
#endif
#endif

Indicates that Darwin/XNU/Apple platforms use MDB_USE_POSIX_SEM as a coordination mechanism, whereas Linux (presumably?) uses MDB_USE_POSIX_MUTEX. Depending on these constants, the lmdb init path seems to be very different.

Furthermore, I found this info on http://www.lmdb.tech/doc/index.html (Note: Not sure if this is official documentation):

On BSD systems or others configured with MDB_USE_POSIX_SEM, startup can fail due to semaphores owned by another userid.

Fix: Open and close the database as the user which owns the semaphores (likely last user) or as root, while no other process is using the database.

This message feels a bit ambiguous — Is the problem the user the process is running as, the fact that there are two processes accessing the database, or both? However, if I am interpreting this correctly, then it implies that two processes cannot access the same lmdb database at the same time on these platforms.

I am thinking of alternatives to this specific issue of NostrDB access while the app is open, and I think a good solution might be to implement some form of IPC between the app and the extension using stuff like Darwin Notifications (https://www.atomicbird.com/blog/sharing-with-app-extensions/), or some web API exposed on localhost? Perhaps the logic could be as follows:

  1. Attempt to connect to main app via some form of IPC to get NostrDB info from the main app (better, saves up resources)
  2. If that fails (e.g. Damus is backgrounded), attempt to initialize NostrDB on the extension

What do you think?

danieldaquino commented 3 weeks ago

I think a good solution might be to implement some form of IPC between the app and the extension

Alternatively we can move some of the formatting logic to the server-side as well

jb55 commented 3 weeks ago

interesting investigation! this lead me to this comment by @hyc

https://github.com/agisboye/SwiftLMDB/issues/23#issuecomment-1518677356

IPC and POSIX Semaphores and Shared Memory

Normally, sandboxed apps cannot use Mach IPC, POSIX semaphores and shared memory, or UNIX domain sockets (usefully). However, by specifying an entitlement that requests membership in an application group, an app can use these technologies to communicate with other members of that application group.

...

@hyc: It sounds to me like if you want this to work you're going to need to patch your own builds of LMDB to include your application group name. With that 31-byte limit I don't even see much point to making this settable via LMDB API.

jb55 commented 3 weeks ago

Sounds like we just need to patch the LMDB semaphore name to match our group container?

I wonder if this semaphore name change will allow us to not have to close the DB when it goes into the background as well. that would be amazing.

UPDATE: did not fix it unfortunately :( app still gets killed if the lockfile is present

jb55 commented 3 weeks ago

@danieldaquino can you please try:

I've added the virtual memory entitlement to our notification extension so that you can open lmdb, I've also applied the group name to the semaphore in LMDB, hopefully that fixes weird issues with that.

jb55 commented 3 weeks ago

I have the same PR for highlighter for testing:

jb55 commented 3 weeks ago

Pushed a new internal testflight but it doesn't look like

did anything

hyc commented 2 weeks ago

Furthermore, I found this info on http://www.lmdb.tech/doc/index.html (Note: Not sure if this is official documentation):

It is, but it's not necessarily the most recent. The docs are generated by doxygen from comments embedded in the LMDB source, thus you can always find the most accurate docs in your own LMDB source tree. Mainly just by reading lmdb.h.

On BSD systems or others configured with MDB_USE_POSIX_SEM, startup can fail due to semaphores owned by another userid.

Fix: Open and close the database as the user which owns the semaphores (likely last user) or as root, while no other process is using the database.

This message feels a bit ambiguous — Is the problem the user the process is running as, the fact that there are two processes accessing the database, or both? However, if I am interpreting this correctly, then it implies that two processes cannot access the same lmdb database at the same time on these platforms.

No. If multiprocess access wasn't supported on those platforms we would say it's not supported. The message is quite clear - there may be problems if the semaphore is owned by a different userid. This is because POSIX semaphores are controlled by Unix permission bits, and we shouldn't have to explain to you how the operating system you're using works. If the permission bits are too restrictive, then only processes from the same userid will have access. If all the processes are running as the same userid then there's no issue.

danieldaquino commented 2 weeks ago

Thank you very much @hyc for your help!

Furthermore, I found this info on http://www.lmdb.tech/doc/index.html (Note: Not sure if this is official documentation):

It is, but it's not necessarily the most recent. The docs are generated by doxygen from comments embedded in the LMDB source, thus you can always find the most accurate docs in your own LMDB source tree. Mainly just by reading lmdb.h.

Makes sense, I will keep this in mind.

On BSD systems or others configured with MDB_USE_POSIX_SEM, startup can fail due to semaphores owned by another userid.

Fix: Open and close the database as the user which owns the semaphores (likely last user) or as root, while no other process is using the database.

This message feels a bit ambiguous — Is the problem the user the process is running as, the fact that there are two processes accessing the database, or both? However, if I am interpreting this correctly, then it implies that two processes cannot access the same lmdb database at the same time on these platforms.

No. If multiprocess access wasn't supported on those platforms we would say it's not supported. The message is quite clear - there may be problems if the semaphore is owned by a different userid. This is because POSIX semaphores are controlled by Unix permission bits, and we shouldn't have to explain to you how the operating system you're using works. If the permission bits are too restrictive, then only processes from the same userid will have access. If all the processes are running as the same userid then there's no issue.

Interesting. In the case of iOS, some people have mentioned that all iOS apps use the same userid (1), and I have tested this (using iOS 17.6.1) and it seems to be the case for us (Both app and extension processes report to be using userid 501)

I tested @jb55's patch (https://github.com/damus-io/damus/pull/2410) which follows one of your past recommendations on this topic, and it seems to resolve the issue on my end.

I believe @jb55 might be facing another separate issue with similar symptoms. I will work with him to debug whatever issues remain

jb55 commented 2 weeks ago

btw it's not just me, vanessa has the same issues

jb55 commented 2 weeks ago

are you testing with your local dev build or testflight? I find when I'm connected with a debug session the sandbox behavior is completely different. An example of this is the lockfile. not closing the DB works great when debugger is connected, but when I run the app separately iOS will crash the app in the presence of a lockfile.

danieldaquino commented 2 weeks ago

are you testing with your local dev build or testflight?

In my previous comment I tested with a local dev build. However, I tested with the newest testflight build you pushed, and it also worked on all 3 app states (foreground, background, closed)

Procedure:

  1. Open the app
  2. On a separate device, send an event that triggers a push notification (DM and post mention)
  3. Check the push notification is received on the device under test and ensure that the profile names are formatted (indicates that extension was able to fetch info from NostrDB)
  4. Background the app by going to the home screen
  5. Repeat steps 2 and 3
  6. Close app altogether on the app switcher
  7. Repeat steps 2 and 3

Test 1 preconditions:

Test 1 results: Notifications display formatted profile names in all steps


Test 2 preconditions:

Test 2 results: Notifications display formatted profile names in all steps

jb55 commented 2 weeks ago

@alltheseas do you have issues with push notification formatting?

for context:

me: iphone 14 pro max (18.0) - 380 mb nostrdb vanessa: iphone 15 pro max (17.6.1) - 13gb nostrdb

jb55 commented 2 weeks ago

@alltheseas

jb55 commented 2 weeks ago

I believe this is fixed now