Nozbe / WatermelonDB

πŸ‰ Reactive & asynchronous database for powerful React and React Native apps ⚑️
https://watermelondb.dev
MIT License

Initial Sync Download #650

Closed PatrickLambert1101 closed 3 years ago

PatrickLambert1101 commented 4 years ago

I see in the docs that the Sync API function pullChanges uses a fetch request to get the changed records.

My question is: how does this fetch request handle large amounts of data? For example, if an existing user gets a new device and has to download 500,000+ records linked to their profile, is there a way of initialising the Watermelon DB from the remote DB and guaranteeing all the data is there?

delkant commented 4 years ago

Also, how will this work with react-dom vs React Native? On the one hand, with React Native -- since it is a pre-built app -- if you have generic data (i.e. for all users) you should be able to ship a pre-loaded DB with the app (is there any documentation on this?), but you don't have this option with a web app. Does that mean we should be able to download a ready-to-use DB from a remote location? How? Or will we have to go record by record with the sync API?

sators commented 4 years ago

+1

SavioScaglioni commented 4 years ago

+1

Skullcan commented 3 years ago

I have a similar question/problem.

The initial sync download on my app is around 46MB of data. The download/parse in pullChanges is ok, but the app crashes after the object is returned to the synchronize() method.

PS: This only happens on the Android emulator, but it makes development impossible since I can't test new installations properly.

Edit: tried to use _unsafeBatchPerCollection in the synchronize() method, but without success.

radex commented 3 years ago

@Skullcan 46MB is quite a lot. There are several tricks you can use if you can't get this size down. _unsafeBatchPerCollection is one such trick. Another is to enable android:largeHeap="true" in your manifest.
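
For reference, a minimal sketch of passing that flag to synchronize() -- pullChanges/pushChanges stand in for whatever your app already uses:

import { Database } from '@nozbe/watermelondb'
import { synchronize } from '@nozbe/watermelondb/sync'

// Minimal sketch -- pullChanges/pushChanges are whatever your app already
// passes to synchronize().
export async function syncWithSmallerBatches(
  database: Database,
  pullChanges: Parameters<typeof synchronize>[0]['pullChanges'],
  pushChanges: Parameters<typeof synchronize>[0]['pushChanges'],
): Promise<void> {
  await synchronize({
    database,
    pullChanges,
    pushChanges,
    // Apply remote changes in one batch per collection instead of a single
    // huge batch -- trades sync atomicity for lower peak memory usage.
    _unsafeBatchPerCollection: true,
  })
}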

Skullcan commented 3 years ago

Hi @radex thanks for the quick response.

Yeah, we are probably going to change our approach.

I tried the tricks you mentioned without success. I keep getting partially executed syncs; the next time we try to sync, I get diagnostic errors and the app crashes again.

I was reading about Batch seeding in #410 and I think we should consider this to fix our problem. Especially because we can't reduce the size of the app data atm.

Any other insights would be appreciated.

Here's the scenario: the initial load is 46MB. The sync works on top-end phones (iPhones and S10s). I thought it crashed only in the emulator, but it's crashing on low-end phones too. Tried android:largeHeap="true" and _unsafeBatchPerCollection, with no success.

Thanks again.

radex commented 3 years ago

Give me more information: what's the cause of the crash, and what are the memory stats before sync, during sync, and at the crash? Maybe that will be enough for me to give you guidance.

Skullcan commented 3 years ago

Give me more information: what's the cause of the crash, and what are the memory stats before sync, during sync, and at the crash? Maybe that will be enough for me to give you guidance.

Hi @radex

Here's the info you asked for.

Cause of the crash as per android studio's logcat: OOM: [] reason = Max heap size was exceeded (1 from category: vm_allocate_category), numCollections = 920, heapSize = 538185728, allocated = 534975713, va = 348127232.

Memory usage in idle: avg 235MB
Memory usage during sync: min 330MB, max 520MB
Memory usage at crash: 450MB
All the values are approximate; the source is the Android Studio profiler.

Here's the end of the logcat from the moment watermelon started processing the sync info. logcat_crash.txt

Don't know if all this will be helpful... I'm working on some kind of workaround for this right now.

Thanks again for your time.

Edit: Forgot to mention that the emulator I used for testing purposes is the Android Studio Nexus 5 with its default configuration.

radex commented 3 years ago

try tweaking this number down and see if that's enough to avoid the crash:

https://github.com/Nozbe/WatermelonDB/blob/master/src/sync/impl/applyRemote.js#L242

Skullcan commented 3 years ago

Hi, thanks for your suggestion. Sadly it didn't work :(

I tried lowering the number per batch progressively, down to 500... The result is always the same: the debug build crashes with the same error mentioned above after roughly 20-22k records are processed. Tried other emulators too, with no success either -- Pixel 3, Nexus, and the same with Genymotion emulators.

I then tested an internal production build on real devices. The change helped a bit: albeit very slow, it now works on phones that were crashing previously.

radex commented 3 years ago

I have some ideas for how to dramatically improve this situation, but since we're not running into this limitation at the moment, I probably won't implement them in the very near future. In the meantime, consider hacking the initial sync a little by passing data in chunks -- i.e. the first sync only saves part of the data, then run it a few more times until you have all the data.

If you have the skills to get your hands dirty with C++ and JNI on Android, then you could also help me finish the Android JSI integration, which would be the biggest step towards permanently fixing this situation.

Skullcan commented 3 years ago

Alright, I understand and appreciate your help in this case.

Unfortunately I don't have the skills nor the time available to help with this.

I'll do what you suggested tho, I think that's the best approach atm.

Thanks very much for your time.

We've been using πŸ‰ for about 1 year now, and it's very nice and easy to work with. Keep it up!

MikeGlennFWC commented 3 years ago

Could we do something along the lines of chunking? For example on an initial sync, or a sync after a long period between lastPulledAt and now.

We'd use a configurable max "time chunk" size, say 30 seconds, so we can offset through the results until we get to now. You could add something to the pull response to say "hey, keep fetching, there's more!". Once lastPulledAt is within 30 seconds, the final response doesn't have the repull key.

This would need to be configurable, as you should know roughly how much data you can reasonably/safely chunk by time.

Another key point here is that if lastPulledAt is sent alone, just send everything and do not limit, which should keep compatibility with current sync backend implementations.

radex commented 3 years ago

@MikeGlennFWC I don't think there's anything stopping you from implementing that with the current version of Watermelon. You can have the backend return a timestamp that's significantly older than the current timestamp (or some extra field), and add a bit of app-level logic to detect this and run Watermelon's synchronize() again. Should be pretty easy to do, actually!
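
For illustration, a minimal sketch of that app-level loop, assuming a hypothetical hasMore field in the pull response and a backend that caps each response's timestamp at the last record it actually included (endpoint paths and field names are placeholders):

import { Database } from '@nozbe/watermelondb'
import { synchronize } from '@nozbe/watermelondb/sync'

export async function syncUntilCaughtUp(database: Database): Promise<void> {
  let hasMore = true
  while (hasMore) {
    hasMore = false
    await synchronize({
      database,
      pullChanges: async ({ lastPulledAt, schemaVersion }) => {
        const response = await fetch(
          `/sync?last_pulled_at=${lastPulledAt ?? 0}&schema_version=${schemaVersion}`,
        )
        if (!response.ok) throw new Error(await response.text())
        // Hypothetical backend contract: `timestamp` only covers the records
        // included so far, and `hasMore` is set while older data is pending.
        const { changes, timestamp, hasMore: more } = await response.json()
        hasMore = Boolean(more)
        return { changes, timestamp }
      },
      pushChanges: async ({ changes, lastPulledAt }) => {
        await fetch(`/sync?last_pulled_at=${lastPulledAt}`, {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify(changes),
        })
      },
    })
  }
}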

I don't necessarily recommend doing that (it might be difficult to get this to work correctly in all edge cases, and the default implementation and backend specification of Watermelon sync takes correctness as its first priority, even at the expense of performance)

MikeGlennFWC commented 3 years ago

Yes, I think you are right -- it can be done without any changes to Watermelon. It's something I've been thinking about as I'm using Watermelon more and more. It may never be a problem for me, so I may never have the need to solve it, but it's on my radar as a potential pitfall. If I do cross that bridge, though, I will update this thread.

radex commented 3 years ago

Yes, please do. If you run into some specific issues that you can solve, that might be helpful to others if they need to do this.

Hopefully though, Watermelon's progress on performance will mean that such tricks won't be necessary for users.

hanyang7427 commented 3 years ago

@Skullcan I'm looking for a way to sync between Watermelon and an MSSQL database -- can you tell me what sync backend server you use?

MikeGlennFWC commented 3 years ago

Hello again,

I finally took the leap to find a solution for this. After seeding a lot of data into my server, I ran into an issue where the initial sync was exhausting the memory on the server, as I expected would happen at some point. The longer time goes on, with fresh devices being added to the system that need to sync from the beginning, the bigger this initial sync is going to get.

I could have increased the memory allocation limit on the server, but this only moves the problem to a later point in time.

My first idea was to limit the sync size to a number of versions (or timestamps, as per the default implementation) so it chunked the data. Once I tried this I quickly realised that I cannot guarantee how much data sits between 2 timestamps/versions, as data may or may not have been added to certain tables, so it's impossible to gauge.

What actually happened was that the sync took forever as it paged through the versions. If I increase the page size, I run the risk of pulling too much data. After an hour of cycling through versions I came up with the approach below, which I think is FAR better.

So the core of the problem is that there is a risk of passing too much data between the server and a device, so I decided it would be better to calculate the size of the data between 2 versions and store this in a table.

I created a version_chunks table that looks like this:

(Please note, my sync protocol closely mimics Fahri's approach described here: https://fahri.id/posts/how-to-build-watermelondb-sync-backend-in-elixir/)

Latest version = 11234

version_from | version_to | size
------------ | ---------- | -------
0            | 1232       | 1048234
1233         | 3843       | 1048876
3844         | 10000      | 1052896

So now when a fresh device comes to the party, its version will be 0. Before we fetch the changes from all the tables, we look up that version in the version_chunks table where the version number is between version_from and version_to. If a chunk exists, we set the end version to the version_to of that chunk, and the start version is just whatever the device's last version/timestamp was.
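
Roughly, the lookup looks like this (a TypeScript-ish sketch purely for illustration -- my real backend is PHP, and all names here are made up):

type VersionChunk = { version_from: number; version_to: number; size: number }

// Decide how far a pull may go: if the device's last version falls before the
// end of a pre-computed chunk, cap the pull at that chunk's version_to and
// flag the response as partial. Chunks are assumed sorted by version_from.
function resolvePullWindow(
  lastVersion: number,      // device's last synced version (0 for a fresh install)
  latestVersion: number,    // current server-side version
  chunks: VersionChunk[],
): { toVersion: number; partial: boolean } {
  const chunk = chunks.find(c => c.version_to > lastVersion)
  if (chunk && chunk.version_to < latestVersion) {
    return { toVersion: chunk.version_to, partial: true }
  }
  return { toVersion: latestVersion, partial: false }
}

// The pull query then fetches records with version > lastVersion and
// version <= toVersion, and returns toVersion as the response timestamp.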

Because this means the sync is not a full sync but only a partial sync, I decided to add a flag, like so:

return ['changes' => [ ** changes ** ], 'timestamp' => 1232, 'partial' => true];

If the device sees that the partial flag is true, it will trigger a repeat sync job, and repeat that until partial becomes false, at which point the device will be up to date.

In the case above the next responses would be:

return ['changes' => [ ** changes ** ], 'timestamp' => 3843, 'partial' => true];
return ['changes' => [ ** changes ** ], 'timestamp' => 10000, 'partial' => true];

Then, because the pages only go up to version 10000 and the actual latest version number is 11234, the last sync would look like:

return ['changes' => [ ** changes ** ], 'timestamp' => 11234, 'partial' => false]; <-- partial false, up to date

Finally, the tricky part is populating the pages. Currently the way I do this, which is quite intensive, is by incrementally cycling through the versions and measuring the size in bytes of the JSON object.

So I would start with 0 - 100 = 1234 bytes, then 0 - 200 = 5432 bytes, then 0 - 300 = 8765 bytes, and so on. Once we break the target of 1MB (1048576 bytes), we store the version_from and version_to in the version_chunks table, and do this again until we are close to the latest version number. This job can run periodically, but it will be very intensive on the database; I'd probably recommend it be done on a read replica if the constant selects would cause a performance issue in production.
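
A rough sketch of that job (TypeScript-ish again, purely illustrative; sizeOfChangesBetween and saveChunk are placeholders for the real queries):

const TARGET_CHUNK_BYTES = 1048576 // ~1 MB per chunk
const STEP = 100                   // how many versions to grow the window by

async function buildVersionChunks(
  latestVersion: number,
  sizeOfChangesBetween: (from: number, to: number) => Promise<number>,
  saveChunk: (from: number, to: number, size: number) => Promise<void>,
): Promise<void> {
  let from = 0
  while (from < latestVersion) {
    let to = Math.min(from + STEP, latestVersion)
    let size = await sizeOfChangesBetween(from, to)
    // Grow the window until the serialized payload crosses the target size
    // (or we run out of versions).
    while (size < TARGET_CHUNK_BYTES && to < latestVersion) {
      to = Math.min(to + STEP, latestVersion)
      size = await sizeOfChangesBetween(from, to)
    }
    await saveChunk(from, to, size)
    from = to + 1
  }
}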

So now when the sync happens it cycles through 1MB pages until it reaches the most up-to-date version. I was able to successfully synchronise the exact data set that was failing before.

The above approach will also work for a device that has not seen the server in a long time. Its timestamp may be 3900, so it can fetch from the closest page and run partial syncs from there until it is up to date. In the above example the fetch would be 3900 - 10000 (not 3844 - 10000), then finally 10000 - 11234 (up to date).

Please feel free to let me know your thoughts on this approach -- I'm interested to see if it could be improved.

Thanks Mike

radex commented 3 years ago

I could have increased the memory allocation limit on the server, but this only moves the problem to a later point in time.

@MikeGlennFWC I'm curious (and sorry if you already said this before): how much data are we talking about? I find it somewhat difficult to believe that an amount of data that would be reasonable to download to a mobile device would be a problem for a properly engineered backend server. Admittedly, it's much easier to develop something that works than something that's very fast. (Consider, for example, that Watermelon Sync has had probably more than an order of magnitude of perf improvement, but that took many iterations and rewrites...)

MikeGlennFWC commented 3 years ago

Yes, you are exactly right, though it depends on a number of factors. The default memory limit in PHP is 128MB; the amount of data was less than that, but there is the rest of the request stack to consider as well.

Increasing this limit also pushes the problem onto the device, which is an issue I would hate to hit in production.

Usually if I hit this default memory limit, a band-aid is to increase the limit; a solution is to reduce the amount of memory used per request.

Which is the idea behind chunking into relatively fixed chunk sizes.

radex commented 3 years ago

@MikeGlennFWC Have you considered other ways to reduce memory usage? Or have you looked into what the memory usage behavior actually is? If memory usage is high during JSON generation, you could stream the response -- read from the database as you're writing JSON and push it into the response, instead of generating one big PHP structure and then serializing it. I'm not sure if this approach is viable in your case, but it could be much simpler and more reliable with the same effect (low peak memory usage regardless of request size).
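
To make that concrete, here's roughly what the idea looks like in Node/Express terms (the backend in question is PHP, so this is only an illustration of the technique; streamRowsFor is a hypothetical async generator that yields one already-serialized record at a time):

import type { Response } from 'express'

// Write the pull response incrementally instead of building one big object
// and serializing it at the end -- peak memory stays roughly one row at a time.
// For simplicity this sketch puts every record under "created".
export async function writeChangesStreaming(
  res: Response,
  tables: string[],
  streamRowsFor: (table: string) => AsyncIterable<string>,
  timestamp: number,
): Promise<void> {
  res.setHeader('Content-Type', 'application/json')
  res.write(`{"timestamp":${timestamp},"changes":{`)
  for (let i = 0; i < tables.length; i++) {
    res.write(`${i > 0 ? ',' : ''}${JSON.stringify(tables[i])}:{"created":[`)
    let first = true
    for await (const row of streamRowsFor(tables[i])) {
      res.write(`${first ? '' : ','}${row}`) // each row is a small JSON string
      first = false
    }
    res.write('],"updated":[],"deleted":[]}')
  }
  res.write('}}')
  res.end()
}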

MikeGlennFWC commented 3 years ago

@radex hehe, I was looking into doing exactly that. I looked up Transfer-Encoding: chunked, but then I conceded when I realised it's not possible with PHP's built-in json_encode, as that would load the entire object into memory.

Thinking about it again, of course this is absolutely possible by iteratively processing the data and building the object manually, given that the depth of the object is known.

Looks like it's already been done here: https://github.com/violet-php/streaming-json-encoder

Requesting the dataset from the database is potentially going to be an issue at some point, albeit far more efficient on memory usage.

Theoretically speaking, if we stream the object from the server, and the object is 500MB in total, how does WatermelonDB handle the data? Would it load the entire object into the device's memory to perform the sync?

I guess the fetch API would load this into memory before WatermelonDB does. My knowledge of how the memory allocation works isn't great; I imagine the memory would be duplicated and GC'd later on whilst being passed from fetch -> πŸ‰.

radex commented 3 years ago

Requesting the dataset from the database is potentially going to be an issue at some point, albeit far more efficient on memory usage.

Probably, although that depends on the design of your backend and how it communicates with the database...

Theoretically speaking, if we stream the object from the server, and the object is 500MB in total, how does WatermelonDB handle the data? Would it load the entire object into the device's memory to perform the sync?

Yes, it would -- however, receiving a streaming JSON response isn't out of the question for the future.

BruceSuperProgramer commented 3 years ago

Hi guys, I have a question about the sync function. Is there any possibility that the sync function will block the React Native UI thread? Or are these processes completed entirely on the native side, with nothing transferred over the RN bridge and no chance of the screen becoming unresponsive?

radex commented 3 years ago

@BruceSuperProgramer sync does partially block the UI thread, though how much depends on a lot of details. Generally, the stutter during normal incremental sync updates should be very small. However, for initial sync (login), I have a completely new solution which you can test on the 0.23 pre-releases:

https://github.com/Nozbe/WatermelonDB/pull/1067/files#diff-4375a114a579f6547acd4173ef421f30cb5742b1c3091c3168928d8a514da74dR146

With Turbo Login, there is virtually no work being done on either the main (native UI) or JavaScript threads; it all happens on a separate thread and is very fast (up to 5x faster than the current implementation).
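
Rough usage sketch, written from memory against the current sync docs -- treat the unsafeTurbo flag and the syncJson return shape as assumptions to verify for your version, and note that turbo mode is meant only for the very first sync into an empty database (endpoint URL and params are placeholders):

import { Database } from '@nozbe/watermelondb'
import { synchronize } from '@nozbe/watermelondb/sync'

// Hedged sketch of Turbo Login: pass the experimental `unsafeTurbo` flag and
// return the *raw* response text as `syncJson` from pullChanges, instead of a
// parsed changes object.
export async function turboInitialSync(database: Database): Promise<void> {
  await synchronize({
    database,
    unsafeTurbo: true,
    pullChanges: async ({ lastPulledAt, schemaVersion }) => {
      const response = await fetch(
        `/sync?last_pulled_at=${lastPulledAt ?? 0}&schema_version=${schemaVersion}`,
      )
      if (!response.ok) throw new Error(await response.text())
      return { syncJson: await response.text() }
    },
    pushChanges: async ({ changes, lastPulledAt }) => {
      // Push is unaffected by turbo mode; send changes to your backend here.
      await fetch(`/sync?last_pulled_at=${lastPulledAt}`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(changes),
      })
    },
  })
}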

BruceSuperProgramer commented 3 years ago

Hi @radex, much appreciated for the quick reply, will give it a try.

BruceSuperProgramer commented 3 years ago

Hi @radex, I tried installing pre-releases v0.23.0-10 and v0.23.0-9, but I encountered an error saying a module was not found; it throws this type of error on both of them. I remember it also complained that @nozbe/watermelondb/sync was not found in v0.23.0-10. The screenshots below are from testing against v0.23.0-9. Do you know the reason for it? I have reinstalled node_modules and Pods. (screenshots attached)

radex commented 3 years ago

This is wrong -- a git-based package declaration won't work. Use the version released on NPM: https://www.npmjs.com/package/@nozbe/watermelondb - 0.23.0-10

BruceSuperProgramer commented 3 years ago

Hi @radex, thanks for guiding me onto the right track, will try it later.

BruceSuperProgramer commented 3 years ago

Hi @radex, putting the initial load onto another thread is exciting news. I feel it will avoid the UI lag issues caused by a large amount of DB insertion on the initial load. Do you have a confirmed or approximate date for when the v0.23 pre-release will become stable? πŸ₯³

radex commented 3 years ago

@BruceSuperProgramer I don't promise dates, but weeks, not months. We use pre-releases in production at Nozbe, if that gives you any confidence, and we're currently testing Turbo Login internally -- it's slated to ship in our app in about 2 weeks if no more issues are discovered.

BruceSuperProgramer commented 3 years ago

Hi @radex, I tried 0.23-10 and 0.23-8. I am unable to insert data into the database due to the error shown in the screenshots. Do you know what causes it? (screenshots attached)

I think I just figured it out. I forgot to include the timestamp field in my API response body. πŸ˜„

KrisLau commented 2 years ago

@BruceSuperProgramer After syncing, my πŸ‰ tables' timestamps are set to 0.0 -- did you ever run into that, and how did you resolve it?

All my timestamp fields are declared as number types and with the @date decorator in their respective models. On the backend I'm using Postgres and ObjectionJS, which returns timestamps like this: "created_at": "2021-12-23T15:16:10.388Z" in the changes object when I log it from the frontend.

EDIT: ~Fixed, I guess RN just wasn't updating with my latest changes~ So weird -- it seems like it gets the right timestamp on the first sync, but after the [Diagnostic error: [Sync] Server wants client to create record table_name#id1234567890123, but it already exists locally. This may suggest last sync partially executed, and then failed; or it could be a serious bug. Will update existing record instead.] error, the timestamps get messed up.

BruceSuperProgramer commented 2 years ago

@KrisLau Sorry, it has been a long time since I posted this question. But from my post it looks like I included created_at in my API response. From memory, the WatermelonDB sync API response has to follow a certain format. You'd better check their documentation again and make sure the data returned from your API complies with that format.
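
One thing that might be worth checking (just a guess on my part, not a confirmed diagnosis): @date columns are stored as plain numbers (milliseconds since epoch), so if the backend sends ISO strings like "2021-12-23T15:16:10.388Z", they may need converting before they go into the pull response. A small sketch of that conversion -- field names here are illustrative:

// Sketch: convert ISO date strings from the backend into the epoch-millisecond
// numbers that @date fields expect.
type RawRow = { id: string; created_at: string; updated_at: string; [key: string]: unknown }

function toWatermelonRecord(row: RawRow) {
  return {
    ...row,
    created_at: new Date(row.created_at).getTime(), // ISO string -> milliseconds since epoch
    updated_at: new Date(row.updated_at).getTime(),
  }
}

// Usage: rows.map(toWatermelonRecord) when building the `created`/`updated` arrays.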

ChrisSimoniMBT commented 1 year ago

For anyone stumbling across this for the main issue of chunking data: my initial syncs can be 100MB+ of JSON, and when parsed in RN it causes an OOM crash.

I solved this by reimplementing just the synchronize function, and specifically just the pullChanges portion of it.

On the server side I check the total size of the JSON string; if it's over a certain size threshold, I start chunking it out. I start chunking by collections, and if a single collection is too large then I chunk by the created, updated and deleted fields. If those are still too large, I send a max of 25 items at a time from the field being chunked.

In the synchronize function you will see that after pullChanges is called, the changes and timestamp are actually acted on separately. This allows us to create an async queue that requests our initial sync over a websocket connection. It also attaches a listener for a websocket syncChunk event (or similarly named). Each chunk is sent from server to client with a syncChunk event; the event feeds it to the AsyncQueue, which then feeds it to my custom synchronize function in a for await...of loop. We apply that chunk's changes the same way the original sync code does. When we are done, the server sends a syncFinished event, which aborts the async queue and continues the rest of the synchronize function as normal.

Overview: on the server

  1. Implement a startSync websocket listener:
    1. Retrieves last_pulled_at, etc. values.
    2. Retrieves the ENTIRE SyncPullResult object as if we were going to send it all in one response. This is to ensure that we really only have the data relative to the timestamp, even if our chunking operation takes a while.
    3. Chunks out if the stringified JSON object is > 25 MB, chunks even more if collections are > 25 MB, and chunks even more if created/updated/deleted arrays are > 25 MB.
    4. Emits chunks as SyncPullResult objects, with their timestamps grabbed from the main object we are chunking, using the syncChunk event.
    5. When everything is chunked out and sent, emits a syncFinished event to let the client know nothing is left.

Client (not going to go into as much detail):

  1. Implement a custom synchronize function that uses an AsyncQueue and a for await...of loop to retrieve your chunks.
    • I create my queue and start it in the synchronize function, then go straight to the for await...of loop to retrieve my chunks.
  2. In the loop, set the newLastTimestamp value to the chunk's timestamp and applyRemoteChanges the changes object.

Before this, my changes object was ~100MB and would cause RAM usage in the Expo dev client to hit ~1200MB before an OOM crash. Now my max chunk sizes are 25MB or less, and max RAM usage according to the Expo dev client is a ~700MB spike for a second or less before settling back down.

This could be improved further, either by doing artificial waits on the server, by sending a chunkProcessed event to the server to tell it to send us our next chunk, or by other strategies I'm sure -- based on my understanding that WebSockets read the data and put it into a queue on the client, to be acted upon the moment it is received.

I'm not sure how "proper" this approach is, but I don't have more time to get a better implementation, so this is what I'm rolling with for my chunking strategy.

The initial sync size can't be negotiated, as we need all data and images related to specific projects available for offline use at a moment's notice the second we fully load into the project.

Sizes referred to above are for the data in stringified JSON format.
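
To illustrate the event-to-async-iterable bridge described above, here's a rough TypeScript sketch (not my actual code; the socket type, event names and payload shape are just assumptions based on the description):

type PullChunk = { changes: unknown; timestamp: number }

// Minimal bridge from callback-style websocket events to an async iterable
// that a for await...of loop can consume.
class AsyncQueue<T> {
  private items: T[] = []
  private waiting: ((result: IteratorResult<T>) => void)[] = []
  private done = false

  push(item: T): void {
    const resolve = this.waiting.shift()
    if (resolve) resolve({ value: item, done: false })
    else this.items.push(item)
  }

  finish(): void {
    this.done = true
    for (const resolve of this.waiting.splice(0)) {
      resolve({ value: undefined as never, done: true })
    }
  }

  [Symbol.asyncIterator](): AsyncIterator<T> {
    return {
      next: (): Promise<IteratorResult<T>> => {
        if (this.items.length > 0) {
          return Promise.resolve({ value: this.items.shift() as T, done: false })
        }
        if (this.done) {
          return Promise.resolve({ value: undefined as never, done: true })
        }
        return new Promise(resolve => this.waiting.push(resolve))
      },
    }
  }
}

// Usage sketch: feed socket events into the queue, then consume chunks in order.
type SyncSocket = {
  on(event: string, handler: (payload: any) => void): void
  emit(event: string, payload?: unknown): void
}

async function pullInChunks(socket: SyncSocket, lastPulledAt: number | null): Promise<void> {
  const queue = new AsyncQueue<PullChunk>()
  socket.on('syncChunk', chunk => queue.push(chunk))
  socket.on('syncFinished', () => queue.finish())
  socket.emit('startSync', { lastPulledAt })

  for await (const chunk of queue) {
    // Apply chunk.changes and record chunk.timestamp here, the same way the
    // original synchronize implementation does.
  }
}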

celciusjj commented 5 months ago

@ChrisSimoniMBT Do you have some code or a guide/tutorial for this, or could you explain it in more detail? It's very interesting -- we have the same problem.

celciusjj commented 4 months ago

We have implemented something similar in chunks, but when applying the changes, sometimes it crashes and other times it doesn't. It still exhausts the memory on some phones even though the chunk is small -- do you have any idea why?

isaachinman commented 1 week ago

Hi all – I'm new to Watermelon. I've been playing with it for a few weeks.

As I've built out our app, our initial sync sizes have grown to 50MB+.

I appreciate the detailed responses above. However, I'm not sure I want to go as far as chunking data over websockets. Perhaps I am missing something obvious – can't this be done with a straightforward loop?

Here is an example of what I have implemented:

import { synchronize } from '@nozbe/watermelondb/sync'

export const syncDatabase = async () => {
  let page = 1
  let hasMore = true

  let startTimestamp: number | undefined | null = null
  let dataTimestamp: number | null = null

  while (hasMore) {
    await synchronize({
      database: appDatabase,
      migrationsEnabledAtVersion: 1,
      pullChanges: async ({
        lastPulledAt,
        migration,
        schemaVersion,
      }): Promise<SyncPayload> => {
        if (page === 1) {
          startTimestamp = lastPulledAt
        }

        const { data } = await apiAxios<ClientDB.SyncPayload>({
          method: 'GET',
          params: {
            lastPulledAt: startTimestamp,
            migration,
            page,
            schemaVersion,
          },
          url: '/v1/client-db',
        })

        if (page === 1) {
          dataTimestamp = data.timestamp
        }

        if (data.hasMore === false) {
          hasMore = false
        }

        return {
          changes: data.changes,
          timestamp: dataTimestamp as number,
        }
      },
      pushChanges: async data =>
        apiAxios({
          data,
          method: 'POST',
          url: '/v1/client-db',
        }),
    })

    page += 1
  }
}

Basically, we paginate the sync endpoint and preserve the initial lastPulledAt timestamp to ensure we set each page with the same lastPulledAt value, just incrementing page.

So far, this seems to be working well.

Of course, this does not handle failures yet. I will need to maintain a global state and queue for paginated sync runs, to be able to resume at failure points, etc.

It really feels like Watermelon should support chunking/pagination out of the box, as all "serious" users will likely face this issue.