gauteh / lieer

Fast email-fetching and sending and two-way tag synchronization between notmuch and GMail
http://lieer.gaute.vetsj.com

Speed up initial download #40

Closed manishrjain closed 6 years ago

manishrjain commented 6 years ago

My personal email goes back over 10 years, and I'd rather not have all of that downloaded on my laptop (even setting the max messages per label in Gmail doesn't help with that, because I have gained way too many labels over the years). offlineimap allows you to sync mails from certain 'folders'. Would be good if gmailieer allows that.

gauteh commented 6 years ago

The mail that should be synced could perhaps be limited by specifying a query (https://github.com/gauteh/gmailieer/blob/master/lieer/remote.py#L64). The problem is that e.g. history().list(..) does not take a query, so things would get messy when doing a partial update. There are probably other methods that should be limited to the query as well.

Note that once you have all your emails synced the partial update will be fast. Is this in order to save disk space?


manishrjain commented 6 years ago

Disk space isn't that much of a concern (the number of files might be, but I don't know because I haven't yet done a full sync -- on a tangential note, would have been awesome if one could just use a key-value store instead of files to store mails). But, just the process of downloading a decade of emails on your laptop is really slow, and very resource heavy. Also, seems a bit pointless, because when on laptop, you just want to get your inbox to zero, not much else. Also, offlineimap does this out of the box.

I can see why partial sync would be tricky -- IIUC, it would require keeping an external state of all the messages that you had previously downloaded but don't belong to any of the labels in sync config anymore. So, the right question is, does the time it takes to get up and running with Gmailieer and Astroid on a laptop outweigh the complexity of doing partial syncs? I'd say yes.

gauteh commented 6 years ago

Manish R Jain writes on september 11, 2017 2:59:

Disk space isn't that much of a concern (might be, but I don't know because I haven't yet done a full sync). But, just the process of downloading a decade of emails on your laptop is really slow, and very resource heavy. Also, seems a bit pointless, because when on laptop, you just want to get your inbox to zero, not much else. Also, offlineimap does this out of the box.

I can see why partial sync would be tricky -- IIUC, it would require keeping an external state of all the messages that you had previously downloaded but don't belong to any of the labels in sync config anymore. So, the right question is, does the time it takes to get up and running with Gmailieer and Astroid on a laptop outweigh the complexity of doing partial syncs? I'd say yes.

When I said partial sync, incremental synchronization might have been more accurate. This is what happens automatically if you have a recent enough state file (have done partial pulls with not too long intervals).

For me, keeping my entire mail archive locally, indexed by notmuch and instantly available (usually searchable faster than through GMail), is a big point to using notmuch. Your inbox will be synced with GMail anyway (if you use gmailieer). Anyway, I certainly respect other use cases, my point is: Once you have everything down, partial (incremental) syncs are really fast, and you will never think about it again. Then it is just a huge advantage to have all your email history instantly searchable.

With notmuch + gmailieer you can even clean up your label-mess in a scriptable fashion.

I'm not entirely opposed to not syncing the full archive, but as you point out it will add a significant layer of complexity:

In short; please try and see how it feels after you have done a full sync. I do not think I have time to add this functionality, but I'd support a good implementation of it if you submit a PR.

manishrjain commented 6 years ago

So, I'm doing the mail sync on desktop. Started it in the morning (Sydney time) and it's going on. Got close to 140K emails, and it's currently at 10K, going at the rate of 2.5s/it (started with it/s, but then switched). So, this would take at least a day to finish (based on current ETA this would be 3.7 days). Fast.com shows my network bandwidth to be at 310 Mbps, so that's pretty fast; the mail download is slow.
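As a rough check of the ETA mentioned above, the arithmetic can be sketched in Python. The figures (about 140K messages in total, about 10K already downloaded, 2.5 s per message) are taken from the comment; the function name is made up for illustration, and the rate is assumed to stay constant, which it did not in practice.

```python
def remaining_days(total, done, seconds_per_item):
    """Estimate the remaining time in days, assuming a constant rate."""
    remaining_items = total - done
    return remaining_items * seconds_per_item / 86400  # 86400 seconds per day


# Figures from the comment above: ~140K messages, ~10K done, 2.5 s each.
eta = remaining_days(140_000, 10_000, 2.5)
print(f"{eta:.1f} days")
```

At that rate the remaining download is a multi-day job regardless of network bandwidth, which is why the per-message rate is the number to watch.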

I doubt I'll have time to submit a PR -- already running a bunch of OSS projects. Most likely, I'll just work around the problem by just letting it run.

I think if there's a way to speed up the email download, that'd alleviate this issue as well. I like the idea of having the entire mail on laptop, but the download speed makes this hard.

gauteh commented 6 years ago

Manish R Jain writes on september 11, 2017 8:31:

So, I'm doing the mail sync on desktop. Started it in the morning (Sydney time) and it's going on. Got close to 140K emails, and it's currently at 10K, going at the rate of 2.5s/it (started with it/s, but then switched). So, this would take at least a day to finish. Fast.com shows my network bandwidth to be at 310 Mbps, so that's pretty fast; the mail download is slow.

That amount of e-mail shouldn't be a problem, but I see that this is becoming a big problem for new users. I doubt that there really is a way around this on one API key, as I understand that GMail tries to limit full downloads.

I doubt I'll have time to submit a PR -- already running a bunch of OSS projects. Most likely, I'll just work around the problem by just letting it run.

Ditto. Plus just got a kid ;) That's why I'm reluctant to add that layer at the moment.

I think if there's a way to speed up the email download, that'd alleviate this issue as well. I like the idea of having the entire mail on laptop, but the download speed makes this hard.

Might be a throttling issue then, you could try to set up your own API key: instructions in the README. The public one receives 100k-1m requests / month (especially during new syncs).

gauteh commented 6 years ago

You could try out the --limit option just to see how things work, but remember to do a complete, full, sync before you do any pushing. It's really only designed for debugging, so there might be some weird side-effects.

manishrjain commented 6 years ago

Using my own API key doesn't help. Still reduces the batch req size to 100. Though, I might be doing something wrong, and my API might not be getting picked up -- I don't know. No way to tell from the log output.

The download is hovering around 1.5s/it -- if there's a way to improve this, that'd be awesome.

gauteh commented 6 years ago

Manish R Jain writes on september 11, 2017 11:24:

Using my own API key doesn't help. Still reduces the batch req size to 100. Though, I might be doing something wrong, and my API might not be getting picked up -- I don't know. No way to tell from the log output.

Did you re-auth with your new key? refer to the -h output and README for more info.

manishrjain commented 6 years ago

3rd day of syncing, and only at 70K emails. The batch size reduced to 50 at some point.

content:  53%|█████████████▊            | 74567/139771 [46:59:51<44:06:18,  2.44s/it]

I think the instructions in documentation to get the API aren't clear. Google console asks for multiple options, so I chose the options which looked the best. And got a client_id.json (not client_secret.json) file. I'll try using it again (this time with personal email).

Another thing I noticed is that I'm unable to gmi push from my work folder (which has already synced), because it somehow picks up emails from personal folder as well. So, I'm unable to use astroid even just for work until personal emails are all synced up.

manishrjain commented 6 years ago

Also, gmi should indicate which client id it is using, so a user can at least confirm that he's on the right one. Would be even better if it writes a warning when using the generic public client, with limitations on number of requests, etc.

manishrjain commented 6 years ago
pull: full synchronization (no previous synchronization state)
fetching messages: 139814it [11:37, 200.39it/s]
receiving content:   0%|▏   | 117/65192 [04:46<44:13:39,  2.45s/it]
reducing batch request size to: 100

So, even after passing in the client_id.json like this to both pull and to auth, I still get the reducing batch req size to 100 issue.

julian-klode commented 6 years ago

"You're limited to 100 calls in a single batch request. If you need to make more calls than that, use multiple batch requests." -- https://developers.google.com/gmail/api/guides/batch

I think the recommendation was to use 50 calls per batch request, even, to reduce throttling (the fewer calls per batch request, the less throttling).

julian-klode commented 6 years ago

The fixed limit of 100 might be new, and the 50 might be the old "soft" limit.

julian-klode commented 6 years ago

"Using batching is encouraged, however, larger batch sizes are likely to trigger rate limiting. Sending batches larger than 50 requests is not recommended." -- https://developers.google.com/gmail/api/v1/reference/quota

Hence it would make sense to use batch sizes of 50, if that's the recommended approach. When doing full syncs, I noticed performance to be bad until it went down to 50, but that took some time.
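To make the batch-size suggestion concrete, here is a minimal sketch of splitting a list of message ids into batches of at most 50. It is plain Python with no Gmail API calls; the helper name `chunked` and the sample ids are made up for illustration and are not taken from gmailieer.

```python
from typing import Iterator, List, Sequence

MAX_BATCH = 50  # upper bound recommended on the quota page quoted above


def chunked(ids: Sequence[str], size: int = MAX_BATCH) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


# Hypothetical message ids, just to show the resulting batch sizes.
message_ids = [f"msg{i}" for i in range(120)]
batches = list(chunked(message_ids))
print([len(b) for b in batches])  # [50, 50, 20]
```

Each yielded list would then be sent as one batch request.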

manishrjain commented 6 years ago

So, 50-100 requests per batch makes sense. But, how many batches are going on concurrently? The page doesn't say anything about the number of concurrent batches allowed. If they support it, then doing so would really improve the throughput.

My mail size so far is only 4GB, but it has taken 3 days of downloading to reach it -- that's way too slow for today's standards, particularly when using Gmail's APIs directly.

gauteh commented 6 years ago

Julian Andres Klode writes on september 13, 2017 2:36:

"You're limited to 100 calls in a single batch request. If you need to make more calls than that, use multiple batch requests." -- https://developers.google.com/gmail/api/guides/batch

I think the recommendation was to use 50 calls per batch request, even, to reduce throttling (the less calls per batch request, the less throttling).

We should probably reduce the default limit to 50 then. Thanks for digging this up. I assume they do not encourage concurrent requests, since that would defeat the purpose of a limit.

gauteh commented 6 years ago

Manish R Jain writes on september 13, 2017 3:50:

So, 50-100 requests per batch makes sense. But, how many batches are going on concurrently? The page doesn't say anything about the number of concurrent batches allowed. If they support it, then doing so would really improve the throughput.

My mail size so far is only 4GB, but it has taken 3 days of downloading to reach it -- that's way too slow for today's standards, particularly when using Gmail's APIs directly.

If you have any practical suggestions on how to speed up things with google, please let me know.

manishrjain commented 6 years ago

I assume they do not encourage concurrent requests, that would defeat the purpose of a limit.

I'm not so sure. Batching up requests in a single network call makes a lot of sense. Every network call has overhead, and doing batching amortizes that. But, that doesn't mean running multiple batches concurrently is a no-go. In fact, in Dgraph, we encourage our users to batch up mutations, and also run as many batches concurrently as possible.

Google doc is silent about running batches concurrently -- so this might be worth trying out. I think the rate at which Gmailieer is syncing isn't fast at the moment, so if doing multiple batches improves that rate, that'd be a good thing for adoption.

gauteh commented 6 years ago

Is the google api client library thread safe?

gauteh commented 6 years ago

Manish R Jain writes on september 13, 2017 1:58:

Also, gmi should indicate which client id it is using, so a user can at least confirm that he's on the right one.

If you used gmi auth -c path/to/your/client_secret.json correctly you would get a message that the user-provided API id and secret is used. It is not necessary to use -c with pull or push unless re-authorization is required (by google api). Only gmi auth -c .. will remove the existing authorization tokens, allowing you to manually re-authorize with a different client id/secret.

Have a look in remote.py:366.

Note that if your authorization expires for some reason, you need to re-supply your own API key using gmi auth -c .., otherwise you will be prompted to re-authorize the standard client id/secret.

gauteh commented 6 years ago

Manish R Jain writes on september 13, 2017 1:56:

Another thing I noticed is that I'm unable to gmi push from my work folder (which has already synced), because it somehow picks up emails from personal folder as well. So, I'm unable to use astroid even just for work until personal emails are all synced up.

You probably have messages that are present in both accounts. This works, but will cause tags/labels to be synced to both accounts. Open a new bug if there is something specific failing.

manishrjain commented 6 years ago

Okay, finished on the 5th day, with this stack trace.

/data/Mail/personal/mail/cur/15d6e71917791d47:2,S is not an email
receiving metadata:   5%|████████████▋ | 3699/74622 [06:44<16:46, 70.45it/s]
remote: could not find remote message: 15ddf5d8ec3b6a21!
receiving metadata:  20%|███████████████████████████████████████████████████▊ | 15174/74622 [11:33<07:49, 126.62it/s]
remote: could not find remote message: 15de46a60447183e!
receiving metadata:  21%|█████████████████████████████████████████████████████▊ | 15672/74622 [11:46<16:25, 59.81it/s]
remote: could not find remote message: 15de491dbeaef465!
receiving metadata:  24%|█████████████████████████████████████████████████████████████▌ | 17935/74622 [12:41<12:48, 73.76it/s]
Traceback (most recent call last):
  File "/usr/bin/gmi", line 4, in <module>
    __import__('pkg_resources').run_script('gmailieer==0.2', 'gmi')
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 742, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1510, in run_script
    exec(script_code, namespace, namespace)
  File "/usr/lib/python3.6/site-packages/gmailieer-0.2-py3.6.egg/EGG-INFO/scripts/gmi", line 8, in <module>
  File "/usr/lib/python3.6/site-packages/gmailieer-0.2-py3.6.egg/lieer/gmailieer.py", line 136, in main
  File "/usr/lib/python3.6/site-packages/gmailieer-0.2-py3.6.egg/lieer/gmailieer.py", line 307, in pull
  File "/usr/lib/python3.6/site-packages/gmailieer-0.2-py3.6.egg/lieer/gmailieer.py", line 531, in full_pull
  File "/usr/lib/python3.6/site-packages/gmailieer-0.2-py3.6.egg/lieer/gmailieer.py", line 562, in get_meta
  File "/usr/lib/python3.6/site-packages/gmailieer-0.2-py3.6.egg/lieer/remote.py", line 100, in func_wrap
  File "/usr/lib/python3.6/site-packages/gmailieer-0.2-py3.6.egg/lieer/remote.py", line 271, in get_messages
  File "/usr/lib/python3.6/site-packages/oauth2client-4.1.2-py3.6.egg/oauth2client/_helpers.py", line 133, in positional_wrapper
  File "/usr/lib/python3.6/site-packages/google_api_python_client-1.6.3-py3.6.egg/googleapiclient/http.py", line 1464, in execute
  File "/usr/lib/python3.6/site-packages/gmailieer-0.2-py3.6.egg/lieer/remote.py", line 251, in _cb
  File "/usr/lib/python3.6/site-packages/gmailieer-0.2-py3.6.egg/lieer/gmailieer.py", line 560, in _got_msg
  File "/usr/lib/python3.6/site-packages/gmailieer-0.2-py3.6.egg/lieer/local.py", line 326, in update_tags
lieer.local.RepositoryException: tried to update tags on non-existant file: /data/Mail/personal/mail/cur/14fbf1e049ce7d7c:2,

Re-running pull just so it can finish up the initial sync without error. The fetching messages stage takes around 12 mins each time. I hope that after the initial sync is successfully done, a pull wouldn't take that long.

P.S. Final mail size 7.4GB, ~160K emails.

manishrjain commented 6 years ago

Re: thread safety of the python client library, a Google search shows this: https://developers.google.com/api-client-library/python/guide/thread_safety

You just need to give a unique http connection per thread -- which is reasonable. If that helps increase the rate of download, it would be a huge win.
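A minimal sketch of the "one connection object per thread" idea, using only the standard library. `FakeHttp` is a stand-in invented for this example (in real use it might be an `httplib2.Http` instance), and the batches are made-up ids. Whether Google tolerates several concurrent batches without extra throttling is still an open question in this thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()  # holds one connection object per thread


class FakeHttp:
    """Stand-in for a per-thread connection object (hypothetical)."""

    def fetch(self, batch):
        return [f"content-of-{mid}" for mid in batch]


def get_http():
    # Create the connection lazily, once per worker thread.
    if not hasattr(_local, "http"):
        _local.http = FakeHttp()
    return _local.http


def fetch_batch(batch):
    http = get_http()
    # Also return identifiers so the per-thread reuse can be inspected.
    return http.fetch(batch), id(http), threading.get_ident()


# Four made-up batches of three ids each, fetched by two worker threads.
batches = [[f"m{i}" for i in range(j, j + 3)] for j in range(0, 12, 3)]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(fetch_batch, batches))
```

Every worker reuses its own object across batches, and no object is shared between threads.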

gauteh commented 6 years ago

Manish R Jain writes on september 15, 2017 1:34:

Re-running pull just so it can finish up the initial sync without error. The fetching messages stage takes around 12 mins each time. I hope that after the initial sync is successfully done, a pull wouldn't take that long.

It will, but consider that you just fetched the full list of e-mails: 160K. Admittedly, not much data, but it's a long list.

gauteh commented 6 years ago

Manish R Jain writes on september 15, 2017 1:36:

Re: thread safety of the python client library, a Google search shows this: https://developers.google.com/api-client-library/python/guide/thread_safety

You just need to give a unique http connection per thread -- which is reasonable. If that helps increase the rate of download, it would be a huge win.

Yeah, if google accepts multiple threads we should try that!

manishrjain commented 6 years ago

It will, but consider that you just fetched the full list of e-mails: 160K. Agreeably, not much data, but its a long list.

12 mins of sync to get a new email, ahem.. I'm not sure if that's practical.

gauteh commented 6 years ago

Manish R Jain writes on september 15, 2017 9:29:

It will, but consider that you just fetched the full list of e-mails: 160K. Agreeably, not much data, but its a long list.

12 mins of sync to get a new email, ahem.. I'm not sure if that's practical.

This is not partial sync: please refer to previous e-mails.

ASzc commented 6 years ago

I just had an achingly slow 6 day initial sync for 3.3 GB of mail. I was using my own API key.

$ gmi sync
push: everything is up-to-date.
pull: full synchronization (no previous synchronization state)
fetching messages: 175158it [04:14, 689.26it/s]                                                                                         
receiving content:  48%|███████████████████████████████▊                                   | 83230/175158 [46:54:56<70:40:57,  2.77s/it]
reducing batch request size to: 25
receiving content:  50%|█████████████████████████████████▎                                 | 87027/175158 [50:04:27<83:20:56,  3.40s/it]
reducing batch request size to: 12
receiving content: 100%|████████████████████████████████████████████████████████████████████| 175158/175158 [143:42:52<00:00,  3.91s/it]
receiving metadata: everything up-to-date.
current historyId: 2903277, current revision: 454462

The follow up sync is predicted to take 16 hours. As has been said here before, this is ridiculous.

Doesn't the official Gmail client use the API? What does it do? I assume it's closed-source, so Wireshark it?

gauteh commented 6 years ago

Alex Szczuczko writes on november 27, 2017 15:11:

The follow up sync is predicted to take 16 hours. As has been said here before, this is ridiculous.

What do you mean?

Doesn't the official Gmail client use the API? What does it do? I assume it's closed-source, so Wireshark it?

As far as I can tell it doesn't download the e-mails, it just syncs the metadata and downloads the messages on demand.

gauteh commented 6 years ago

If the IMAP protocol is faster at downloading messages it could be used together with the X-GM-MSGID extension to save the messages in place of the GMail API with gmailieer: https://developers.google.com/gmail/imap/imap-extensions#access_to_the_gmail_unique_message_id_x-gm-msgid.

ASzc commented 6 years ago

By ridiculous I mean that it's impractically long. Also, the amount of data I needed could be transferred in under an hour over a not-terrible internet connection (>10 mbps). Annoying that Google doesn't just offer a "give me everything" download option.
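The "under an hour" estimate can be checked with a quick calculation. This sketch assumes decimal units (1 GB = 10^9 bytes, 1 Mbps = 10^6 bits/s) and ignores protocol overhead; the 3.3 GB and 10 Mbps figures come from the comments above, and the function name is made up.

```python
def transfer_minutes(size_gb, link_mbps):
    """Ideal transfer time in minutes for `size_gb` over a `link_mbps` link."""
    bits = size_gb * 1e9 * 8             # bytes -> bits
    seconds = bits / (link_mbps * 1e6)   # bits / (bits per second)
    return seconds / 60


minutes = transfer_minutes(3.3, 10)
print(f"{minutes:.0f} minutes")
```

That comes to roughly three quarters of an hour, compared with the six days the API-based sync took.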

Hybrid IMAP could be a solution, although authentication for that might be an issue. As far as I know, it's user+pass only, no OAuth. Right now gmailieer could enable IMAP access, but not actually use it?

julian-klode commented 6 years ago

I don't understand why a follow up sync would take 16 hours for you. That seems weird. It should only take 1-2 seconds if nothing changed, as no messages are downloaded, only the list of changes is requested.

A full sync (with the emails available locally, so they are not re-downloaded, but all metadata is queried) on my 50000 emails takes like 20 minutes or so, you only have 3 times as many.

ASzc commented 6 years ago

The first follow up sync had to accommodate 6 days of new mail, as the initial sync took that long to run. Now that the syncs aren't taking so long, the delta is smaller and they run faster. It's a feedback loop.

gauteh commented 6 years ago

Alex Szczuczko writes on november 28, 2017 18:20:

By ridiculous I mean that it's impractically long. Also, the amount of data I needed could be transferred in under an hour over a not-terrible internet connection (>10 mbps). Annoying that Google doesn't just offer a "give me everything" download option.

Yeah - I got that part, was thinking about the follow-up sync.

Hybrid IMAP could be an solution, although authentication for that might be an issue. As far as I know, it's user+pass only, no oauth. Right now gmailieer could enable IMAP access, but not actually use it?

Exactly, it would be a bit of a hassle for users. At the moment I think a lot of users already have IMAP set up, though.

Actually... on closer look there is XOAUTH2, perhaps that could work: https://developers.google.com/gmail/imap/imap-smtp

gauteh commented 6 years ago

You could give #55 a shot, it uses IMAP to download messages. I got between 2-3 messages a second with this. Maybe someone with better IMAP knowledge can optimize this.

julian-klode commented 6 years ago

The output indicates that he was reaching 3-4 messages / second via http, so 2-3 does not seem like an improvement. Maybe you should try bumping the batch sizes up again, though Google says you'd get throttled more.

I mean, it only took me a few hours (an hour? I don't remember exactly) to sync my 48465 emails, about 986 MB large. 3 times that should not take like 30 times as long.

It starts at about 700-800 it/s with "fetching messages" (which I think fetches the ids?). Then it starts fetching content at 30 messages / second. It expects a full initial sync to take about 30 minutes.

gauteh commented 6 years ago

Julian Andres Klode writes on november 28, 2017 22:54:

The output indicates that he was reaching 3-4 messages / second via http, so 2-3 does not seem like an improvement. Maybe you should try bumping the batch sizes up again, though Google says you'd get throttled more.

I mean, it only took me a few hours (an hour? I don't remember exactly) to sync my 48465 emails, about 986 MB large. 3 times that should not take like 30 times as long.

It starts at about 700-800 it/s with "fetching messages" (which I think fetches the ids?). Then it starts fetching content at 30 messages / second. It expects a full initial sync to take about 30 minutes.

It's the 'receiving content' part that is slow (presumably?). From the output it seems that the batch sizes got throttled already, so no use in increasing them I think. The size now is the recommended one.

Gmailieer will increase the batch size back to the normal size once there have been several successful requests at the current batch size.
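The back-off behaviour described here can be sketched roughly as follows. This is not gmailieer's actual code: the class name, the halving step, and the number of successes needed are assumptions for illustration (halving is at least consistent with the 25 and 12 seen in the log output earlier in the thread).

```python
class BatchSizer:
    """Track a batch size that shrinks on throttling and recovers later."""

    def __init__(self, normal=50, min_size=1, successes_needed=10):
        self.normal = normal
        self.min_size = min_size
        self.successes_needed = successes_needed
        self.size = normal
        self._streak = 0  # consecutive successes at the current size

    def on_throttled(self):
        # Halve the batch size (assumed policy), never going below min_size.
        self.size = max(self.min_size, self.size // 2)
        self._streak = 0

    def on_success(self):
        self._streak += 1
        if self.size < self.normal and self._streak >= self.successes_needed:
            # Enough successes in a row: go back to the normal size.
            self.size = self.normal
            self._streak = 0


sizer = BatchSizer(normal=50, successes_needed=3)
sizer.on_throttled()  # 50 -> 25
sizer.on_throttled()  # 25 -> 12
for _ in range(3):
    sizer.on_success()
print(sizer.size)  # 50
```

A caller would invoke `on_throttled()` when a batch is rejected for rate limiting and `on_success()` after each batch that goes through.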

gauteh commented 6 years ago

There are also limits on how many times you can download your mailbox (before it gets severely throttled). I believe this limit is tied to your account, so if you have experimented with downloading using IMAP and then gmailieer a few times, things are going to get very slow!

Maybe he had enough messages to get throttled more than you or me as well. I think that I needed 4-8 hours to sync about 80k of emails (I don't remember any more).

fikovnik commented 6 years ago

In my case it was quick. My 160K emails of 12GB were synced in 2hrs. In comparison offlineimap took 14hrs and mbsync 10hrs.

gauteh commented 6 years ago

I have started a table with initial synchronization time in https://github.com/gauteh/gmailieer/wiki. If any of you have other experiences, please add them there! @fikovnik I have added your report.

In an effort to determine what causes severe throttling I have added a few fields which should be filled out. If you suspect that this is caused by other variables please let me know (though we should probably keep them at a minimum).

gauteh commented 6 years ago

Recent tests, on accounts where no big syncs have been performed lately, suggest that this is fairly resolved. Closing for now.

learmj commented 6 years ago

I'm currently on day 5 of my initial sync of 150K+ emails ~10GB, 256 notmuch tags. Batch request was reduced to 25, to 12, to 4 then to 1 where it stayed. I have a GSuite account which doesn't have API key access. Unless sync time improves drastically after the initial sync, I'll have to ditch gmailieer which would be very disappointing as the two-way tagging would have been perfect for me.

gauteh commented 6 years ago

Matthew Lear writes on april 13, 2018 11:59:

I'm currently on day 5 of my initial sync of 150K+ emails ~10GB, 256 notmuch tags. Batch request was reduced to 25, to 12, to 4 then to 1 where it stayed. I have a GSuite account which doesn't have API key access. Unless sync time improves drastically after the initial sync, I'll have to ditch gmailieer which would be very disappointing as the two-way tagging would have been perfect for me.

Have you checked the points in: https://github.com/gauteh/gmailieer/wiki ?

If you have synced your full account lately (either with gmailieer or other means, e.g. IMAP) your account is likely throttled, and you might get better results by stopping it for a day or two. Google does not provide good guidelines for this, as far as I am aware.

You can use your own API key with GSuite as well (I do). If you are not allowed to generate the API key with your specific GSuite account, then you can generate it with a regular google account.

The partial/incremental sync done after the full sync is usually done in a few seconds if you perform it frequently.

learmj commented 6 years ago

In your experience, how often is a periodic sync required in order to keep the sync duration short? I appreciate that this may depend on a few variables...


gauteh commented 6 years ago

During the day I sync every 2-3 minutes. Most of the time there are no changes, which takes about 0.5 seconds since only one request needs to be made to GMail.

To keep the refresh token and access tokens valid I think you need to sync more often than every two weeks.

The incremental history is only stored for a limited number of events at GMail. If it has expired (say you don't sync for three to four months), then you have to do a full sync. The full sync will not need to re-fetch the content, only repeat the first part of the sync, so it is equivalent to the first step when you first ran gmi. You can force this with gmi pull -f.

So, if you sync often enough for the history not to expire, it should be fast. I've never experienced an expired history, but I've never gone longer than maybe two weeks. One user reported that he had to do a full sync after a few months of inactivity. When fetching the actual messages you also get much faster download times for each message than what you see now. And as mentioned, you do not need to re-download anything you already have. If you get a lot of mail or changes to your labels, then you have to sync more often. I have a similar order of total mail as you, so it is probably similar.
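The decision between an incremental and a full sync can be sketched like this. Everything here is a toy model under stated assumptions: `FakeRemote`, `HistoryExpired`, `changes_since`, `list_all` and `pull` are hypothetical names, not gmailieer's or the Gmail API's.

```python
class HistoryExpired(Exception):
    """Raised by the stand-in remote when the stored history id is too old."""


class FakeRemote:
    """Toy remote that only remembers history from `oldest_history_id` on."""

    def __init__(self, oldest_history_id, message_ids):
        self.oldest_history_id = oldest_history_id
        self.message_ids = set(message_ids)

    def changes_since(self, history_id):
        if history_id < self.oldest_history_id:
            raise HistoryExpired(history_id)
        return []  # nothing changed in this toy example

    def list_all(self):
        return sorted(self.message_ids)


def pull(remote, local_ids, last_history_id):
    """Return (mode, ids whose content still has to be downloaded)."""
    if last_history_id is not None:
        try:
            changed = remote.changes_since(last_history_id)
            return "partial", [m for m in changed if m not in local_ids]
        except HistoryExpired:
            pass  # history too old: fall through to a full sync
    # Full sync: list everything, but only fetch content we lack locally.
    return "full", [m for m in remote.list_all() if m not in local_ids]


remote = FakeRemote(oldest_history_id=1000, message_ids={"a", "b", "c"})
recent = pull(remote, {"a", "b"}, last_history_id=1500)
stale = pull(remote, {"a", "b"}, last_history_id=10)
print(recent)  # ('partial', [])
print(stale)   # ('full', ['c'])
```

Even in the fallback case only the missing message is downloaded, which matches the point that a forced full sync does not re-fetch content already on disk.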

Matthew Lear writes on april 14, 2018 0:28:

In your experience, how often is a periodic sync required in order to keep the sync duration short? I appreciate that this may depend on a few variables...

On Fri, 13 Apr 2018, 12:44 Gaute Hope, notifications@github.com wrote:

Matthew Lear writes on april 13, 2018 11:59:

I'm currently on day 5 of my initial sync of 150K+ emails ~10GB, 256 notmuch tags. Batch request was reduced to 25, to 12, to 4 then to 1 where it stayed. I have a GSuite account which doesn't have API key access. Unless sync time improves drastically after the initial sync, I'll have to ditch gmailieer which would be very disappointing as the two-way tagging would have been perfect for me.

Have you checked the points in: https://github.com/gauteh/gmailieer/wiki ?

If you have synced your full account lately (either with gmaileer or other means, e.g. IMAP) your account is likely throttled, and you might get better results by stopping it for a day or two. Google does not provide good guidelines for this which I am aware of.

You can use your own API key with GSuite as well (I do). If you are not allowed to generate the API key with your specific GSuite account, then you can generate it with a regular google account.

The partial/incremental sync done after the full sync is usually done in a few seconds if you perform it frequently.


learmj commented 6 years ago

I think there is definitely something not right... At least, it seems so for me. My full sync finally completed at 03:33 this morning. My simple shell script, which calls gmi periodically, slept for 2 minutes and then ran another gmi sync. Four hours later, it's still going. I hadn't run notmuch new between syncs either, so all that had changed was on the server side. Can we add some sort of verbose status reporting to see where, and why, so much time is being spent?
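For reference, a periodic sync loop like the one described, with the duration of each run reported, might look like this in Python. It is a sketch under assumptions: `sync_loop` and `run_gmi` are made-up names, the 120 second pause matches the 2 minutes mentioned above, and the script is assumed to be started from the directory where gmi was initialised. The runner and the sleep function are injectable so the loop can be tried without touching GMail.

```python
import subprocess
import time
from typing import Callable, List, Optional, Sequence, Tuple


def run_gmi(cmd: Sequence[str]) -> int:
    """Run one sync in the current directory and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def sync_loop(
    cmd: Sequence[str] = ("gmi", "sync"),
    pause: float = 120.0,
    rounds: Optional[int] = None,
    run: Callable[[Sequence[str]], int] = run_gmi,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[int, float]]:
    """Run cmd repeatedly and record (exit code, seconds) for each run.

    With rounds=None the loop continues until it is interrupted.
    """
    results: List[Tuple[int, float]] = []
    n = 0
    while rounds is None or n < rounds:
        start = time.monotonic()
        code = run(cmd)
        elapsed = time.monotonic() - start
        print(f"run {n + 1}: exit code {code}, took {elapsed:.1f} s")
        results.append((code, elapsed))
        n += 1
        if rounds is None or n < rounds:
            sleep(pause)  # wait before the next sync
    return results


# Demonstration with a fake runner, so nothing is actually synced.
sync_loop(rounds=2, pause=0, run=lambda cmd: 0, sleep=lambda s: None)
```

Calling `sync_loop()` with the defaults would run gmi sync every two minutes until interrupted. The per-run timings at least show whether a single sync is slow or whether the runs are piling up.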


gauteh commented 6 years ago

Running notmuch new doesn't change anything. I think there is a --debug flag, as well as a dry-run flag, which should give you an idea of what is happening. If there are a lot of actions scheduled, there has usually been a batch change of many tags on either side.


gauteh commented 6 years ago

Could you paste the output from the ongoing sync?


learmj commented 6 years ago

It finished last night. Finally! I seemed to get a very low it/s rate when pushing, and the batch size usually gets reduced to 1. I updated my notmuch tags to tag about 10k+ mails and ran another sync. It took about 12 mins... I'll certainly monitor the behaviour and raise another ticket if I have issues. It seems inappropriate to post here since my initial sync is complete now.


gauteh commented 6 years ago

Good stuff. You have fast incremental syncs now? 10k+ changes in 12 mins seems pretty decent.
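As a rough back-of-the-envelope check of that figure, taking "10k+" as exactly 10,000 changes and "about 12 mins" as exactly 12 minutes (both are assumptions, so the real rate is somewhat different):

```python
changes = 10_000     # "10k+" tag changes, taken as a lower bound
seconds = 12 * 60    # "about 12 mins"
rate = changes / seconds
print(f"{rate:.1f} changes per second")
```

That works out to roughly 14 changes per second.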
