ecederstrand / exchangelib

Python client for Microsoft Exchange Web Services (EWS)
BSD 2-Clause "Simplified" License

MemoryError when using export #551

Closed bencleary closed 5 years ago

bencleary commented 5 years ago

Hi,

Firstly, thank you for this library! I have been using it to help me merge Office 365 tenancies. Contacts work great, and calendars as well, but when I start using export and upload for emails it throws a MemoryError at around 250 emails in. The account I have been working on only has 270 items in the inbox, so I am wondering if you know of anything I can do to stop that from happening?

I will post my code tomorrow, as I am on my phone at the moment, but any advice would be appreciated.

Thanks 👍

ecederstrand commented 5 years ago

That depends on where the MemoryError comes from. You'll have to post the stack trace to track this down.

bencleary commented 5 years ago

Thanks for getting back in touch. Here is the code; I am just running it now to get the stack trace. I am using a while loop because when I try to use .all() or .iterator() directly I get a "connection forcibly closed" error, and slicing the query into smaller chunks seems to help. (I know the slicing here is rudimentary; this is a test account rather than a live one, so missing items are fine.) Like I said, calendars and contacts have worked fine. It's just emails where I run into problems.

class MigrateInbox(MigrationConfig):

    def inbox_count(self):
        return self.current_account.inbox.total_count

    def migrated_inbox_count(self):
        self.target_account.inbox.refresh()
        return self.target_account.inbox.total_count

    def migrate_inbox(self):
        count = self.inbox_count()
        print(f"Current Email (Inbox) Count - {count}")
        folder = self.current_account.inbox
        pagesize = 5
        index = 0
        current = 0
        while index < count:
            current += pagesize
            items = folder.all().only('mime_content')[index:current]
            data = self.current_account.export(items)
            self.bulk_migrate(folder=self.target_account.inbox, upload=data) # just a wrapper for upload
            index += pagesize
            print(f"index -> {index} ---- current -> {current}")

bencleary commented 5 years ago

Here is the stack trace from when I ran it this morning:

C:\Users\benja\.virtualenvs\office_365_tools-UdUAg9MW\Scripts\python.exe C:/Users/benja/Development/office_365_tools/ews-test.py
Current Email (Inbox) Count - 265
index -> 5 ---- current -> 5
index -> 10 ---- current -> 10
index -> 15 ---- current -> 15
index -> 20 ---- current -> 20
index -> 25 ---- current -> 25
index -> 30 ---- current -> 30
index -> 35 ---- current -> 35
index -> 40 ---- current -> 40
index -> 45 ---- current -> 45
index -> 50 ---- current -> 50
index -> 55 ---- current -> 55
index -> 60 ---- current -> 60
index -> 65 ---- current -> 65
index -> 70 ---- current -> 70
index -> 75 ---- current -> 75
index -> 80 ---- current -> 80
index -> 85 ---- current -> 85
index -> 90 ---- current -> 90
index -> 95 ---- current -> 95
index -> 100 ---- current -> 100
index -> 105 ---- current -> 105
index -> 110 ---- current -> 110
index -> 115 ---- current -> 115
index -> 120 ---- current -> 120
index -> 125 ---- current -> 125
index -> 130 ---- current -> 130
index -> 135 ---- current -> 135
index -> 140 ---- current -> 140
index -> 145 ---- current -> 145
index -> 150 ---- current -> 150
index -> 155 ---- current -> 155
index -> 160 ---- current -> 160
index -> 165 ---- current -> 165
index -> 170 ---- current -> 170
index -> 175 ---- current -> 175
index -> 180 ---- current -> 180
index -> 185 ---- current -> 185
index -> 190 ---- current -> 190
index -> 195 ---- current -> 195
index -> 200 ---- current -> 200
index -> 205 ---- current -> 205
index -> 210 ---- current -> 210
index -> 215 ---- current -> 215
index -> 220 ---- current -> 220
index -> 225 ---- current -> 225
index -> 230 ---- current -> 230
index -> 235 ---- current -> 235
EWS https://outlook.office365.com/EWS/Exchange.asmx, account XXXX HIDDEN XXXX: Exception in _get_elements: Traceback (most recent call last):
  File "C:\Users\benja\.virtualenvs\office_365_tools-UdUAg9MW\lib\site-packages\exchangelib\services.py", line 89, in _get_elements
    response = self._get_response_xml(payload=payload)
  File "C:\Users\benja\.virtualenvs\office_365_tools-UdUAg9MW\lib\site-packages\exchangelib\services.py", line 171, in _get_response_xml
    res = self._get_soap_payload(response=r, **parse_opts)
  File "C:\Users\benja\.virtualenvs\office_365_tools-UdUAg9MW\lib\site-packages\exchangelib\services.py", line 260, in _get_soap_payload
    root = to_xml(response.iter_content())
  File "C:\Users\benja\.virtualenvs\office_365_tools-UdUAg9MW\lib\site-packages\exchangelib\util.py", line 365, in to_xml
    return parse(stream, parser=forgiving_parser)
  File "C:\Users\benja\.virtualenvs\office_365_tools-UdUAg9MW\lib\site-packages\defusedxml\lxml.py", line 134, in parse
    elementtree = _etree.parse(source, parser, base_url=base_url)
  File "src\lxml\etree.pyx", line 3424, in lxml.etree.parse
  File "src\lxml\parser.pxi", line 1857, in lxml.etree._parseDocument
  File "C:\Users\benja\.virtualenvs\office_365_tools-UdUAg9MW\lib\site-packages\exchangelib\util.py", line 340, in getvalue
    res = b''.join(self._bytes_generator)
MemoryError

Traceback (most recent call last):
  File "C:/Users/benja/Development/office_365_tools/ews-test.py", line 10, in <module>
    EmailMigration(old_account=current, new_account=target).migrate_inbox()
  File "C:\Users\benja\Development\office_365_tools\office_365_migration\email_migration.py", line 89, in migrate_inbox
    data = self.current_account.export(items)
  File "C:\Users\benja\.virtualenvs\office_365_tools-UdUAg9MW\lib\site-packages\exchangelib\account.py", line 320, in export
    self._consume_item_service(service_cls=ExportItems, items=items, chunk_size=chunk_size, kwargs=dict())
  File "C:\Users\benja\.virtualenvs\office_365_tools-UdUAg9MW\lib\site-packages\exchangelib\account.py", line 302, in _consume_item_service
    is_empty, items = peek(items)
  File "C:\Users\benja\.virtualenvs\office_365_tools-UdUAg9MW\lib\site-packages\exchangelib\util.py", line 118, in peek
    first = next(iterable)
  File "C:\Users\benja\.virtualenvs\office_365_tools-UdUAg9MW\lib\site-packages\exchangelib\queryset.py", line 298, in __iter__
    for val in self._format_items(items=self._query(), return_format=self.return_format):
  File "C:\Users\benja\.virtualenvs\office_365_tools-UdUAg9MW\lib\site-packages\exchangelib\queryset.py", line 375, in _item_yielder
    for i in iterable:
  File "C:\Users\benja\.virtualenvs\office_365_tools-UdUAg9MW\lib\site-packages\exchangelib\account.py", line 580, in fetch
    shape=ID_ONLY,
  File "C:\Users\benja\.virtualenvs\office_365_tools-UdUAg9MW\lib\site-packages\exchangelib\account.py", line 308, in _consume_item_service
    for i in service_cls(account=self, chunk_size=chunk_size).call(**kwargs):
  File "C:\Users\benja\.virtualenvs\office_365_tools-UdUAg9MW\lib\site-packages\exchangelib\services.py", line 676, in _pool_requests
    elems = r.get()
  File "c:\python37\Lib\multiprocessing\pool.py", line 657, in get
    raise self._value
  File "c:\python37\Lib\multiprocessing\pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\benja\.virtualenvs\office_365_tools-UdUAg9MW\lib\site-packages\exchangelib\services.py", line 656, in <lambda>
    lambda c: self._get_elements(payload=payload_func(c, **kwargs)),
  File "C:\Users\benja\.virtualenvs\office_365_tools-UdUAg9MW\lib\site-packages\exchangelib\services.py", line 89, in _get_elements
    response = self._get_response_xml(payload=payload)
  File "C:\Users\benja\.virtualenvs\office_365_tools-UdUAg9MW\lib\site-packages\exchangelib\services.py", line 171, in _get_response_xml
    res = self._get_soap_payload(response=r, **parse_opts)
  File "C:\Users\benja\.virtualenvs\office_365_tools-UdUAg9MW\lib\site-packages\exchangelib\services.py", line 260, in _get_soap_payload
    root = to_xml(response.iter_content())
  File "C:\Users\benja\.virtualenvs\office_365_tools-UdUAg9MW\lib\site-packages\exchangelib\util.py", line 365, in to_xml
    return parse(stream, parser=forgiving_parser)
  File "C:\Users\benja\.virtualenvs\office_365_tools-UdUAg9MW\lib\site-packages\defusedxml\lxml.py", line 134, in parse
    elementtree = _etree.parse(source, parser, base_url=base_url)
  File "src\lxml\etree.pyx", line 3424, in lxml.etree.parse
  File "src\lxml\parser.pxi", line 1857, in lxml.etree._parseDocument
  File "C:\Users\benja\.virtualenvs\office_365_tools-UdUAg9MW\lib\site-packages\exchangelib\util.py", line 340, in getvalue
    res = b''.join(self._bytes_generator)
MemoryError

bencleary commented 5 years ago

Switching to 64-bit Python solved the MemoryError. From some memory profiling I can see that at around the 240 mark memory usage spikes to 3.5 GB, which I guess is near the limit for 32-bit Python, and using 64-bit solved that. I am just wondering whether there are any other options you could advise for large volumes of email. For instance, for a mailbox of about 10 GB, are there any other methods in this library that could help speed up the querying, or a way to change the cache from in-memory to disk (I know speed would take a hit, but memory use would be better)? I know the export method is heavy because its output is encoded strings, but is there anything else you can suggest?
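
For anyone hitting the same wall, a quick way to confirm which build of Python is running is to check the interpreter's pointer size. This is only a minimal sketch using the standard library; the helper name `python_bitness` is made up for illustration. A 32-bit process can address at most 4 GiB, and usually less in practice, so a multi-gigabyte spike will fail there.

```python
import struct
import sys


def python_bitness() -> int:
    """Return the pointer size of the running interpreter in bits."""
    # "P" is the struct format code for a C pointer (void *).
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    bits = python_bitness()
    print(f"Running {bits}-bit Python")
    # sys.maxsize is 2**31 - 1 on 32-bit builds and 2**63 - 1 on 64-bit builds.
    print(f"sys.maxsize > 2**32: {sys.maxsize > 2**32}")
```

Both checks should agree; on a 32-bit build, installing a 64-bit interpreter is the simplest way to lift the address-space ceiling.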

ecederstrand commented 5 years ago

The export() method doesn't need a full item, just the item ID of the messages to export. So instead of items = folder.all().only('mime_content')[index:current] you could do just items = folder.all().only('item_id', 'changekey')[index:current]. That would reduce the memory pressure somewhat.

It would be great if you could run a memory profiler over your code to pinpoint what is consuming all the memory. The stack trace is not very helpful because it crashes at the point you run out of memory, which is not necessarily where the bulk of the memory is being consumed.
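
One way to do that with only the standard library is `tracemalloc`. The following is a rough sketch rather than a recipe specific to exchangelib: `profile_peak` and `make_blob` are hypothetical names, and `make_blob` merely stands in for whatever call is suspected of using the memory. Note that `tracemalloc` only sees allocations made through Python's allocators, so memory allocated directly by C libraries may not show up.

```python
import tracemalloc


def profile_peak(func, *args, **kwargs):
    """Run func and return (result, peak_bytes, top_allocation_stats)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # Snapshot while the result is still alive so it appears in the stats.
        snapshot = tracemalloc.take_snapshot()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    # Group live allocations by source line and keep the five largest.
    top = snapshot.statistics("lineno")[:5]
    return result, peak, top


def make_blob(size):
    """Stand-in workload: allocate one bytes object of the given size."""
    return b"x" * size


if __name__ == "__main__":
    _, peak, top = profile_peak(make_blob, 10_000_000)
    print(f"peak traced memory: {peak / 1e6:.1f} MB")
    for stat in top:
        print(stat)
```

Wrapping a single loop iteration of the migration in `profile_peak` would show which source lines hold the most memory at that point.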

Is it possible that some of your items contain huge attachments? You could try exporting just one item at a time and then dump to disk:

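# Export one item at a time so only a single payload is held in memory.
for i, item in enumerate(account.inbox.all().only('item_id', 'changekey'), start=1):
    data = account.export([item])[0]
    # Write each exported item to its own file on disk.
    with open('item%s.dat' % i, 'w') as f:
        f.write(data)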
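
The same idea can be generalized to small batches, so that only one batch of exported payloads is held in memory at a time. This is a sketch under assumptions, not the library's own mechanism: `batched` and `migrate_in_batches` are hypothetical helpers, and the `export` and `upload` arguments are placeholder callables standing in for something like `account.export(...)` and a wrapper around upload on the target account.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(iterable: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items from `iterable`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def migrate_in_batches(
    item_ids: Iterable[T],
    export: Callable[[List[T]], list],
    upload: Callable[[list], object],
    batch_size: int = 5,
) -> int:
    """Export and upload items batch by batch; return the number moved."""
    moved = 0
    for batch in batched(item_ids, batch_size):
        data = export(batch)  # only this batch's payloads are in memory
        upload(data)
        moved += len(data)
    return moved


if __name__ == "__main__":
    # Fake export/upload functions so the sketch runs without a mail server.
    uploaded = []
    count = migrate_in_batches(
        range(12),
        export=lambda ids: [f"blob-{i}" for i in ids],
        upload=uploaded.extend,
        batch_size=5,
    )
    print(count, len(uploaded))
```

Because `batched` consumes its input lazily, the item IDs never need to be loaded into a list up front, which avoids re-running a sliced query for every chunk.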

bencleary commented 5 years ago

Well, since your last answer I haven't experienced the memory error. Changing to item_id and changekey has really reduced the memory usage, and it rarely goes over 400 MB now. I think you can chalk this up to user error.

On the plus side, I have successfully migrated two existing Office 365 tenants into one using it, with total mailbox data of around 25 GB per tenant, so that has saved me a lot of time and hassle! Thanks 👍

bencleary commented 5 years ago

I will mark this as closed now, as it's all working fine.

ecederstrand commented 5 years ago

Glad to get successful reports for this code path with a significant data volume!