Open azerbini opened 6 years ago
Hi @azerbini, I'm having trouble recreating this error. With ~1300 resources I was able to export to JSON. With ~1500 I saw a low-memory error, but I think that is entirely due to the system I'm running EAMENA on. I may be unable to recreate the export error because my resources are extremely simple (I've been importing a very simple information resource repeatedly to get the resource count up). Would you be able to send me a more complex dataset that you believe would fail? Thank you.
Hi Teri, I have sent you a much larger dump now. Let me know how work progresses on this. Cheers
Hi Andrea, this appears to be a Java memory problem (TransportError(500, u'OutOfMemoryError[Java heap space]') is output to the log).
Depending on the system you're running on, it could be fixed by exporting ES_HEAP_SIZE before starting Elasticsearch. It's recommended not to set this higher than 32GB or more than 50% of the available RAM (see the Elasticsearch guide: https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html). If you don't have sufficient RAM available to fix the issue, then I might be able to use scroll (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html) to go through the search in sections. Let me know if you need the bigger fix!
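For reference, going through the search in sections with scroll could look roughly like the sketch below. It assumes an elasticsearch-py style client whose `search` call accepts `scroll` and `size` and whose `scroll` call takes a `scroll_id`. The index name, page size, keep-alive value and the `StubClient` class are made up for illustration and are not taken from the EAMENA code.

```python
def iter_hits(client, index, query, page_size=500, keep_alive="2m"):
    """Yield every hit for `query`, fetching one page at a time via scroll."""
    page = client.search(index=index, body=query, scroll=keep_alive, size=page_size)
    while page["hits"]["hits"]:
        for hit in page["hits"]["hits"]:
            yield hit
        # Ask for the next page; an empty page ends the loop.
        page = client.scroll(scroll_id=page["_scroll_id"], scroll=keep_alive)


class StubClient(object):
    """Hypothetical stand-in for an Elasticsearch client, for trying the loop offline."""

    def __init__(self, docs):
        self.docs = docs
        self.pos = 0
        self.size = 0

    def _page(self):
        hits = self.docs[self.pos:self.pos + self.size]
        self.pos += len(hits)
        return {"_scroll_id": "stub", "hits": {"hits": hits}}

    def search(self, index, body, scroll, size):
        self.pos, self.size = 0, size
        return self._page()

    def scroll(self, scroll_id, scroll):
        return self._page()


docs = [{"_id": str(i)} for i in range(1234)]
query = {"query": {"match_all": {}}}
ids = [hit["_id"] for hit in iter_hits(StubClient(docs), "entity", query)]
print(len(ids))
```

Because the hits arrive as a generator, only one page of results needs to be held at a time on the client side, provided the caller does not collect them all into a list as the demo does.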
Hi Teri,
The heap size is set pretty high on our production machine, and the search still fails. Please do try to use scroll to break up the search into sections.
Cheers
I've added a pull request which should fix this issue. I still have issues with memory when trying to serialize a large number of resources (but I'm running eamena on a small system) so let me know if you see that too and I'll see if I can break the json/csv etc writes into chunks.
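As a rough idea of what breaking the JSON write into chunks might look like: the sketch below serializes one resource at a time into an open file-like object instead of building the whole document in memory first. The `write_json_stream` helper and the top-level `"resources"` key are assumptions for illustration, not the existing Arches writer.

```python
import io
import json


def write_json_stream(resources, out, key="resources"):
    """Write {"<key>": [...]} to `out`, serializing one resource at a time."""
    out.write('{%s: [' % json.dumps(key))
    for i, resource in enumerate(resources):
        if i:
            out.write(", ")
        out.write(json.dumps(resource))
    out.write("]}")


# Demo with a generator, so the full list never exists in memory.
buf = io.StringIO()
write_json_stream(({"entityid": str(n)} for n in range(3)), buf)
print(buf.getvalue())
```

The same pattern should carry over to CSV by writing rows as they are produced rather than collecting them first.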
@TeriForey please see comments to PR #34
@azerbini When exporting to JSON, the Elasticsearch results are not exported directly. Instead, the resource IDs are taken from them and each resource (including all of its child resources) is pulled out of the database. This happens in arches/app/utils/data_management/resources/formats/archesjson.py line 41.
a_resource = Resource().get(resource['_id'])
From my profiling it's clear that this is the bottleneck that's slowing everything down and thus causing timeouts. There is a lot of database querying, and the Entity.get() method accounts for most of the cumulative time (about 223 of the 234 seconds in the profile below).
138674275 function calls (127280583 primitive calls) in 233.759 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
240826 47.768 0.000 49.395 0.000 {method 'execute' of 'psycopg2._psycopg.cursor' objects}
639122 8.657 0.000 21.904 0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/sql/query.py:213(clone)
240825 4.956 0.000 41.983 0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/sql/compiler.py:64(as_sql)
240826 4.837 0.000 60.035 0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/backends/util.py:66(execute)
240358 4.739 0.000 10.469 0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/sql/compiler.py:255(get_default_columns)
7313930 4.369 0.000 4.369 0.000 {hasattr}
486880 4.263 0.000 133.436 0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/query.py:160(iterator)
280220 3.755 0.000 29.202 0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/sql/query.py:1008(build_filter)
12382906 3.598 0.000 5.088 0.000 {isinstance}
1760844 3.497 0.000 3.958 0.000 /local/project/ENV/lib/python2.7/site-packages/django/utils/datastructures.py:127(__init__)
241292 2.799 0.000 4.954 0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/sql/query.py:105(__init__)
280220 2.665 0.000 12.792 0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/sql/where.py:166(make_atom)
639122 2.559 0.000 25.670 0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/query.py:837(_clone)
230854 2.486 0.000 92.376 0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/fields/related.py:297(__get__)
280220 2.430 0.000 6.531 0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/sql/where.py:355(process)
240358 2.400 0.000 14.148 0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/sql/compiler.py:173(get_columns)
1041690 2.239 0.000 10.087 0.000 /local/project/ENV/lib/python2.7/site-packages/django/utils/tree.py:87(add)
141792 2.142 0.000 7.204 0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/fields/related.py:1069(get_lookup_constraint)
1760844 2.106 0.000 2.644 0.000 /local/project/ENV/lib/python2.7/site-packages/django/utils/datastructures.py:122(__new__)
339458 2.089 0.000 32.660 0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/sql/query.py:1206(_add_q)
39794/398 2.014 0.000 223.182 0.561 /local/project/eamena_dev/arches/app/models/entity.py:82(get)
280220 1.956 0.000 2.829 0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/sql/query.py:1377(trim_joins)
339458 1.932 0.000 3.122 0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/query_utils.py:43(__init__)
2861756 1.925 0.000 1.925 0.000 /local/project/ENV/lib/python2.7/site-packages/django/utils/tree.py:18(__init__)
1616954 1.872 0.000 2.847 0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/operations.py:96(quote_name)
486880 1.855 0.000 118.690 0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/sql/compiler.py:702(results_iter)
1278244 1.832 0.000 4.021 0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/sql/where.py:292(clone)
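A listing like the one above can be produced with the standard library's `cProfile` and `pstats`, sorted by internal time. The sketch below is only an illustration: `fake_export` is a hypothetical stand-in for the real export call, and the code targets Python 3 (under the project's Python 2.7 the in-memory stream would need a different StringIO class).

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report sorted by internal time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    # "tottime" is the time spent inside each function, excluding callees.
    stats.sort_stats("tottime").print_stats(30)
    return result, report.getvalue()


def fake_export(n):
    # Hypothetical stand-in for the real export call.
    return sum(i * i for i in range(n))


result, report = profile_call(fake_export, 10000)
print(report)
```

Sorting by `"cumulative"` instead would push callers such as `entity.py:82(get)` to the top, which can make the expensive entry point easier to spot.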
From what I can tell, all of the data that Resource().get() is retrieving is already contained within the elasticsearch result so this whole step is probably unnecessary. However, the elasticsearch result does contain a lot of extra information, for example it contains 'flat_child_entities' as well as 'child_entities'. When exporting a single resource, the Resource().get() method creates a 2,970 line JSON file while exporting the elasticsearch resource directly creates a 17,807 line JSON file.
So my question is, if I remove the Resource().get() step and instead export the elasticsearch results directly do you think this will cause any problems? Will I need to remove fields from the elasticsearch resource so that it more closely resembles the current export?
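If fields do need to be removed, a small recursive filter might be enough. The sketch below assumes each entity is a dict whose `child_entities` value is a list of nested entity dicts, and that `flat_child_entities` is the only field to drop. Both are guesses based on the field names mentioned above, and `prune_entity` is a hypothetical helper.

```python
def prune_entity(entity, drop=("flat_child_entities",)):
    """Return a copy of an entity dict without the keys in `drop`, at every nesting level."""
    pruned = {}
    for key, value in entity.items():
        if key in drop:
            continue
        if key == "child_entities":
            # Recurse so nested entities are filtered the same way.
            value = [prune_entity(child, drop) for child in value]
        pruned[key] = value
    return pruned


# Made-up example of an Elasticsearch document source.
source = {
    "entityid": "a",
    "flat_child_entities": [{"entityid": "b"}],
    "child_entities": [
        {"entityid": "b", "flat_child_entities": [], "child_entities": []},
    ],
}
print(prune_entity(source))
```

The original dict is left untouched, so the Elasticsearch results can still be used elsewhere after pruning.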
@TeriForey could you please rereference 0bc2c68 as it has nothing to do with this issue. Also, have you experimented with removing Resource().get() ? I am happy for you to go ahead with that.
@azerbini Oops! I've changed the commit message but I'm afraid it's cached the previous message now - sorry about that!
I've updated the PR with the Resource.get() - actually the whole loop - removed.
Exporting to csv, shp, json fails when large datasets are being exported. This is the same as this issue: https://github.com/azerbini/eamena_v3/issues/57