azerbini / eamena_dev

Arches is a web-based, geospatial information system for cultural heritage inventory and management.
GNU Affero General Public License v3.0

Resource exports fail for large datasets (>1000 resources) #32

Open azerbini opened 6 years ago

azerbini commented 6 years ago

Exporting to CSV, SHP, or JSON fails for large datasets. This is the same as this issue: https://github.com/azerbini/eamena_v3/issues/57

TeriForey commented 6 years ago

Hi @azerbini I'm having trouble recreating this error. With ~1300 resources I was able to export to JSON. With ~1500 I saw a low-memory error, but I think that is entirely due to the system I'm running EAMENA on. I may be unable to recreate the export error because my resources are extremely simple (I've been importing a very simple information resource repeatedly to get the resource numbers up). Would you be able to send me a more complex dataset that you believe would fail? Thank you.

azerbini commented 6 years ago

Hi Teri, I have sent you a much larger dump now. Let me know how work progresses on this. Cheers

TeriForey commented 6 years ago

Hi Andrea, this appears to be a Java memory problem (TransportError(500, u'OutOfMemoryError[Java heap space]') is output to the log).

Depending on the system you're running on, it could be fixed by exporting ES_HEAP_SIZE before starting Elasticsearch. It's recommended not to set this higher than 32GB or more than 50% of the available RAM (see the Elasticsearch guide). If you don't have sufficient RAM available to fix the issue that way, I might be able to use scroll to go through the search in sections. Let me know if you need the bigger fix!
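The scroll idea amounts to paging through the result set in fixed-size sections instead of asking Elasticsearch to materialise everything at once. A minimal, self-contained sketch of that pattern (using an in-memory stand-in for the search backend rather than a real Elasticsearch client, so the names here are illustrative only):

```python
def fetch_in_sections(fetch_page, page_size=500):
    """Yield results page by page, the way a scroll walks a large
    result set in fixed-size sections instead of one big request."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break
        for item in page:
            yield item
        offset += page_size

# Stand-in for the search backend: 1,234 fake resource IDs.
_all_ids = ["resource-%d" % i for i in range(1234)]

def fake_page(offset, size):
    return _all_ids[offset:offset + size]

exported = list(fetch_in_sections(fake_page, page_size=500))
```

With a real client the `fake_page` callable would be replaced by successive scroll requests, but the control flow is the same: peak memory is bounded by one page, not the whole export.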

azerbini commented 6 years ago

Hi Teri,

The heap size is set pretty high on our production machine, and the search still fails. Please do try to use scroll to break up the search into sections.

Cheers



TeriForey commented 6 years ago

I've added a pull request which should fix this issue. I still have memory issues when trying to serialize a large number of resources (though I'm running EAMENA on a small system), so let me know if you see that too and I'll see if I can break the JSON/CSV etc. writes into chunks.
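Breaking the writes into chunks could look roughly like this (a generic sketch, not the actual exporter code): serialize one resource at a time and stream it to the file handle, so peak memory stays proportional to a single resource rather than the whole export.

```python
import io
import json

def stream_json_array(resources, fh):
    """Write an iterable of resources as a JSON array, serializing
    one element at a time instead of building the list in memory."""
    fh.write("[")
    for i, resource in enumerate(resources):
        if i:
            fh.write(",")
        fh.write(json.dumps(resource))
    fh.write("]")

buf = io.StringIO()
stream_json_array(({"entityid": n} for n in range(3)), buf)
```

The same idea applies to CSV, where `csv.writer.writerow` already writes row by row.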

azerbini commented 6 years ago

@TeriForey please see comments to PR #34

TeriForey commented 6 years ago

@azerbini When exporting to JSON, instead of exporting the Elasticsearch results, the resource IDs are taken and each resource is pulled out of the database (including all child resources). This happens in arches/app/utils/data_management/resources/formats/archesjson.py line 41.

                a_resource = Resource().get(resource['_id'])

From my profiling it's clear that this is the bottleneck that's slowing everything down and causing the timeouts. There is a great deal of database querying, and the Entity.get() method takes up a large proportion of the cumulative time.

         138674275 function calls (127280583 primitive calls) in 233.759 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   240826   47.768    0.000   49.395    0.000 {method 'execute' of 'psycopg2._psycopg.cursor' objects}
   639122    8.657    0.000   21.904    0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/sql/query.py:213(clone)
   240825    4.956    0.000   41.983    0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/sql/compiler.py:64(as_sql)
   240826    4.837    0.000   60.035    0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/backends/util.py:66(execute)
   240358    4.739    0.000   10.469    0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/sql/compiler.py:255(get_default_columns)
  7313930    4.369    0.000    4.369    0.000 {hasattr}
   486880    4.263    0.000  133.436    0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/query.py:160(iterator)
   280220    3.755    0.000   29.202    0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/sql/query.py:1008(build_filter)
 12382906    3.598    0.000    5.088    0.000 {isinstance}
  1760844    3.497    0.000    3.958    0.000 /local/project/ENV/lib/python2.7/site-packages/django/utils/datastructures.py:127(__init__)
   241292    2.799    0.000    4.954    0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/sql/query.py:105(__init__)
   280220    2.665    0.000   12.792    0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/sql/where.py:166(make_atom)
   639122    2.559    0.000   25.670    0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/query.py:837(_clone)
   230854    2.486    0.000   92.376    0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/fields/related.py:297(__get__)
   280220    2.430    0.000    6.531    0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/sql/where.py:355(process)
   240358    2.400    0.000   14.148    0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/sql/compiler.py:173(get_columns)
  1041690    2.239    0.000   10.087    0.000 /local/project/ENV/lib/python2.7/site-packages/django/utils/tree.py:87(add)
   141792    2.142    0.000    7.204    0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/fields/related.py:1069(get_lookup_constraint)
  1760844    2.106    0.000    2.644    0.000 /local/project/ENV/lib/python2.7/site-packages/django/utils/datastructures.py:122(__new__)
   339458    2.089    0.000   32.660    0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/sql/query.py:1206(_add_q)
39794/398    2.014    0.000  223.182    0.561 /local/project/eamena_dev/arches/app/models/entity.py:82(get)
   280220    1.956    0.000    2.829    0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/sql/query.py:1377(trim_joins)
   339458    1.932    0.000    3.122    0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/query_utils.py:43(__init__)
  2861756    1.925    0.000    1.925    0.000 /local/project/ENV/lib/python2.7/site-packages/django/utils/tree.py:18(__init__)
  1616954    1.872    0.000    2.847    0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/operations.py:96(quote_name)
   486880    1.855    0.000  118.690    0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/sql/compiler.py:702(results_iter)
  1278244    1.832    0.000    4.021    0.000 /local/project/ENV/lib/python2.7/site-packages/django/db/models/sql/where.py:292(clone)
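For reference, a listing like the one above can be produced with the standard-library cProfile module. A small self-contained example (profiling a stand-in function here, not the real export loop):

```python
import cProfile
import io
import pstats

def example_export(n):
    # Stand-in for the export loop being profiled.
    return [str(i) for i in range(n)]

profiler = cProfile.Profile()
profiler.enable()
example_export(10000)
profiler.disable()

# Sort by internal time, as in the listing above, and keep the top rows.
out = io.StringIO()
stats = pstats.Stats(profiler, stream=out).sort_stats("tottime")
stats.print_stats(5)
report = out.getvalue()
```

Sorting by `cumtime` instead is what surfaces call chains like Entity.get() that are cheap per call but dominate overall.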

From what I can tell, all of the data that Resource().get() retrieves is already contained within the Elasticsearch result, so this whole step is probably unnecessary. However, the Elasticsearch result does contain a lot of extra information; for example it contains 'flat_child_entities' as well as 'child_entities'. When exporting a single resource, the Resource().get() method creates a 2,970-line JSON file while exporting the Elasticsearch resource directly creates a 17,807-line JSON file.

So my question is, if I remove the Resource().get() step and instead export the elasticsearch results directly do you think this will cause any problems? Will I need to remove fields from the elasticsearch resource so that it more closely resembles the current export?
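If fields do need removing, one option is to strip them from each hit's `_source` before serializing. A hedged sketch (the field name below comes from the comment above; the exact list of redundant fields and the hit structure are assumptions, not the real EAMENA schema):

```python
# Fields assumed redundant because they duplicate data held elsewhere
# in the document (per the discussion above).
EXTRA_FIELDS = ("flat_child_entities",)

def slim_hit(hit):
    """Return the Elasticsearch '_source' document with the redundant
    fields dropped, approximating the current export format."""
    doc = dict(hit["_source"])  # shallow copy so the hit is untouched
    for field in EXTRA_FIELDS:
        doc.pop(field, None)
    return doc

# Illustrative hit with made-up values.
hit = {"_id": "abc", "_source": {"child_entities": [1, 2],
                                 "flat_child_entities": [1, 2, 1, 2]}}
slimmed = slim_hit(hit)
```

This keeps the export a pure transformation of the search result, with no per-resource database round trips.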

azerbini commented 6 years ago

@TeriForey could you please re-reference 0bc2c68, as it has nothing to do with this issue? Also, have you experimented with removing Resource().get()? I am happy for you to go ahead with that.

TeriForey commented 6 years ago

@azerbini Oops! I've changed the commit message but I'm afraid it's cached the previous message now - sorry about that!

I've updated the PR with the Resource().get() call (actually the whole loop) removed.