digital-guard / preserv

Digital Preservation Project
http://git.digital-guard.org/preserv
Apache License 2.0
0 stars 0 forks source link

ERROR: array size exceeds the maximum allowed (1073741823) #100

Closed 0e1 closed 2 years ago

0e1 commented 2 years ago

O processo de ingestão do layer via (e provavelmente do layer geoaddress quando se fizer union com centróides dos building do OpenStreetMap) de _pk0004.03 retornou o erro

ERROR: array size exceeds the maximum allowed (1073741823)

no contexto

CONTEXT:  SQL function "feature_asis_assign_volume" statement 1
SQL statement "UPDATE ingest.donated_PackComponent
    SET proc_step=2,   -- if insert process occurs after q_query.
        lineage = lineage || ingest.feature_asis_assign(q_file_id) || 
        jsonb_build_object('statistics',(stats || stats_dup || ARRAY[num_items-stats_dup[1]+stats_dup[3]]) )
    WHERE id=q_file_id"
PL/pgSQL function ingest.osm_load(text,text,text,bigint,text,text[],integer,text,boolean) line 240 at SQL statement
make: *** [makefile:150: via] Error 1

provavelmente quando se aplica _stcollect na função feature_asis_assign_volume: https://github.com/digital-guard/preserv/blob/main/src/ingest-step1-ini.sql#L533

São quase 5 milhões de vias:

ingest88=# SELECT count(*)  FROM jplanet_osm_line WHERE tags->>'highway' IN ('residential','unclassified','tertiary','secondary','primary','trunk','motorway');
  count  
---------
 4856390
(1 row)

Os limites do banco de dados podem serem vistos em: https://www.postgresql.org/docs/current/limits.html

Avaliar como contornar.

Algumas opções:

  1. Usar os recortes das 5 grandes regiões brasileiras, disponibilizadas pela geofabrik: https://download.geofabrik.de/south-america/brazil.html
  2. Produzir 27 recortes, um para cada cada estado + distrito federal, a partir do recorte nacional disponibilizado pela Geofabrik. Com isso, fazer a ingestão de cada um deles.

Como não consideramos as 5 regiões brasileiras como um nível jurisdicional e o problema poderia persistir (especialmente na região sudeste) acho que a opção 2 mais razoável.

0e1 commented 2 years ago

Resumo da ingestão de via e geoaddress, usando a flag:


 From file_id=5 inserted type=via_full.                                 +
                                                                        +
         Statistics:                                                    +
 .                                                                      +
         Before deduplication:                                          +
                                                                        +
         Originals: 4856390 items.                                      +
                                                                        +
         Not Intersecs: 55559 items.                                    +
                                                                        +
         Invalid: 0 items.                                              +
                                                                        +
         Not simple: 1834 items.                                        +
                                                                        +
         Empty: 1671 items.                                             +
                                                                        +
         Small: 3140 items.                                             +
                                                                        +
         Null: 0 items.                                                 +
                                                                        +
         Invalid geometry type: 0 items.                                +
                                                                        +
         Not closed: 0 items.                                           +
                                                                        +
         Inserted in feature_asis: 4795887 items.                       +
                                                                        +
         Inserted in feature_asis_discarded: 60503 items.               +
                                                                        +
                                                                        +
         After deduplication:                                           +
                                                                        +
         Removed duplicates from feature_asis: 499737 items.            +
                                                                        +
         Inserted in feature_asis_discarded (duplicates): 499737 items. +
                                                                        +
         Inserted in feature_asis (aggregated duplicates): 247510 items.+
                                                                        +
         Resulting in feature_asis: 4543660                             +

------------------------------------------------------------------------
 From file_id=6 inserted type=geoaddress_full.                         +
                                                                       +
         Statistics:                                                   +
 .                                                                     +
         Before deduplication:                                         +
                                                                       +
         Originals: 1101237 items.                                     +
                                                                       +
         Not Intersecs: 91705 items.                                   +
                                                                       +
         Invalid: 0 items.                                             +
                                                                       +
         Not simple: 0 items.                                          +
                                                                       +
         Empty: 0 items.                                               +
                                                                       +
         Small: 0 items.                                               +
                                                                       +
         Null: 0 items.                                                +
                                                                       +
         Invalid geometry type: 0 items.                               +
                                                                       +
         Not closed: 0 items.                                          +
                                                                       +
         Inserted in feature_asis: 1009532 items.                      +
                                                                       +
         Inserted in feature_asis_discarded: 91705 items.              +
                                                                       +
                                                                       +
         After deduplication:                                          +
                                                                       +
         Removed duplicates from feature_asis: 23696 items.            +
                                                                       +
         Inserted in feature_asis_discarded (duplicates): 23696 items. +
                                                                       +
         Inserted in feature_asis (aggregated duplicates): 11662 items.+
                                                                       +
         Resulting in feature_asis: 997498                             +
0e1 commented 2 years ago

A publicação dos dados de geoaddress ocorreu sem problemas, como pode ser vista em https://github.com/digital-guard/preservCutGeo-BR2021/tree/main/data/_pk0004.03/geoaddress.

No entanto, a publicação de via, retornou:

--- Gerando arquivos em /var/gits/_dg/preservCutGeo-BR2021/data/_pk0004.03/via ---
psql postgres://postgres@localhost/ingest88 -c "SELECT ingest.publicating_geojsons('via','BR','/var/gits/_dg/preservCutGeo-BR2021/data/_pk0004.03/via','1',9,'3',true);"
SSL SYSCALL error: EOF detected
connection to server was lost
make: *** [makefile:186: publicating_geojsons_via] Error 2
0e1 commented 2 years ago

Executando manualmente o processo de publicação:

  1. ingest.publicating_geojsons_p1 executa sem problemas:

    ingest88=# SELECT ingest.publicating_geojsons_p1(5::bigint,'BR');
    
    publicating_geojsons_p1 
    -------------------------
    p1
    (1 row)
  2. ingest.publicating_geojsons_p3 Lembrar que ingest.publicating_geojsons_p2 não é mais executada! Ao executar manualmente o início de ingest.publicating_geojsons_p3:

ingest88=#     INSERT INTO ingest.publicating_geojsons_p2distrib
        SELECT t.hcode, t.n_items, t.n_keys,  t.jj, -- length(t.hcode) AS len,
        ST_Intersection(
            ST_SetSRID( ST_geomFromGeohash(replace(t.hcode, '*', '')) ,  4326),
            (SELECT geom FROM ingest.vw01full_jurisdiction_geom WHERE isolabel_ext='BR')
        ) AS geom
        FROM hcode_distribution_reduce_recursive_raw_alt(
            ((SELECT jsonb_object_agg(kx_ghs9,(CASE (SELECT geomtype FROM ingest.vw03full_layer_file WHERE id=5) WHEN 'point' THEN 1::bigint ELSE ((info->'bytes')::bigint) END) ) FROM ingest.publicating_geojsons_p3exprefix)),
            1,
            (SELECT length((geohash_cover_list(geom))[1]) FROM ingest.vw01full_jurisdiction_geom WHERE isolabel_ext='BR'),
            9,
            (SELECT (lineage->'hcode_distribution_parameters'->'p_threshold_sum')::int FROM ingest.donated_PackComponent WHERE id= 5),
            (CASE (SELECT geomtype FROM ingest.vw03full_layer_file WHERE id=5) WHEN 'point' THEN 1000::int ELSE 102400::int END)
        ) t
    ;

SSL SYSCALL error: EOF detected
The connection to the server was lost. Attempting reset: Failed.

Parece que é com a execução de _hcode_distribution_reduce_recursive_rawalt.

0e1 commented 2 years ago

O problema estava relacionado a quantidade de memória sendo usada pelo processo de geração da distribuição. Os ajustes feitos em https://github.com/AddressForAll/pg_pubLib-v1/commit/babc2b339d5411828733fee8ab92e7014021a122 adequaram o uso da memória à realidade do servidor.