ckan / ckanext-xloader

Express Loader - quickly load data into DataStore. A replacement for DataPusher.
GNU Affero General Public License v3.0
46 stars 50 forks source link

Update direct csv fast load to handle empty ending lines on csv and other edge conditions #206

Closed duttonw closed 6 months ago

duttonw commented 9 months ago

Example logs from importing 40mb csv file with 400,000+ rows.

https://www.data.qld.gov.au/dataset/5efaa096-4480-4540-88be-a10ababd9f49/resource/a14317b7-2fca-41b7-8294-9a1f7a085b0f

deleting "a14317b7-2fca-41b7-8294-9a1f7a085b0f" from DataStore.
09:57:34,304 INFO  [e0db9009-2f3f-4a5d-b2bd-3f5018581f08] Deleting "a14317b7-2fca-41b7-8294-9a1f7a085b0f" from DataStore.
09:57:34,414 DEBUG [ckanext.datastore.logic.action] Setting datastore_active=False on resource a14317b7-2fca-41b7-8294-9a1f7a085b0f
/mnt/local_data/ckan_venv/src/ckan/ckanext/datastore/logic/action.py:646: SAWarning: TypeDecorator JsonDictType() will not produce a cache key because the ``cache_ok`` attribute is not set to True
.  This can have significant performance implications including some performance degradations in comparison to prior SQLAlchemy versions.  Set this attribute to True if this type object's state is
 safe to use in a cache key, or False to disable this warning. (Background on this error at: https://sqlalche.me/e/14/cprf)
  q.update({'extras': extras}, synchronize_session=False)
Fields: [{'id': 'NOTIFICATION_DATE', 'type': 'text', 'info': {'notes': 'Date Queensland Health were notified of first COVID-19 detection from case.', 'type_override': '', 'label': 'Notification Da
te'}}, {'id': 'AGE_GROUP_5Y', 'type': 'text', 'info': {'notes': 'The five-year age group according to age of case at time of detection.', 'type_override': '', 'label': ''}}]
09:57:34,919 INFO  [e0db9009-2f3f-4a5d-b2bd-3f5018581f08] Fields: [{'id': 'NOTIFICATION_DATE', 'type': 'text', 'info': {'notes': 'Date Queensland Health were notified of first COVID-19 detection f
rom case.', 'type_override': '', 'label': 'Notification Date'}}, {'id': 'AGE_GROUP_5Y', 'type': 'text', 'info': {'notes': 'The five-year age group according to age of case at time of detection.',
'type_override': '', 'label': ''}}]
09:57:34,981 DEBUG [ckanext.datastore.logic.action] Setting datastore_active=True on resource a14317b7-2fca-41b7-8294-9a1f7a085b0f
Copying to database...
09:57:35,269 INFO  [e0db9009-2f3f-4a5d-b2bd-3f5018581f08] Copying to database...
missing data for column "AGE_GROUP_5Y"
CONTEXT:  COPY a14317b7-2fca-41b7-8294-9a1f7a085b0f, line 1761313: ""

09:57:41,240 WARNI [e0db9009-2f3f-4a5d-b2bd-3f5018581f08] missing data for column "AGE_GROUP_5Y"
CONTEXT:  COPY a14317b7-2fca-41b7-8294-9a1f7a085b0f, line 1761313: ""

Load using COPY failed: Error during the load into PostgreSQL: missing data for column "AGE_GROUP_5Y"
CONTEXT:  COPY a14317b7-2fca-41b7-8294-9a1f7a085b0f, line 1761313: ""

09:57:41,248 WARNI [e0db9009-2f3f-4a5d-b2bd-3f5018581f08] Load using COPY failed: Error during the load into PostgreSQL: missing data for column "AGE_GROUP_5Y"
CONTEXT:  COPY a14317b7-2fca-41b7-8294-9a1f7a085b0f, line 1761313: ""

Trying again with tabulator
09:57:41,253 INFO  [e0db9009-2f3f-4a5d-b2bd-3f5018581f08] Trying again with tabulator
Determining column names and types
09:57:41,259 INFO  [e0db9009-2f3f-4a5d-b2bd-3f5018581f08] Determining column names and types
duttonw commented 9 months ago

similar issue https://www.data.qld.gov.au/dataset/ambient-estuarine-water-quality-monitoring-data-near-real-time-sites-2012-to-present-day/resource/5d1e8368-7ec3-435a-92d0-280ad1e3db0d?truncate=30&inner_span=True

aving chunk 64
10:57:08,871 INFO  [86b1fa53-ffb9-4d52-9042-afb9b66ed3d6] Saving chunk 64
10:57:09,717 ERROR [ckanext.xloader.jobs] xloader error: list index out of range, Traceback (most recent call last):
  File "/mnt/local_data/ckan_venv/src/ckanext-xloader/ckanext/xloader/loader.py", line 293, in load_csv
    f)
psycopg2.errors.BadCopyFileFormat: extra data after last expected column
CONTEXT:  COPY 5d1e8368-7ec3-435a-92d0-280ad1e3db0d, line 16306: "2023-05-14 12:10:00,129534,,,,,,,,,,,-20.169,,148.464,,,,0,,,,,,264.4,,,,,,3.187,,,,,,,,,,,,,,,,,,,,..."

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/local_data/ckan_venv/src/ckanext-xloader/ckanext/xloader/jobs.py", line 222, in xloader_data_into_datastore_
    direct_load()
  File "/mnt/local_data/ckan_venv/src/ckanext-xloader/ckanext/xloader/jobs.py", line 173, in direct_load
    logger=logger)
  File "/mnt/local_data/ckan_venv/src/ckanext-xloader/ckanext/xloader/loader.py", line 301, in load_csv
    ' {}'.format(error_str))
ckanext.xloader.job_exceptions.LoaderError: Error during the load into PostgreSQL: extra data after last expected column
CONTEXT:  COPY 5d1e8368-7ec3-435a-92d0-280ad1e3db0d, line 16306: "2023-05-14 12:10:00,129534,,,,,,,,,,,-20.169,,148.464,,,,0,,,,,,264.4,,,,,,3.187,,,,,,,,,,,,,,,,,,,,..."

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/local_data/ckan_venv/src/ckanext-xloader/ckanext/xloader/jobs.py", line 80, in xloader_data_into_datastore
    xloader_data_into_datastore_(input, job_dict)
  File "/mnt/local_data/ckan_venv/src/ckanext-xloader/ckanext/xloader/jobs.py", line 226, in xloader_data_into_datastore_
    tabulator_load()
  File "/mnt/local_data/ckan_venv/src/ckanext-xloader/ckanext/xloader/jobs.py", line 195, in tabulator_load
    logger=logger)
  File "/mnt/local_data/ckan_venv/src/ckanext-xloader/ckanext/xloader/loader.py", line 454, in load_table
    for i, records in enumerate(chunky(result, 250)):
  File "/mnt/local_data/ckan_venv/src/ckanext-xloader/ckanext/xloader/loader.py", line 513, in chunky
    item = list(itertools.islice(it, n))
  File "/mnt/local_data/ckan_venv/src/ckanext-xloader/ckanext/xloader/loader.py", line 413, in row_iterator
    data_row[headers[index]] = cell
IndexError: list index out of range
duttonw commented 9 months ago

another blocking fast load:

https://www.data.qld.gov.au/dataset/queensland-covid-19-case-line-list-location-source-of-infection/resource/1dbae506-d73c-4c19-b727-e8654b8be95a

'23505'}}
11:13:22,309 WARNI [2b90e4ed-21ff-4d12-85f8-a78da6e47e09] Load using COPY failed: Validation error when creating the database table: None - {'constraints': ['Cannot insert records or create indexbecause of uniqueness constraint'], 'info': {'orig': 'duplicate key value violates unique constraint "pg_type_typname_nsp_index"\nDETAIL:  Key (typname, typnamespace)=(1dbae506-d73c-4c19-b727-e8654b8be95a__id_seq, 17092) already exists.\n', 'pgcode': '23505'}}
Trying again with tabulator
11:13:22,317 INFO  [2b90e4ed-21ff-4d12-85f8-a78da6e47e09] Trying again with tabulator
Determining column names and types
11:13:22,327 INFO  [2b90e4ed-21ff-4d12-85f8-a78da6e47e09] Determining column names and types
load_table: Decoded encoding: {'encoding': 'UTF-8-SIG', 'confidence': 1.0, 'language': ''}
ThrawnCA commented 9 months ago

@duttonw I'm not sure how feasible it is to handle columns with the wrong number of commas, but completely blank rows are simple enough. Tabulator has built-in functionality to let us skip them.

duttonw commented 7 months ago

once https://github.com/qld-gov-au/ckanext-xloader/pull/90 reaches /ckan/ckanet-xloader this can be closed.

duttonw commented 6 months ago

resolved in https://github.com/qld-gov-au/ckanext-xloader/pull/90, it will get to ckan org version in due time.