DataShades / ckanext-ingest

GNU Affero General Public License v3.0
0 stars 1 forks source link

Included ingstion page works but throws errors #2

Open canon-cmre-kym-eden opened 1 month ago

canon-cmre-kym-eden commented 1 month ago

Preamble

While trying to figure out implementing my own ingesters I came across some issues, and updated with fixes.

Initial problem

https://github.com/DataShades/ckanext-ingest/blob/81dd3a0221fe336454b7def3e32012e101bdaa68/ckanext/ingest/views.py#L42-L44 The result returned by ingest_import_records({}, data) is {'fail': 0, 'success': 1} not a list of ids, consequently the unpack on line 44 returns the keys fail and success and the call to package_show({}, {"id": "fail"}) crashes out with "package not found".

Looking back at ae63b57, the function that became ingest_import_records, did return a list of ids in the past.

Edit

The success/fail count is recorded in https://github.com/DataShades/ckanext-ingest/blob/81dd3a0221fe336454b7def3e32012e101bdaa68/ckanext/ingest/artifact.py#L63 but the function calls are passed data (with ids) which is discarded. But switching https://github.com/DataShades/ckanext-ingest/blob/81dd3a0221fe336454b7def3e32012e101bdaa68/ckanext/ingest/logic/action.py#L91 to "details" means the rest could be unpacked in the view as expected.

canon-cmre-kym-eden commented 1 month ago

To solve the default behaviour replace https://github.com/DataShades/ckanext-ingest/blob/81dd3a0221fe336454b7def3e32012e101bdaa68/ckanext/ingest/views.py#L44-L52 with

if data['report'] == 'details':
  for ingested in result:
    pkg = tk.get_action("package_show")({}, ingested['result']['result'])
    tk.h.flash_success(
      "Success: <a href='{url}'>{title}</a>".format(
      title=pkg["title"],
      url=tk.h.url_for(pkg["type"] + ".read", id=pkg["name"]),
    ),
    True,
  )
else:
  tk.h.flash_success("Success: {success}</br>Failed: {fail}".format(**result), True)

But a more elegant solution would be to pass the Artifacts object out of ingest_import_records and switch on type, or better yet, let each implementation of it return a string representation for consumption by flash_success.

canon-cmre-kym-eden commented 1 month ago

Out of date documentation? https://github.com/DataShades/ckanext-ingest/blob/81dd3a0221fe336454b7def3e32012e101bdaa68/README.md?plain=1#L128-L132 Should this return IngestionResult instead of dict?

canon-cmre-kym-eden commented 1 month ago

https://github.com/DataShades/ckanext-ingest/blob/81dd3a0221fe336454b7def3e32012e101bdaa68/ckanext/ingest/views.py#L40 CKAN parse_params function is not compatible with the hierarchical StrategyOptions class. https://github.com/DataShades/ckanext-ingest/blob/81dd3a0221fe336454b7def3e32012e101bdaa68/ckanext/ingest/shared.py#L23-L53 It's not possible to pass the nested update_existing option this way, let alone the the form structure.

Edit

I've managed to get it working fine by replacing parse_params with dict, and packing the form parameters correctly (e.g. data['options'] = {'record_options': {'update_existing': data.get('update_existing', False)}})

smotornyuk commented 1 month ago

Thanks, good catch.

This UI page is experimental and will change in the future. I'd recommend to make a copy and adapt it to your needs.

But it shouldn't be so broken, indeed. I adopted your suggestions.

  1. Passing artifacts out of the action is against CKAN's ideas of API actions. But we can initialize artifacts inside the view and pass them into action. Now it's possible to pass initialized artifacts object instead of string stats|details|tmp into ingest_import_records. It also means that you can even implement your version of Artifacts and use it instead of ones, available out of the box.
  2. I did not add a method that renders content for flash messages. I'm thinking about formalizing and documenting the Artifacts structure and documenting it so that something for flash messages may be added in the future. For now, I'll keep this ugly section in the view. BTW, this is still far from ideal from the safety point of view: ingestion process can create something other from datasets, so blindly using package_show is pretty dangerous. But, as I mentioned, this view is experimental and undocumented, so we can afford this risk for now.
  3. Should this return IngestionResult instead of dict?

    IngestionResult IS a dict, so it makes no difference. But I'm thinking about switching to dataclasses in v2 of the extension, so it has sense to call IngestionResult(...) to reduce amount of changes in the future.

  4. Yes, assigning update_existing manually totally makes sense.
canon-cmre-kym-eden commented 1 month ago

Thanks for picking this up. Sorry it's a bit of a stream of consciousness. I wanted to progress but also come back and fix it later. I figured the page was an example from which to build ones own ingestion pipeline, and it was very helpful to have, even in this state.

I like your solution to not being able to pass the types out of the action. Much easier to synchronise the result data with the view. I hadn't considered the risks of rendering the package details. I mainly wanted the details for debugging and wonder if it's worth adding a check for debug/dev mode, so it only renders the details in non-production deployments?

The reason I asked about using explicit type rather than dict was that I seemed to be getting multiple nested levels of {'success': True, 'result': {'success': True, 'result': {...}}, and I thought switching to explicit type, meant it unpacked rather than add the dict into the result of the next level. I'm now pretty sure this isn't the case so, sorry about that, but it would be great switch to types in the future.

Thanks for the feedback and glad you're keeping this plug-in alive. So many useful ones are 9 years abandoned.