fjelltopp / zarr-ckan

The Government of Zambia Ministry of Finance and National Planning project to establish a Zambia Evaluation and Research Repository (ZaRR) based on CKAN (funded by UNICEF).

The metadata and the resources in the repository can be copied or migrated to other systems #48

Open ChasNelson1990 opened 3 months ago

ChasNelson1990 commented 3 months ago

The metadata and the resources in the repository can be copied or migrated to other systems

A-Souhei commented 2 months ago

Maybe someone has already done this in another Fjelltopp CKAN project?

ChasNelson1990 commented 2 months ago

@A-Souhei - perhaps have a read of this: https://ngr.coar-repositories.org/behaviour/resource-transfer/

I think there could be an argument that the datastore API does this for CSV-style data? What about textual data though - is there an existing solution?

A-Souhei commented 2 months ago

@ChasNelson1990 based on your comment, if I understand correctly, my idea would be:

  1. On resource upload, extract the text (if the resource is textual data)
  2. Index the extracted text in Solr
  3. Allow full-text search by using the CKAN API to query Solr

Also, could integrating a small LLM help translate the text to return better or more results? (Just a thought, it may not be a good idea.)

Otherwise, we can use the CKAN API to query resources and resource metadata: https://docs.ckan.org/en/latest/api/legacy-api.html
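As a rough illustration of querying dataset and resource metadata over the CKAN API, here is a minimal Python sketch. It assumes the version 3 action API endpoint `package_search` rather than the legacy API linked above, and the function names (`build_search_url`, `parse_search_response`, `search_datasets`) and any base URL passed in are hypothetical, chosen only for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url, query, rows=10, start=0):
    """Build a package_search URL for a CKAN site (action API v3)."""
    params = urlencode({"q": query, "rows": rows, "start": start})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def parse_search_response(body):
    """Return the list of dataset dicts from a package_search JSON body."""
    payload = json.loads(body)
    if not payload.get("success"):
        raise RuntimeError(f"CKAN search failed: {payload.get('error')}")
    # Each dataset dict carries its metadata and a "resources" list.
    return payload["result"]["results"]


def search_datasets(base_url, query, rows=10):
    """Query a live CKAN site; needs network access to base_url."""
    with urlopen(build_search_url(base_url, query, rows)) as response:
        return parse_search_response(response.read().decode("utf-8"))
```

Paging through all datasets with `start` and `rows` would be one way to copy metadata to another system. Whether the text extracted from uploaded files is searchable this way depends on how the extension indexes it in Solr.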

ChasNelson1990 commented 2 months ago

Has anybody made an existing ckanext that processes text files like that?

The LLM is a good idea too - but too big for ZaRR

A-Souhei commented 2 months ago

@ChasNelson1990 https://github.com/stadt-karlsruhe/ckanext-extractor looks promising.

ChasNelson1990 commented 2 months ago

@A-Souhei - that extension is quite old (even the language they are using is really old CKAN stuff). There is a fork here where somebody has updated it for CKAN 2.9, but it may not work for 2.10.

If you wanted to, I would spend 1-2 hours installing it, adding it to the local dev config and just uploading one file to see if it works. If it looks like it works then great, we can invest some time in making sure it's up to date... but if you can't get it working in an hour then we should discuss further whether this is the right solution.

A-Souhei commented 2 months ago

Alright, I'll create a ticket for it.

ChasNelson1990 commented 2 months ago

Just use this ticket @A-Souhei

A-Souhei commented 1 month ago

@ChasNelson1990 I was able to install the plugin without issue, but running the extraction with the command ckan -c /etc/ckan/production.ini extractor extract all fails with an error raised from https://github.com/dathere/ckanext-extractor/blob/master/ckanext/extractor/cli.py . The stack trace is:

2024-08-14 09:33:43,978 INFO  [ckanext.extractor.cli] Extraction started ...
0414096d-6160-4c69-b5f7-2a9d8ad1d40c: Traceback (most recent call last):
  File "/usr/lib/ckan/.minikubevenv/ckan-ALitmJXH/lib/python3.8/site-packages/ckan/lib/navl/dictization_functions.py", line 246, in convert
    nargs = converter.__code__.co_argcount
AttributeError: type object 'str' has no attribute '__code__'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/ckan/venv/bin/ckan", line 8, in <module>
    sys.exit(ckan())
  File "/usr/lib/ckan/.minikubevenv/ckan-ALitmJXH/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/ckan/.minikubevenv/ckan-ALitmJXH/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/lib/ckan/.minikubevenv/ckan-ALitmJXH/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/ckan/.minikubevenv/ckan-ALitmJXH/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/ckan/.minikubevenv/ckan-ALitmJXH/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/ckan/.minikubevenv/ckan-ALitmJXH/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/lib/ckan/.minikubevenv/ckan-ALitmJXH/lib/python3.8/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/ckan/.minikubevenv/ckan-ALitmJXH/src/ckanext-extractor/ckanext/extractor/cli.py", line 78, in extract
    result = extract(context, {'id': id, 'force': force})
  File "/usr/lib/ckan/.minikubevenv/ckan-ALitmJXH/lib/python3.8/site-packages/ckan/logic/__init__.py", line 580, in wrapped
    result = _action(context, data_dict, **kw)
  File "/usr/lib/ckan/.minikubevenv/ckan-ALitmJXH/src/ckanext-extractor/ckanext/extractor/logic/helpers.py", line 42, in wrapped
    return f(context, data_dict)
  File "/usr/lib/ckan/.minikubevenv/ckan-ALitmJXH/lib/python3.8/site-packages/ckan/logic/__init__.py", line 678, in wrapper
    data_dict, errors = _validate(data_dict, schema, context)
  File "/usr/lib/ckan/.minikubevenv/ckan-ALitmJXH/lib/python3.8/site-packages/ckan/lib/navl/dictization_functions.py", line 305, in validate
    flat_data, errors = _validate(flattened, schema, validators_context)
  File "/usr/lib/ckan/.minikubevenv/ckan-ALitmJXH/lib/python3.8/site-packages/ckan/lib/navl/dictization_functions.py", line 356, in _validate
    convert(converter, key, converted_data, errors, context)
  File "/usr/lib/ckan/.minikubevenv/ckan-ALitmJXH/lib/python3.8/site-packages/ckan/lib/navl/dictization_functions.py", line 248, in convert
    raise TypeError(
TypeError: str cannot be used as validator because it is not a user-defined function

I'll need more time to investigate.
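One possible reading of the trace, offered as a sketch and not a confirmed diagnosis: the validation step inspects `converter.__code__`, which only user-defined functions have, so passing the built-in `str` type as a validator fails. The snippet below reproduces that behaviour with a simplified stand-in for the dispatch logic. It is not CKAN's actual code, and `convert`, `to_text` and `try_convert` are hypothetical names used only here.

```python
def convert(converter, value):
    """Simplified stand-in for a validator dispatch (NOT CKAN's real code)."""
    try:
        # Built-in types such as str have no __code__ attribute.
        nargs = converter.__code__.co_argcount
    except AttributeError:
        raise TypeError(
            f"{converter.__name__} cannot be used as validator "
            "because it is not a user-defined function"
        )
    if nargs != 1:
        raise TypeError("this sketch only supports one-argument validators")
    return converter(value)


def to_text(value):
    """User-defined wrapper around str, so it does have __code__."""
    return str(value)


def try_convert(converter, value):
    """Return the converted value, or the error message as a string."""
    try:
        return convert(converter, value)
    except TypeError as exc:
        return f"TypeError: {exc}"


print(try_convert(str, 1))      # the built-in type is rejected
print(try_convert(to_text, 1))  # the wrapper function is accepted
```

If this reading is right, the schema used by the extension's extract action probably lists `str` (or an alias of it) as a validator, and replacing it with a function-based validator may be enough to get past this error. That still needs checking against the extension's source.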