ckan / ckanext-harvest

Remote harvesting extension for CKAN
130 stars 203 forks source link

Harvest info injection into package info not completing #433

Open jbrown-xentity opened 3 years ago

jbrown-xentity commented 3 years ago

The ckanext-harvest extension has an after_show function that first removes and then inserts the harvest values harvest_object_id, harvest_source_id, and harvest_source_title. This is also implemented in the before_view function (only adding the harvest info). In implementation, this information never comes through because of this if statement. This context.get('validate') always returns None, and this if fails.

I can find no documentation on this validate key inside context, but also cannot find any usage/implementation of it in CKAN other than in this code base. Is it possible this has been deprecated? Should this be changed to reflect similar logic to the before_show?

I know that CKAN suggests to call package_show before calling package_update, but the base harvester does not implement this, and many of data.gov's harvester extensions do not either...

I have tested that changing the if statement to be if harvest_object and not context.get('validate', False): inserts the information correctly. Would this be the right fix? Should I submit a PR?

jbrown-xentity commented 3 years ago

The Failing test PR (#434 ) is a failing test related to this issue. I would expect this test to pass. Still not sure if the right approach is to edit the after_show function, or to let before_view actually run. Not sure why before_run is not running on CKAN >2.1, but probably some missing context...

jbrown-xentity commented 3 years ago

I made updates to the test so that it would pass for our suggested fix; remove the dataset after_show functionality of removal and adding data, and let the "for_view" flag call the correct function and insert the data if it really is only for viewing. Would love some feedback on this; still think there is missing context on why the before_view function has been ignored since CKAN2.2...

amercader commented 3 years ago

Sorry @jbrown-xentity, this is probably way more confusing that it should be. Let's step back a second. I believe that the goal here is to have harvest_object_id, harvest_source_id, and harvest_source_title indexed (ie in Solr). This will include the fields wherever the dataset dict is requested with package_show (API, templates etc). To achieve this we would only need to add them to the dict in the before_index() hook so they get indexed by Solr, and we could remove this logic from after_show , before_show and before_view.

Would you mind trying this simpler approach and see if it works as expected?