OpenRefine / CommonsExtension

An OpenRefine extension that helps with Wikimedia Commons editing: start projects from Wikimedia Commons categories; Commons-specific GREL functions.
BSD 3-Clause "New" or "Revised" License

Bug: extractFromTemplate and value.extractCategories GREL functions produce empty columns #61

Open trnstlntk opened 2 years ago

trnstlntk commented 2 years ago

I have been trying the extractFromTemplate and value.extractCategories GREL functions in various projects. Both work well in the GREL preview dialog window:

[Two screenshots of the GREL preview dialog]

But after clicking OK, both produce an empty column in the project itself. I haven't been able to get this to work in any project so far. For testing purposes, here's a project in which it went wrong: Barbalissos.openrefine.tar.gz

wetneb commented 2 years ago

This is because your expression returns an array, not a single value, and arrays are silently discarded when creating columns out of expressions: https://github.com/OpenRefine/OpenRefine/issues/1088

wetneb commented 2 years ago

(see also https://github.com/OpenRefine/OpenRefine/issues/4823, which would be one of my preferred ways to improve this)

Concretely, what you can do on your side is use extractFromTemplate(value, "Information", "Description")[0].
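
To make the behavior concrete, here is a rough Python model of it. This is not OpenRefine's actual code: the function name store_in_cell and the sample description are made up for illustration, and the "discard" rule is only an assumption that mirrors the issue linked above.

```python
from typing import Any, Optional


def store_in_cell(result: Any) -> Optional[Any]:
    """Hypothetical stand-in for creating a column cell from an expression result."""
    if isinstance(result, (list, tuple)):
        # Assumption: array results are silently dropped, leaving the cell empty.
        return None
    return result


# Made-up result of an extraction that found one matching template parameter.
descriptions = ["Ruins of Barbalissos"]

empty_cell = store_in_cell(descriptions)      # the whole array: nothing is stored
filled_cell = store_in_cell(descriptions[0])  # first element only: a single value
```

Indexing with [0] turns the array into a single value, which is why the preview looks fine in both cases but only the indexed expression fills the column.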

wetneb commented 2 years ago

Or we could decide that this extractFromTemplate function should not return an array, but only its first result. That makes it impossible to fetch the other results in cases where there is more than one match, so from a programmer's perspective it is a bit disappointing, but perhaps you want to prioritize having a simpler expression.

You could do the same for extractCategories, and it would only return the first category of the page. That sounds even worse than for extractFromTemplate, since files routinely have multiple categories and there is no reason why the first one should be more interesting than the others. So intuitively it is worth explaining to users that arrays exist and how to deal with them, but that's my very biased programmer perspective :-P

trnstlntk commented 1 year ago

I'm finally getting around to documenting this. I will go for the pragmatic approach, providing end users with easy-to-reuse recipes, as I'm noticing that onboarding / learning the whole OpenRefine workflow is already pretty challenging for average Wikimedians.

As an exercise, I tried to come up with a workaround myself that would be helpful for others too, but I'm not sure yet whether I found the smartest solution. Is something like value.extractCategories()[0,10].toString() a decent workaround, or would you recommend something even nicer? (The 0-10 is to catch a lot of values, and the toString is to circumvent the 'OpenRefine won't do arrays in cells' issue.)

wetneb commented 1 year ago

I would rather recommend something like value.extractCategories().join('#'), which joins the categories with a # symbol between them, for example Category:Art#Category:Spain#Category:Blue. Users can then easily split those values into multiple cells / columns using the corresponding functions in OpenRefine.
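
The join-then-split round trip can be sketched in Python as follows. This is again only a model, not OpenRefine code: the helper names are hypothetical, the category names are the ones from the example above, and it assumes the separator never occurs inside a category name.

```python
from typing import List

SEPARATOR = "#"  # assumption: this character does not appear in any category name


def join_categories(categories: List[str]) -> str:
    """Collapse a list of categories into one string that fits in a single cell."""
    return SEPARATOR.join(categories)


def split_categories(cell: str) -> List[str]:
    """Recover the individual categories from a joined cell value."""
    return cell.split(SEPARATOR) if cell else []


categories = ["Category:Art", "Category:Spain", "Category:Blue"]
cell = join_categories(categories)  # one string, so it can be stored in a cell
restored = split_categories(cell)   # the original list, recovered by splitting
```

Unlike taking only the first element, joining keeps every category, and the separator makes the later split unambiguous.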