ettorerizza opened this issue 6 years ago
I think a cap on text length would be reasonable.
Otherwise, one possible workaround is to do the fetching in Python and extract the interesting data in the same go (so, without storing the full response).
@wetneb I had thought of this possibility. After all, we often know in advance what we want to extract from the JSON / XML / HTML. The only case where storing the response is useful is when one wants to create several columns from the same response. The ideal would be the possibility to either store the response, or parse it on the fly with something like:
value.fetchUrl().parseJson()...
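A minimal Python sketch of that fetch-and-extract-in-one-go idea, where only the interesting field is kept and the full response is discarded immediately (the URL and field path below are hypothetical, not a real VIAF call):

```python
import json
import urllib.request

def extract_path(doc, path):
    """Walk a parsed JSON document along a list of keys/indexes."""
    for key in path:
        doc = doc[key]
    return doc

def fetch_and_extract(url, path):
    """Fetch a URL and keep only the interesting field,
    discarding the full response right away."""
    with urllib.request.urlopen(url) as resp:
        doc = json.load(resp)
    return extract_path(doc, path)

# Hypothetical usage (endpoint and field names are assumptions):
# name = fetch_and_extract("https://viaf.org/viaf/12345/viaf.json",
#                          ["mainHeadings", "data", 0, "text"])
```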
This sounds like a preference that we should set...and then if users want to override it...they update the preference.
@wetneb So... any thoughts on what a reasonable cap that we can set up in preferences.vt would be?
I'm not sure, maybe something like 1024 characters?
Capping the text might make it hard to suit every user's need.
Maybe the extraction can be done on the VIAF side. It would ease the burden on the OR side, though I am not familiar with the syntax.
VIAF in the OCLC API Explorer: https://platform.worldcat.org/api-explorer/VIAF
The SRU syntax used in the API SRUSearch function: http://www.loc.gov/standards/sru/
VIAF self-documentation: http://www.viaf.org/processed/search/processed There are "Record XPath:" and "Number of Records:" parameters which can be used for this purpose.
Yes if there is a text cap it should be large by default and configurable by the user.
On the VIAF side, if you use the SRU interface you can limit the number of records returned, but I don't think that is the problem here. The issue is that when you request a single record that has many alternative name representations etc., you get a very large chunk of JSON, and that can't be limited through the API.
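To illustrate the distinction: `maximumRecords` is a standard SRU request parameter, so an SRU query can cap how many records come back, but not the size of any single record. A sketch of building such a query URL (the endpoint path and CQL query below are assumptions, not a verified VIAF call):

```python
from urllib.parse import urlencode

# Standard SRU searchRetrieve parameters; maximumRecords limits the
# record count, not the size of each record.
params = {
    "operation": "searchRetrieve",
    "version": "1.1",
    "query": 'local.personalNames all "Dubois"',  # hypothetical CQL query
    "maximumRecords": "5",
}
url = "https://viaf.org/viaf/search?" + urlencode(params)
```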
I'm in favour of a user-configurable cap on the amount of text shown in a cell, with a couple of caveats:
Hello all. It looks like OR 3 has done the exact opposite of what has been discussed here: the HTML extracted from URLs is now pretty-printed, which greatly increases the weight of the page. Scraping 100 URLs, or even 10, has become a pain: OR crashes or slows down.
@ettorerizza I was not even aware that we had added pretty-printing there…
Until the problem is solved, here is the best workaround I have found so that the browser can cope with these tons of heavy HTML or JSON:
1. Fetch the URLs, storing the result in a column called, for example, "HTML_RESULT".
2. Immediately click "View -> Collapse this column".
3. Extract the HTML elements that interest you into an empty column by using cells['HTML_RESULT'].value.parseHtml().select(<YOUR JSOUP SELECTOR>) (or cells['JSON_RESULT'].value.parseJson().pathtotheelement, of course).
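For readers doing the same extraction outside OR: a rough stdlib-only Python analogue of that `parseHtml().select(...)` step, using `html.parser` with a plain tag name instead of a full jsoup selector (so this is a simplified stand-in, not the GREL behaviour):

```python
from html.parser import HTMLParser

class TagTextExtractor(HTMLParser):
    """Collect the text content of every tag with a given name --
    a crude stand-in for a jsoup selector like 'h1' or 'a'."""
    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0       # > 0 while inside a matching tag
        self.results = []
    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            self.results.append("")
    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1
    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data

def select_text(html, tag):
    """Return the text of every <tag> element in the HTML string."""
    parser = TagTextExtractor(tag)
    parser.feed(html)
    return parser.results
```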
@thadguidry, @ostephens, @weblate & @ettorerizza: how about we bring this to the next step:
This would only affect display.
@antoine2711 I'm generally happy with the idea but I have some concerns about the details:
Yes good questions, @ostephens.
- Should the max setting be configurable at the top OpenRefine level (applies to all projects)? And/or at Project level? And/or Column level?
For sure at the column level. Probably also at either project or host level (or both?! not very complex/time-consuming to code). But I would set the default app max at 5K, not 1K.
- I think we need to support setting the display to unlimited (rather than specify a large number of characters)
Even with showing 10 lines, working with large cells is very impractical. Here is a test with one cell containing a 1000-digit number and another cell with 50 rows of 100-digit numbers. This led me to think that a feature request for a maximum column width is going to be the next step…
It's not really workable, I would say. I had columns with far less data, and I ended up deleting them for convenience.
- I think it should be possible to set display limits across multiple columns, or all columns in a project easily
Well, if we implement a project default that gets set at the creation/import step, that would amount to the same thing, with an added override option. I could easily imagine someone wanting different limits on several columns in the same project.
Regards, Antoine
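The override chain discussed above (app default of 5K, overridable per project and per column, plus an "unlimited" setting) could be sketched like this. All names here are hypothetical, and using 0 to mean "unlimited" is just one possible design:

```python
APP_DEFAULT_MAX = 5000  # the 5K app-level default suggested above

def effective_cap(column_cap=None, project_cap=None,
                  app_cap=APP_DEFAULT_MAX):
    """Resolve the display cap: the column setting overrides the
    project setting, which overrides the app default. A cap of 0
    means 'unlimited' and resolves to None (no truncation)."""
    for cap in (column_cap, project_cap, app_cap):
        if cap is not None:
            return None if cap == 0 else cap
    return None
```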
👍 to allow setting on a per column basis, easily.
Looks like OR 3 has taken the exact opposite of what has been discussed here: the HTML extracted from URLs is now pretty-printed, which greatly increases the weight of the page.
Does anyone know where/when this change was made?
From a practical point of view, thousands of characters displayed for a cell or transformation preview provide no benefit to the user. I'd be fine with truncating them by default, perhaps providing an ellipsis (...) or some other affordance that they could use to display all.
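A sketch of that truncate-for-display idea, with the ellipsis affordance (the 1024-character default matches the figure floated earlier in the thread; the function name is made up):

```python
def truncate_for_display(text, max_chars=1024, ellipsis="..."):
    """Truncate cell text for display only -- the full value would
    stay untouched in the project data. The ellipsis signals that
    there is more to show."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars - len(ellipsis)] + ellipsis
```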
If you click on this VIAF API URL, even your browser will have trouble displaying the returned JSON. Imagine, then, when you have 100 of them and use "Add column by fetching URLs" in OpenRefine... Even with only two URLs, the whole interface slows down considerably. This is particularly the case in the transformation window: the preview takes forever to appear and everything is frozen.
The same problem occurs when extracting the source code of a large web page for scraping purposes. I wonder if there would be a way to display in Refine only the first lines of the JSON/XML or source code. After all, I doubt that users really read this tag soup. Instead, they probably use the web developer tools in their browser to identify the path of the elements they later extract with the parseJson() or parseHtml() functions.