OpenRefine / OpenRefine

OpenRefine is a free, open source power tool for working with messy data and improving it
https://openrefine.org/
BSD 3-Clause "New" or "Revised" License

Do not display the entire element returned by "fetch urls" #1440

Open ettorerizza opened 6 years ago

ettorerizza commented 6 years ago

If you click on this VIAF API URL, even your browser will have trouble displaying the returned JSON. Now imagine having 100 of them and using "Add column by fetching URLs" in OpenRefine... Even with only two URLs, the whole interface slows down considerably. This is particularly noticeable in the transformation window: the preview takes forever to appear and everything freezes.

screencast

The same problem occurs when extracting the source code of a large web page for scraping purposes. I wonder whether there is a way to display only the first lines of the JSON/XML or source code in Refine. After all, I doubt that users really read this tag soup. Instead, they most likely use the web developer tools in their browser to identify the path of the elements that they later extract with the parseJson() or parseHtml() functions.

wetneb commented 6 years ago

I think a cap on text length would be reasonable.

Otherwise, one possible workaround is to do the fetching in Python and extract the interesting data in the same go (so, without storing the full response).
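As a rough illustration of that workaround, here is a minimal Python sketch that fetches a JSON response and keeps only one value from it, so the full payload is never stored. The helper names (`read_url`, `dig`, `fetch_and_extract`) and the key path used in the usage note are hypothetical and are not taken from VIAF or OpenRefine; the response is assumed to be UTF-8 encoded JSON.

```python
import json
from typing import Any, Callable, Sequence, Union
from urllib.request import urlopen

PathStep = Union[str, int]


def read_url(url: str, timeout: float = 30.0) -> str:
    """Download a URL and return its body as text (assumes UTF-8)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def dig(document: Any, path: Sequence[PathStep], default: Any = None) -> Any:
    """Follow a sequence of dict keys and list indexes.

    Returns `default` as soon as a step cannot be followed.
    """
    current = document
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


def fetch_and_extract(
    url: str,
    path: Sequence[PathStep],
    fetch: Callable[[str], str] = read_url,
) -> Any:
    """Fetch a JSON document and keep only the value found at `path`.

    `fetch` is injectable so the function can be tried without network access.
    """
    return dig(json.loads(fetch(url)), path)
```

A call such as `fetch_and_extract(url, ["names", 0, "text"])` would then return a single short value per URL; the key path shown here is only a placeholder and has to be adapted to the real structure of the response.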

ettorerizza commented 6 years ago

@wetneb I had thought of this possibility. After all, we often know in advance what we want to extract from the JSON / XML / HTML. The only case where storing the response is useful is when one wants to create several columns from the same response. The ideal would be to have the option either to store the response or to parse it on the fly with something like:

value.fetchUrl().parseJson()...

thadguidry commented 6 years ago

This sounds like a preference that we should set; if users then want to override it, they update the preference.

@wetneb So, any thoughts on a reasonable cap that we could set up in preferences.vt?

wetneb commented 6 years ago

I'm not sure, maybe something like 1024 characters?

jackyq2015 commented 6 years ago

Capping the text might make it hard to suit users' needs.

Maybe the extraction can be done on the VIAF side. It would ease the burden on the OR side. I am not familiar with the syntax, though.

wetneb commented 6 years ago

Yes, if there is a text cap, it should be large by default and configurable by the user.

ostephens commented 6 years ago

On the VIAF side, if you use the SRU interface you can limit the number of records returned, but I don't think that is the problem here. The issue is that when you request a single record that has many alternative name representations etc., you get a very large chunk of JSON, which you can't limit through the API.

I'm in favour of a user-configurable cap on the amount of text shown in a cell, with a couple of caveats:

ettorerizza commented 6 years ago

Hello all. Looks like OR 3 has taken the exact opposite of what has been discussed here: the HTML extracted from URLs is now pretty-printed, which greatly increases the weight of the page. Now, scraping 100 URLs, or even 10, has become a pain: OR crashes or slows down.

wetneb commented 6 years ago

@ettorerizza I was not even aware that we had added pretty-printing there…

ettorerizza commented 5 years ago

Until the problem is solved, here is the best workaround I have found for helping the browser cope with these masses of heavy HTML or JSON:

1. Fetch the URLs and store the result in a column called, for example, "HTML_RESULT".

2. Immediately click on "View -> Collapse this column".

3. Extract the HTML elements that interest you into an empty column by using cells['HTML_RESULT'].value.parseHtml().select(<YOUR JSOUP SELECTOR>)

(or cells['JSON_RESULT'].value.parseJson().pathtotheelement, of course)
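For readers who do the extraction outside OpenRefine, the same "keep only what you need" idea can be sketched in Python with the standard library. This is not jsoup and does not support CSS selectors; it only matches elements by tag name. The class and function names (`TextByTag`, `extract_text`) are made up for this illustration.

```python
from html.parser import HTMLParser
from typing import List


class TextByTag(HTMLParser):
    """Collect the text content of every element with a given tag name."""

    def __init__(self, tag: str) -> None:
        super().__init__()
        self.tag = tag
        self.depth = 0  # greater than 0 while inside a matching element
        self.results: List[str] = []
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            if self.depth == 0:
                self._buffer = []  # start collecting a new element
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)


def extract_text(html: str, tag: str) -> List[str]:
    """Return the text of each outermost `tag` element found in `html`."""
    parser = TextByTag(tag)
    parser.feed(html)
    parser.close()
    return parser.results
```

Only the short extracted strings would then need to be stored in a column, rather than the full page source.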

antoine2711 commented 4 years ago

@thadguidry, @ostephens, @weblate & @ettorerizza: how about we bring this to the next step:

This would only affect display.

ostephens commented 4 years ago

@antoine2711 I'm generally happy with the idea but I have some concerns about the details:

antoine2711 commented 4 years ago

Yes, good questions, @ostephens.

  • Should the max setting be configurable at the top OpenRefine level (applies to all projects)? And/or at Project level? And/or Column level?

For sure at the column level. Probably also at either the project or the host level (or both; it would not be very complex or time-consuming to code). But I would set the default application maximum at 5K, not 1K.

  • I think we need to support setting the display to unlimited (rather than specify a large number of characters)

Even when showing only 10 lines, working with large cells is very impractical. Here is a test with one cell containing a 1000-digit number and another cell containing 50 rows of 100-digit numbers. This made me think that a feature request for a maximum column width is going to be the next step… [image]

It's not really workable, I would say. I have had columns with far less data, and I ended up deleting them for convenience.

  • I think it should be possible to set display limits across multiple columns, or all columns in a project easily

Well, if we implement a project default that gets set at the creation/import step, that would amount to the same thing, with an added override option. I can very easily imagine someone wanting different limits on several columns in the same project.

Regards, Antoine
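The precedence discussed above (a column override first, then a project default, then an application-wide default) could be sketched roughly as follows. Nothing here reflects actual OpenRefine code: the function name, the `UNLIMITED` sentinel, and the 5000-character default (taken from the "5K" suggestion) are assumptions made for illustration.

```python
from typing import Dict, Optional

UNLIMITED = None          # hypothetical sentinel: display the full cell
APP_DEFAULT_LIMIT = 5000  # assumed application default ("5K")
_UNSET = object()         # marks "no project-level setting"


def resolve_display_limit(
    column: str,
    column_limits: Dict[str, Optional[int]],
    project_limit=_UNSET,
    app_limit: Optional[int] = APP_DEFAULT_LIMIT,
) -> Optional[int]:
    """Pick the display limit for a column.

    A column-level setting wins over the project default, which in turn
    wins over the application default. A value of UNLIMITED (None) at any
    level means "do not truncate".
    """
    if column in column_limits:
        return column_limits[column]
    if project_limit is not _UNSET:
        return project_limit
    return app_limit
```

Using a separate `_UNSET` marker keeps "no setting at this level" distinct from "explicitly unlimited", which is what allows an unlimited display to be requested for a single column.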

thadguidry commented 4 years ago

👍 to allow setting on a per column basis, easily.

tfmorris commented 4 years ago

Looks like OR 3 has taken the exact opposite of what has been discussed here: the HTML extracted from URLs is now pretty-printed, which greatly increases the weight of the page.

Does anyone know where/when this change was made?

From a practical point of view, thousands of characters displayed for a cell or transformation preview provide no benefit to the user. I'd be fine with truncating them by default, perhaps providing an ellipsis (...) or some other affordance that they could use to display all.
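A display-only truncation of this kind might look like the sketch below. It is an illustration in Python, not OpenRefine's implementation; the function name and the 1024-character default (the figure floated earlier in the thread) are assumptions. The stored cell value is left untouched and only the string passed to the display layer is shortened.

```python
from typing import Optional

ELLIPSIS = "..."


def truncate_for_display(text: str, limit: Optional[int] = 1024) -> str:
    """Return at most `limit` characters of `text`, followed by an ellipsis.

    A `limit` of None means unlimited: the text is returned unchanged.
    Text that already fits within the limit is returned without an ellipsis.
    """
    if limit is None or len(text) <= limit:
        return text
    return text[:limit] + ELLIPSIS
```

The ellipsis doubles as the visual cue that more content exists; a real interface would presumably attach the "display all" action to it.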