HEPData / hepdata

Repository for main HEPData web application
https://hepdata.net
GNU General Public License v2.0

records: check size of data tables and additional resource text files #136

Closed: GraemeWatt closed this issue 10 months ago

GraemeWatt commented 6 years ago

For very large tables, possibly containing thousands of rows, only the first 50 rows are currently displayed and the user needs to click to see all rows. However, all rows are initially loaded by the browser and plotted, which can result in a long delay (a minute or more) when first loading a record and when switching between tables. Instead, only the first 50 rows should be loaded and plotted initially; the user would then need to click to load all rows and display them in the table and on the generated plot.

In more detail, get_table_details would first call generate_table_structure to return only the first 50 rows, which in turn would call process_independent_variables and process_dependent_variables for only the first 50 rows. Only those rows would then be rendered in the table and plot. If the user clicked "Show All values", the web page would be reloaded and get_table_details would load all rows of the table.
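As a rough illustration of that flow (a sketch only: the names build_table_structure, truncate_variables and row_limit are hypothetical, not the actual HEPData signatures), the table-building code could accept an optional row limit, with get_table_details passing None when the user asks for all values:

```python
# Hypothetical sketch of threading a row limit through the table-building helpers.

DEFAULT_ROW_LIMIT = 50  # matches the 50 rows currently shown initially


def truncate_variables(variables, limit):
    """Return copies of the variables with each 'values' list cut to at most `limit` rows."""
    if limit is None:
        return variables
    return [
        dict(variable, values=variable.get("values", [])[:limit])
        for variable in variables
    ]


def build_table_structure(table_doc, row_limit=DEFAULT_ROW_LIMIT):
    """Illustrative stand-in for generate_table_structure with an optional row limit."""
    return {
        "independent_variables": truncate_variables(
            table_doc.get("independent_variables", []), row_limit),
        "dependent_variables": truncate_variables(
            table_doc.get("dependent_variables", []), row_limit),
        # Tells the template to offer a "Show All values" control when a limit was applied.
        "truncated": row_limit is not None,
    }
```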

Alternatively, don't load, display or visualise a table by default, but only when the user clicks a button. Or, more simply, impose a maximum cut-off on the number of rows for a table to be rendered in a browser.

GraemeWatt commented 1 year ago

Rather than only loading the first 50 rows of a data table, a simpler approach would be to check the file size before loading a data table. If it is greater than some (configurable) value like 1 MB, the independent_variables and dependent_variables could be set to empty lists [] and a message displayed like: "This is a large data table (x.y MB). Do you want to display it?". If the user clicked "Yes" the data table would be loaded in the usual way. YAML data files are now restricted to be less than 10 MB as part of the validation, but this was not always the case. Some examples of large data tables:

https://www.hepdata.net/record/ins1798511?version=1&table=Table%2022-23%20statistical%20correlations
https://www.hepdata.net/record/ins1630886?version=3&table=Table%204
https://www.hepdata.net/record/ins1740909?version=2 (Tables 185 to 188)
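A minimal sketch of the size check described above, assuming the table is read from a YAML file on disk; load_data_table and force_load are hypothetical names, while SIZE_LOAD_CHECK_THRESHOLD is the configurable setting discussed later in this issue:

```python
import os

import yaml

SIZE_LOAD_CHECK_THRESHOLD = 1048576  # configurable cut-off of about 1 MB


def load_data_table(file_path, force_load=False):
    """Return the parsed table, or an empty stub with a prompt if the file is too large."""
    size = os.path.getsize(file_path)
    if size > SIZE_LOAD_CHECK_THRESHOLD and not force_load:
        return {
            "independent_variables": [],
            "dependent_variables": [],
            "size_message": "This is a large data table ({:.1f} MB). "
                            "Do you want to display it?".format(size / 1048576.0),
        }
    with open(file_path) as data_file:
        return yaml.safe_load(data_file)
```

If the user clicked "Yes", the same function would be called again with force_load=True to load the table in the usual way.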

A related problem is with large text files attached as additional resources, either to a record or to an individual table, which are rendered in a browser via resource_details.html. Here, no restriction on size is made as part of the validation. A check should be made on the file size, and if it is greater than some (configurable) value like 1 MB, the file should only be made available for download rather than rendered in a browser. Examples:

https://www.hepdata.net/record/ins2013051?version=1 (large YAML files attached as additional publication resources)
https://www.hepdata.net/record/ins2077557?version=1 (HistFactory JSON files attached as additional publication resources)
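A similar sketch for additional resources, assuming a Flask view that renders resource_details.html; the view name, the MAX_INLINE_RESOURCE_SIZE setting, the download_only flag and the resource.file_location attribute are assumptions for illustration:

```python
import os

from flask import render_template

MAX_INLINE_RESOURCE_SIZE = 1048576  # configurable cut-off of about 1 MB (assumed name)


def view_resource(resource):
    """Render the resource contents inline only if small enough, otherwise download-only."""
    size = os.path.getsize(resource.file_location)
    too_large = size > MAX_INLINE_RESOURCE_SIZE
    contents = None
    if not too_large:
        with open(resource.file_location) as resource_file:
            contents = resource_file.read()
    return render_template(
        "resource_details.html",
        resource=resource,
        contents=contents,        # None when the file is too large to render
        download_only=too_large,  # template would show only a download link in this case
    )
```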

GraemeWatt commented 11 months ago

@ItIsJordan: The first table (1.6 MB) of https://www.hepdata.net/record/ins1869138?version=2 was reported by a user as causing problems due to its large size, so it might provide a good example for testing while not being excessively large.

We agreed last time we met in my office to change the limit SIZE_LOAD_CHECK_THRESHOLD = 1000000 to the binary equivalent of one megabyte, SIZE_LOAD_CHECK_THRESHOLD = 1048576. For testing purposes, you could temporarily reduce this value to enable testing with a reasonably small data table.

Instead of "This table is too large to load automatically.", I prefer my suggested text from the comment above: "This is a large data table (x.y MB). Do you want to display it?" where x.y MB is replaced by the table size.

I forgot that there's a second part of this issue to suppress display of large additional resource text files. See the second paragraph of the comment above beginning "A related problem....".