IMCR-Hackathon / datapie

Data Package Interface for Evaluation ("Easy as pie!")
https://imcr-hackathon.github.io/datapie/
MIT License

how to deal with large (many observations) datasets #44

Open atn38 opened 5 years ago

atn38 commented 5 years ago

With a data table of ~400k observations and 60 variables, the complete static report takes upwards of 10 minutes to generate. Does the dynamic plotting functionality have the same problem? Could we do something to reduce the load for large datasets, such as randomly sampling the dataset and generating the report from that sample?

CoastalPlainSoils commented 5 years ago

Hmmm, good question. I have no idea. However, if it takes that long to complete, there should definitely be something to let the user know the site is processing the request. If possible, could the app also give an estimated time frame for completion?!

clnsmth commented 5 years ago

I'm moving the conversation from #11 here.

clnsmth commented 5 years ago

I suggest this be an optional argument rather than an arbitrary limit on the number of rows that can be read in. If performance and wait times are a concern for users, we could address this by giving the user a status bar (which has proven difficult to do) or by informing the user of limitations through an expectation matrix (suggested here) in the package documentation.

clnsmth commented 5 years ago

@atn38, since the UI team has figured out how to return messages from a function to the GUI, you could add messages to each static report function to inform the user of status.

Alternatively, as @CoastalPlainSoils suggests, you may be able to create a progress bar using the progress package.
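
To make the status-message idea concrete, here is a minimal sketch of a runner that reports before and after each report step. datapie itself is an R package, so this Python version only illustrates the logic; `run_report_steps`, the `notify` callback, and the step names are hypothetical and not part of datapie or the progress package.

```python
import time


def run_report_steps(steps, notify=print):
    """Run (name, function) pairs in order, reporting status around each one.

    `notify` is any callable that accepts a string. It stands in for whatever
    mechanism passes messages to the GUI or console (an assumption here).
    """
    results = {}
    total = len(steps)
    start = time.monotonic()
    for i, (name, func) in enumerate(steps, start=1):
        notify(f"[{i}/{total}] {name}: started")
        results[name] = func()
        elapsed = time.monotonic() - start
        notify(f"[{i}/{total}] {name}: done ({elapsed:.1f} s elapsed)")
    return results
```

Passing `notify=messages.append` collects the messages in a list instead of printing them, which is one way a GUI layer could pick them up. This does not estimate the remaining time; it only tells the user which step is running and how long the report has taken so far.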

wetlandscapes commented 5 years ago

I kind of like the idea of being able to randomly sample a large data set. In that context, some useful options would be:

  1. Specify the % of the dataset (rows) to be explored. The report would indicate how many rows the sample returned.
  2. Set a seed. This would allow someone to generate the same report twice.
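
The two options above could be sketched roughly as follows. datapie is written in R, so this Python version is only an illustration of the logic; the function name `sample_rows` and its parameters are hypothetical.

```python
import random


def sample_rows(rows, percent=100.0, seed=None):
    """Return a random subset of `rows` and the number of rows kept.

    percent: share of rows to keep, in the range (0, 100].
    seed:    optional seed, so the same sample (and report) can be reproduced.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    n_total = len(rows)
    # Keep at least one row when the dataset is not empty.
    n_keep = max(1, round(n_total * percent / 100)) if n_total else 0
    rng = random.Random(seed)
    # Sample indices without replacement, then restore the original row order.
    keep = sorted(rng.sample(range(n_total), n_keep))
    return [rows[i] for i in keep], n_keep
```

With 400,000 rows and `percent=10`, this keeps 40,000 rows, and calling it twice with the same seed returns the same rows. The default of `percent=100` leaves the data unchanged, which keeps sampling optional rather than imposing a row limit.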

sheilasaia commented 5 years ago

add printing to console for report status on data summary tab. @wetlandscapes will give this a go!

sheilasaia commented 5 years ago

can i also add that we might want to limit the size of the download to someone's computer too? for example, warn them (and maybe stop download) if they're about to download a huge .shp file.
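
A size check along these lines could run before any download starts. This is a sketch only, written in Python for illustration although datapie is an R package; `check_download_size` and both thresholds (100 MB to warn, 1000 MB to stop) are hypothetical values, not project decisions.

```python
def check_download_size(size_bytes, warn_mb=100, block_mb=1000):
    """Classify a pending download by size.

    Returns (action, message), where action is "ok", "warn", or "block".
    The default thresholds are placeholders chosen for illustration.
    """
    size_mb = size_bytes / 1e6
    if size_mb >= block_mb:
        return "block", f"File is {size_mb:.0f} MB; download stopped."
    if size_mb >= warn_mb:
        return "warn", f"File is {size_mb:.0f} MB; this may take a while."
    return "ok", f"File is {size_mb:.0f} MB."
```

The caller would show the message in the GUI and only proceed when the action is "ok", or when the user confirms after a "warn".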

clnsmth commented 5 years ago

I suggest the random sampling and warnings become enhancements to be implemented after the production release. Until then, file size issues can be communicated in the GUI messages and project docs. Note: A user will have to find a data package to use with datapie in DataONE first, where the file size information is clearly presented.