I agree with using a limit that depends on the size of the dataset. On Monday we talked about storing the cache directly in the filesystem or S3, without using the model for attachments. In that case it would be convenient to at least store the size of the cache in the dataset, to avoid an extra request just to get the file size.
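A minimal sketch of that idea, assuming the filesystem variant and a hypothetical `cached_data_size` attribute on the dataset; all class, path, and attribute names here are invented, not Gobierto's actual API:

```ruby
# Hypothetical sketch: write the query-result cache straight to disk and
# record its byte size on the dataset, so the UI never needs an extra
# request just to learn the file size.
require "fileutils"

class DatasetCacheWriter
  CACHE_DIR = Rails.root.join("tmp", "dataset_cache")

  def initialize(dataset)
    @dataset = dataset
  end

  # csv_payload is the full default-query result serialized as CSV
  def write(csv_payload)
    FileUtils.mkdir_p(CACHE_DIR)
    path = CACHE_DIR.join("#{@dataset.id}.csv")
    File.write(path, csv_payload)

    # Persist the size alongside the dataset record (attribute name assumed).
    @dataset.update!(cached_data_size: File.size(path))
    path
  end
end
```

Recording the size at write time keeps it coupled to the cache itself: it can only go stale if the cached file does.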
OK, so we would:
What if, instead of checking the size in MB, we check the total rows? If nrows > 1000, then add the LIMIT.
I'd suggest storing the number of rows (or the size) in the dataset in a kind of metadata field (it could be a jsonb). In the future this field could also store the result of the scrutinizer (and, if I'm not wrong, the number of rows was one of the results the scrutinizer produced). This field should be updated together with the cache.
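A sketch of how that could look, assuming a new jsonb column on the datasets table; the column, table, and method names are placeholders, not the real schema:

```ruby
# Hypothetical migration adding the jsonb metadata field.
class AddDataSummaryToDatasets < ActiveRecord::Migration[6.0]
  def change
    add_column :datasets, :data_summary, :jsonb, default: {}, null: false
  end
end

class Dataset < ApplicationRecord
  # Called from the same job that refreshes the cache, so the summary and the
  # cached file never drift apart.
  def update_data_summary!(rows:, size_bytes:, scrutinizer_result: {})
    update!(
      data_summary: data_summary.merge(
        "rows" => rows,
        "size_bytes" => size_bytes,
        "scrutinizer" => scrutinizer_result,
        "updated_at" => Time.current.iso8601
      )
    )
  end
end
```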
> What if, instead of checking the size in MB, we check the total rows? If nrows > 1000, then add the LIMIT.
What matters to us is the size of the data being transferred, and rows might not reflect that with precision.
> I'd suggest storing the number of rows (or the size) in the dataset in a kind of metadata field (it could be a jsonb). In the future this field could also store the result of the scrutinizer (and, if I'm not wrong, the number of rows was one of the results the scrutinizer produced). This field should be updated together with the cache.
Oh yes, I forgot about the scrutinizer.
Now that we have the cache for the default dataset, we can proceed with this change, right?
Yes
@entantoencuanto do we already have the cache.size or similar in the API response?
Not yet, I'm going to create an issue for that
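For illustration, the dataset API response could then expose the cache size read from the metadata field, along the lines of this hypothetical helper (field names carried over from the earlier sketches, not the real serializer):

```ruby
# Hypothetical shape of the dataset API payload, including the cache size so
# the frontend can decide whether to fetch the full cached file.
def dataset_api_payload(dataset)
  {
    id: dataset.id,
    name: dataset.name,
    cache: {
      size_bytes: dataset.data_summary["size_bytes"],
      rows: dataset.data_summary["rows"]
    }
  }
end
```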
Ping
The other day, while making a demo, I found myself modifying the default query (removing the LIMIT 50 and executing it again), because when you go to browse the data or build a visualization, the LIMIT 50 makes no sense.
This got me thinking: if we are going to cache the default dataset (which equals the default query without the LIMIT), and if the dataset is not too big (so we don't make the user download lots of data), maybe it makes sense to remove the LIMIT 50.
For example, if the complete dataset is less than XX MB, we can remove the LIMIT 50, so you already have the complete dataset and can start visualizing it. If it's bigger, we keep the LIMIT 50.
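As a sketch, that decision could hang off the stored cache size, with the threshold ("XX MB") left as a constant to be tuned; all names and the 5 MB value below are hypothetical:

```ruby
# Hypothetical threshold check: serve the whole dataset when the cached file
# is small, otherwise keep the LIMIT 50 preview.
DEFAULT_QUERY_LIMIT = 50
FULL_DATASET_MAX_BYTES = 5 * 1024 * 1024 # placeholder for the "XX MB" threshold

def default_query_for(dataset)
  base = "SELECT * FROM #{dataset.table_name}" # table_name assumed trusted here
  size = dataset.data_summary["size_bytes"].to_i

  if size.positive? && size <= FULL_DATASET_MAX_BYTES
    base # small dataset: no LIMIT, visualizations see all rows
  else
    "#{base} LIMIT #{DEFAULT_QUERY_LIMIT}" # big or unknown size: keep the preview
  end
end
```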
Also, if we keep the LIMIT 50, I think we should warn the user (e.g. "This is a preview of the data. Download it or modify the default query to work with all the data").
And if we do remove the LIMIT 50, we should change the Perspective call to use the cached URL.
Thoughts?