PopulateTools / gobierto

Open source open government platform
https://gobierto.es
GNU Affero General Public License v3.0

Data / Dataset default LIMIT when dataset is not too big #3370

Closed: furilo closed this issue 3 years ago

furilo commented 4 years ago

The other day, making a demo, I found myself modifying the default query (removing the LIMIT 50 and executing it again), because when you go to browse the data or build a visualization the LIMIT 50 makes no sense.

This got me thinking: if we are going to cache the default dataset (which is the result of the default query without the LIMIT), and if the dataset is not too big (so we don't make the user download lots of data), maybe it makes sense to remove the LIMIT 50.

For example, if the complete dataset is less than XX MBs, we can remove the LIMIT 50 so you already have the complete dataset and can start visualizing it. If it's bigger, we keep the LIMIT 50.

Also, if we keep the LIMIT 50, I think we should warn the user (e.g. "This is a preview of the data. Download it or modify the default query to work with all the data").

And if we remove the LIMIT 50, we should change the Perspective call so it uses the cached URL.

Thoughts?
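
A rough sketch of the idea, in Ruby, with hypothetical names (MAX_FULL_DATASET_BYTES, cache_size_bytes, cached_data_url and default_query_url are placeholders, not the actual Gobierto API):

```ruby
# Hypothetical sketch: serve the full cached dataset when it is small enough,
# otherwise keep the LIMIT 50 and mark the response as a preview so the UI
# can show the warning above.
MAX_FULL_DATASET_BYTES = 10 * 1024 * 1024 # placeholder for the "XX MBs" threshold

def default_data_source(dataset)
  if dataset.cache_size_bytes && dataset.cache_size_bytes < MAX_FULL_DATASET_BYTES
    # Small dataset: Perspective loads the cached URL with the complete data
    { url: dataset.cached_data_url, preview: false }
  else
    # Big dataset: keep the default query with LIMIT 50 and warn the user
    { url: dataset.default_query_url, preview: true }
  end
end
```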

entantoencuanto commented 4 years ago

I agree with making the limit depend on the size of the dataset. On Monday we talked about storing the cached data directly in the filesystem or S3, without using the attachments model. In that case it would be convenient to at least store the size of the cache in the dataset, to avoid an extra request to get the file size.
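
Something like this, as a sketch (refresh_data_cache!, write_cache_file and cached_data_size are made-up names, not existing Gobierto methods):

```ruby
# Hypothetical sketch: record the size of the cached file on the dataset row
# right after writing it, so the UI never needs an extra request for the size.
# For S3 we would store the uploaded object's content length instead.
class Dataset < ApplicationRecord
  def refresh_data_cache!
    path = write_cache_file # writes the default query result to the filesystem
    update!(cached_data_size: File.size(path))
  end
end
```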

furilo commented 4 years ago

OK, so we would:

ferblape commented 4 years ago

What if, instead of checking the size in MB, we check the total number of rows? If nrows > 1000, then add the LIMIT.

I'd suggest storing the number of rows (or the size) of the dataset in a kind of metadata field (it could be a jsonb). In the future this field could also store the result of the scrutinizer (and, if I'm not wrong, the number of rows was one of the results obtained by the scrutinizer). This field should be updated together with the cache.
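
A sketch of that metadata field, with made-up table and column names (the real table name and Rails version may differ):

```ruby
# Hypothetical migration: a jsonb field that can hold the row count, the cache
# size and, later on, the scrutinizer results.
class AddDataSummaryToDatasets < ActiveRecord::Migration[6.0]
  def change
    add_column :datasets, :data_summary, :jsonb, null: false, default: {}
  end
end

# Updated by the same job that rebuilds the cache (row_count and
# cache_size_bytes would be computed while writing the cache file):
dataset.update!(
  data_summary: dataset.data_summary.merge(
    "rows" => row_count,
    "size_bytes" => cache_size_bytes
  )
)

# Row-count variant of the check: keep the LIMIT only for big datasets.
apply_limit = dataset.data_summary["rows"].to_i > 1_000
```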

furilo commented 4 years ago

> What if, instead of checking the size in MB, we check the total number of rows? If nrows > 1000, then add the LIMIT.

What matters to us is the size of the data being transferred, and rows might not reflect that with precision.

> I'd suggest storing the number of rows (or the size) of the dataset in a kind of metadata field (it could be a jsonb). In the future this field could also store the result of the scrutinizer (and, if I'm not wrong, the number of rows was one of the results obtained by the scrutinizer). This field should be updated together with the cache.

Oh yes, I forgot about the scrutinizer.

furilo commented 4 years ago

Now that we have the cache for the default dataset, we can proceed with this change, right?

ferblape commented 4 years ago

Yes

furilo commented 4 years ago

@entantoencuanto do we already have the cache.size or similar in the API response?

entantoencuanto commented 4 years ago

Not yet; I'm going to create an issue for that.
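
For reference, a sketch of what exposing it could look like, assuming an ActiveModel::Serializer-style class and the hypothetical data_summary field from above (none of this is the current Gobierto code):

```ruby
# Hypothetical sketch: expose the cache size in the dataset API response so the
# frontend can decide whether to load the full cached data or the preview.
class DatasetSerializer < ActiveModel::Serializer
  attributes :id, :slug, :cache_size_bytes

  def cache_size_bytes
    object.data_summary["size_bytes"]
  end
end
```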

furilo commented 3 years ago

Ping