Safecast / safecastapi

The app that powers api.safecast.org

Incremental data export for api measurements #419

Open julovi opened 6 years ago

julovi commented 6 years ago

Hi,

We were wondering what's the best way to retrieve rows from the Safecast API database as JSON. When we request

https://api.safecast.org/en-US/measurements?captured_after=2016-11-15+01%3A00&captured_before=2016-11-16+01%3A00

we can see that there are more than 23,000 rows, split into pages of 25 rows each. However, when we try

https://api.safecast.org/measurements.json?captured_after=2016-11-15+01%3A00&captured_before=2016-11-16+01%3A00

the JSON results only include the first 25. How could we retrieve all the rows in a single query?

Thanks for your advice,

matschaffer commented 6 years ago

Hi @julovi, https://github.com/Safecast/safecastapi/blob/master/app/controllers/measurements_controller.rb#L55 takes a `per_page` argument as well.

So https://api.safecast.org/measurements.json?captured_after=2016-11-15+01%3A00&captured_before=2016-11-16+01%3A00&per_page=30
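
For example, a paging loop could look something like this (a rough sketch; it assumes the endpoint honors a `page` parameter alongside `per_page` and returns a plain JSON array per page):

```python
import requests

BASE = "https://api.safecast.org/measurements.json"
params = {
    "captured_after": "2016-11-15 01:00",
    "captured_before": "2016-11-16 01:00",
    "per_page": 100,
}

rows, page = [], 1
while True:
    resp = requests.get(BASE, params={**params, "page": page})
    resp.raise_for_status()
    batch = resp.json()
    if not batch:  # an empty page means we have paged past the last row
        break
    rows.extend(batch)
    page += 1

print(f"fetched {len(rows)} measurements")
```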

Though if you're grabbing large chunks of data, you may want to grab a full export from https://blog.safecast.org/downloads/

Loading via the API is great and flexible for smaller sets, but scraping all the data via the API would be very slow and would probably slow things down for other people as well.

If you have some specific needs let us know and we can try to come up with the best way to get you the data.

julovi commented 6 years ago

Hi Mat,

We're currently using the Safecast database to create a different set of visualizations, as described in the attached PDF. So far we have been downloading the data periodically, but we want to automate this process. Our idea is to fetch the previous day's data and incorporate it into our database. Would it be possible to generate such a file (instead of a full export) for us to grab? It would be much easier that way. Thanks for your advice, and have a nice weekend.

Julian Villegas, Ph.D.

eatis2016.pdf

matschaffer commented 6 years ago

@Frangible do you know if we have any previous day exports already? Or something that might make sense for @julovi to use?

If not I can make another job to just do single day exports.

@julovi, is `aws s3 ls s3://safecastdata-us-west-2/ingest/prd/s3raw/` of any use to you for now? It's a public bucket with hourly dumps of the data going into ingest (mostly solarcast & pointcast data, not bgeigie drives).

The ingest data also has an SNS topic we could subscribe either your own webhook or an SQS queue to.
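
If you go the SQS route, the subscription itself is a one-liner with boto3 (a sketch only; both ARNs below are placeholders, and the real topic ARN would come from us):

```python
import boto3

sns = boto3.client("sns", region_name="us-west-2")

# Placeholder ARNs: the ingest topic would be supplied by Safecast,
# and the queue is one you create and own.
sns.subscribe(
    TopicArn="arn:aws:sns:us-west-2:000000000000:safecast-ingest-example",
    Protocol="sqs",
    Endpoint="arn:aws:sqs:us-west-2:111111111111:my-safecast-mirror-queue",
)
```

(The queue also needs a policy allowing the topic to send to it, omitted here.)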

julovi commented 6 years ago

Hi Mat,

Thanks. We reviewed the AWS option, but the data is in a completely different format that would require a large re-implementation of our system. One question we had: if these data are incorporated automatically into the database, what happens when errors are discovered after the incorporation?

I think it's easier, not only for us but for anyone trying to mirror the Safecast database, to have a daily digest of the transactions in the database (new insertions, modifications, deletions?). Have you heard from Nick (@Frangible) about this? Is this something already implemented? If not, and you guys think this is a good idea, is there any way we can help implement it?

Cheers!

matschaffer commented 6 years ago

I'm wondering if it might make sense to make a daily version of https://github.com/Safecast/safecastapi/blob/master/script/dump_clean.sql

As for errors, the latest aspect of our thinking there (especially for streaming data) is covered by https://github.com/Safecast/safecastapi/issues/379 but none of that is implemented yet.

The bgeigie drive data in the feed you're using has at least had someone look at it, but none of our streaming data (solarcast, pointcast, etc.) has any data verification today.

julovi commented 6 years ago

Hi Mat, I think the daily dump version is what we're looking for...

I guess something like:

```sql
\copy (select id, user_id, captured_at, ST_Y(location::geometry), ST_X(location::geometry), value, unit, device_id from measurements WHERE (measurement_import_id NOT IN ... AND captured_at > localtimestamp - interval '24 hours' AND captured_at < localtimestamp + interval '48 hours' AND captured_at IS NOT NULL ...... value BETWEEN 10.00 AND 30000.0))) OR (unit IN ('microsievert','usv') AND value BETWEEN 0.02 ... to '/tmp/mcleanDaily.csv' csv
```

though perhaps without restricting units to 'microsievert' and 'usv' (we're hoping to show other indicators as well), that could do the trick.

Any chance of seeing that soon?

As for the error correction, I guess it's currently difficult without human intervention...

matschaffer commented 6 years ago

Adding the job is pretty straightforward, I just need a script like https://github.com/Safecast/safecastapi/blob/master/script/dump_clean as an entry point.
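
To make the shape concrete, a daily entry point might look roughly like this (hypothetical and untested; it assumes psycopg2, the usual PG* environment variables for the connection, and the column list from the \copy above, minus dump_clean.sql's full bad-data filtering):

```python
# Hypothetical daily export, loosely following dump_clean.sql.
import datetime
import psycopg2

EXPORT_SQL = """
COPY (
    SELECT id, user_id, captured_at,
           ST_Y(location::geometry) AS latitude,
           ST_X(location::geometry) AS longitude,
           value, unit, device_id
    FROM measurements
    WHERE captured_at >= '{start}'
      AND captured_at <  '{end}'
      AND captured_at IS NOT NULL
) TO STDOUT WITH CSV HEADER
"""

today = datetime.date.today()
yesterday = today - datetime.timedelta(days=1)
sql = EXPORT_SQL.format(start=yesterday.isoformat(), end=today.isoformat())

# Empty DSN: psycopg2 falls back to libpq defaults / PG* env vars.
with psycopg2.connect("") as conn, conn.cursor() as cur:
    with open(f"/tmp/mcleanDaily-{yesterday}.csv", "w") as out:
        cur.copy_expert(sql, out)
```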

@thinrope has been looking into getting those jobs more cloud-friendly so maybe he could help.

Pull-requests are, of course, welcome from anyone so if you know what you want, just send it along!

thinrope commented 6 years ago

Catching up on this one... So, the ultimate goal, mirroring the safecast db, is probably not going to be worked on until we finish moving all infra to AWS. As @matschaffer said, scraping the API is neither efficient nor encouraged.

What I am working on at the moment is exporting clean datasets from the psql server into CSV files (later bzip2-ed) and having those available via S3. Currently this is to be run as a cron job, probably daily (it takes about 20 min). Changes (UPDATE/DELETE) are rare in the database as the dataset ages, AFAIK, so it makes sense to have time-based chunks and update them only if there is a change (i.e. only the latest chunk will see changes often, mostly INSERTs).
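
A quick sketch of that change-detection idea (hypothetical, none of this is in the repo yet; it assumes yearly chunks and psycopg2):

```python
# Compare both the row count and max(updated_at) per chunk against
# values recorded at the previous export: this catches INSERTs,
# UPDATEs, and (via the count) most DELETEs.
import psycopg2

with psycopg2.connect("") as conn, conn.cursor() as cur:
    cur.execute("""
        SELECT date_trunc('year', captured_at) AS chunk,
               count(*)                        AS n_rows,
               max(updated_at)                 AS last_change
        FROM measurements
        WHERE captured_at IS NOT NULL
        GROUP BY 1
        ORDER BY 1
    """)
    for chunk, n_rows, last_change in cur.fetchall():
        # Re-export this chunk only if (n_rows, last_change) differs
        # from what was recorded when its CSV was last written.
        print(chunk, n_rows, last_change)
```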

Next steps:

The above scripts currently try to distinguish between mobile and fixed sensors. Fixed sensors are exported without location; there will be another metadata table for that (currently devices.json).

Not all columns are exported; let me know which ones you use. Currently:

fixed_sensors.csv:

```csv
measurement_id,captured_at,CPM,device_id,devicetype_id
24649267,2012-01-02T10:06:34,0.923283773449958,29, ...
```

mobile_sensors.csv:

```csv
measurement_id,captured_at,CPM,latitude,longitude,import_id
13192913,2011-04-23T22:24:19,39,35.6485,139.78848,1 ...
```

If you are interested in dose rates (DRE), those are calculated from the CPM column, based on a linear coefficient that depends on the GM tube (devicetype_id) in that device. Unfortunately, those coefficients are not in the database for all devices yet: for fixed sensors you need devices.json; for mobile_sensors, DRE2CPM=334 (you divide CPM by DRE2CPM, so 33 CPM is about 0.1 μSv/h).
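
In code that conversion is just a division (a quick sketch; the 334 figure applies only to the mobile feed as noted above, and fixed sensors need the per-device coefficient from devices.json):

```python
DRE2CPM = 334.0  # counts per minute per (uSv/h), mobile feed only

def cpm_to_usvh(cpm: float) -> float:
    """Approximate dose rate in uSv/h from a count rate in CPM."""
    return cpm / DRE2CPM

print(cpm_to_usvh(33))  # ~0.099, i.e. about 0.1 uSv/h as in the example
```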

Let us know if you need something else. BTW, if you want to get really deep into the data, please make sure you read and understand most of ios_query.sql, which includes a lot of filtering of bad data.

julovi commented 6 years ago

Hi Kalin,

Thanks for your detailed response. I understand that there are more pressing issues at the moment and that's fine.

Our goal is to be able to compare cities (on different environmental indicators) using Safecast data, that is, not only radiation but any other measurement when available. I know that at the moment most of the data is radiation, but I guess that in the future we'll see more indicators as more sensor kits are deployed. We also don't compare DRE, since I find that a bit arbitrary: I think it is fair to assume that radiation around the Fukushima plant is produced by the decay of cesium, but that's not necessarily the case (in my opinion) for radiation found elsewhere in the world. So we compare CPMs.

What you mentioned about the fixed devices is interesting; would we be able to query a table to retrieve their location? Naively, we were thinking of getting a CSV file

measurement_id,latitude,longitude,captured_at,unit,value

where unit can be a string describing whether the column 'value' refers to CPM, NOX, etc.

There's a student currently working on this project; I guess we could:

  1. Download the database as it is now
  2. Write the SQL query to create the table as we need it
  3. Do a pull request?

Does this plan make sense?

Now, regarding the validation of the data, that's something that perhaps needs to be discussed in more detail. For now we will basically take the data at face value, and later on see if there are any modifications (I'm not sure how).

matschaffer commented 6 years ago

Yep, though I wouldn't say you need to download the whole db. Just a local database set up using the instructions at https://github.com/Safecast/safecastapi/blob/master/README.md#development will let you upload a log file or two (you can download one from api.safecast.org) and get an idea of what your ideal data-export SQL should look like.

thinrope commented 6 years ago

> We also don't compare DRE, since I find that a bit arbitrary: I think it is fair to assume that radiation around the Fukushima plant is produced by the decay of cesium, but that's not necessarily the case (in my opinion) for radiation found elsewhere in the world. So we compare CPMs.

Aw, hmm, well, except for the first and the last statements here, I completely agree... Comparing DRE is always a sensible thing to do, assuming it is calculated properly (BTW, it is always calculated, never directly measured). This is where different sensors and measurement techniques provide different precision and resolution. For our initial case (Fukushima 2011, mostly gamma, >1 m above ground, pointing from 45° up to the horizon), we confirmed that we can calculate DRE from CPM with good enough precision to stay within the sensor's raw precision (+/-15%). This is not true in the general case (I often show people a slightly radioactive lantern mantle touching the sensor: the display instantly shows about 50-70 uSv/h and people get scared; I reassure them that if that were true, I would not be holding it in my hand; it is just a calculation based on wrong assumptions).

Comparing CPM is ONLY meaningful if at least:

> What you mentioned about the fixed devices is interesting; would we be able to query a table to retrieve their location?

Did you look at devices.json? It does contain location.

> Naively, we were thinking of getting a CSV file measurement_id,latitude,longitude,captured_at,unit,value where unit can be a string describing whether the column 'value' refers to CPM, NOX, etc.

Yes, that may be possible (one day), but it is really a very simplistic view of the world. Actually, radiation/CPM is one of the easiest things (maybe only temperature is simpler). Most other computed measurements are difficult to present as "value, unit" directly from raw data. Even "simple" things like wind: it has speed, direction (in 3D if you are measuring from a balloon), and variability (gusty, constant, etc.). Things get a lot denser when people ask about "air quality"... But this is what we are trying to work on!

So, to sum up: most sensor data consists of time-stamped values (simple or array) in some units, plus metadata describing what is being measured and how, where, by whom, how to interpret it, etc. Some visualisations should be looking into the metadata (e.g. location) to interpret sensor readings.

Currently there are a few datasets being exported; please look at the code that creates them, as well as the actual exports.

thinrope commented 5 years ago

Waiting a few days for the move to AWS to stabilize before stabbing at this again...

Comment from @matschaffer via Slack: we could consider doing it on updated_at. We don't have an index on updated_at right now, but maybe we should add one. It would probably be worth the ~4 GB if we'll also be using it for timeboxed exports.
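
For reference, the index in question would be something like this (a sketch in raw SQL driven from psycopg2, not the Rails migration the repo would actually use; the index name is made up):

```python
import psycopg2

conn = psycopg2.connect("")
conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run in a transaction
with conn.cursor() as cur:
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS "
        "index_measurements_on_updated_at ON measurements (updated_at)"
    )
conn.close()
```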