CartoDB / observatory-extension

BSD 3-Clause "New" or "Revised" License
6 stars 4 forks source link

Doubts about timespans filtering for OBS_GetAvailableGeom #289

Closed ethervoid closed 7 years ago

ethervoid commented 7 years ago

@nobuti has found an issue with geom and the timespans, we have data that when we filter by numer and denom we get null. The used query, with this dataset:

SELECT * FROM OBS_GetAvailableGeometries((SELECT ST_SetSRID(ST_Extent(the_geom), 4326) 
FROM (select * from lightrailstations) q), NULL, 'us.census.acs.B08201002', 
'us.census.acs.B11001001', '2015 - 2015') denoms WHERE valid_numer IS TRUE AND 
(valid_denom IS TRUE OR 'us.census.acs.B11001001' IS NULL) AND valid_timespan IS TRUE 
ORDER BY numgeoms DESC

with the following result:

cartodb_staging_user_3c8e3afc-8117-4edb-bc99-eb065d907e76_db=> SELECT * FROM OBS_GetAvailableGeometries((SELECT ST_SetSRID(ST_Extent(the_geom), 4326) FROM (select * from lightrailstations) q), NULL, 'us.census.acs.B08201002', 'us.census.acs.B11001001', '2015 - 2015') denoms WHERE valid_numer IS TRUE AND (valid_denom IS TRUE OR 'us.census.acs.B11001001' IS NULL) AND valid_timespan IS TRUE ORDER BY numgeoms DESC;
(0 rows)

checking the internal data:

cartodb_staging_user_3c8e3afc-8117-4edb-bc99-eb065d907e76_db=> SELECT timespan_id, timespan_name FROM OBS_GetAvailableTimespans((SELECT ST_SetSRID(ST_Extent(the_geom), 4326) FROM (select * from lightrailstations) q), NULL, 'us.census.acs.B08201002', 'us.census.acs.B11001001') denoms WHERE valid_numer IS TRUE AND (valid_denom IS TRUE OR 'us.census.acs.B11001001' IS NULL) ORDER BY timespan_name DESC
;
-[ RECORD 1 ]-+------------
timespan_id   | 2015 - 2015
timespan_name | 2015 - 2015
-[ RECORD 2 ]-+------------
timespan_id   | 2014 - 2014
timespan_name | 2014 - 2014
-[ RECORD 3 ]-+------------
timespan_id   | 2011 - 2015
timespan_name | 2011 - 2015
-[ RECORD 4 ]-+------------
timespan_id   | 2010 - 2014
timespan_name | 2010 - 2014
-[ RECORD 5 ]-+------------
timespan_id   | 2010 - 2010
timespan_name | 2010 - 2010
-[ RECORD 6 ]-+------------
timespan_id   | 2006 - 2010
timespan_name | 2006 - 2010

but if we use that timespan id to filter the geom, we get an empty result because the timespan is stored differently. The numer:

obs_2017_05_04_f20ee1131c=# select numer_id, timespans from observatory.obs_meta_numer where numer_id = 'us.census.acs.B08201002';
-[ RECORD 1 ]------------------------------------------------------------------------------------
numer_id  | us.census.acs.B08201002
timespans | {"2006 - 2010","2010 - 2010","2010 - 2014","2011 - 2015","2014 - 2014","2015 - 2015"}

the denom

-[ RECORD 1 ]------------------------------------------------------------------------------------
denom_id  | us.census.acs.B11001001
timespans | {"2006 - 2010","2010 - 2010","2010 - 2014","2011 - 2015","2014 - 2014","2015 - 2015"}

and the geom for that numer and denom:

obs_2017_05_04_f20ee1131c=# select timespans FROm observatory.obs_meta_geom where 'us.census.acs.B08201002' = ANY(numers) and 'us.census.acs.B11001001' = ANY(denoms) group by timespans;
-[ RECORD 1 ]-----
timespans | {2015}
-[ RECORD 2 ]-----
timespans | {2014}

Looks like the geoms are stored by year so we're not able to filter by geoms using the timespan_id. But we have other cases:

-[ RECORD 1 ]----------------------------------------------------------------------------------------------------------------
numer_id  | us.bls.avg_wkly_wage_10
timespans | {2012Q1,2012Q2,2012Q3,2012Q4,2013Q1,2013Q2,2013Q3,2013Q4,2014Q1,2014Q2,2014Q3,2014Q4,2015Q1,2015Q2,2015Q3,2015Q4}
denoms    | {NULL}
geoms     | {us.census.tiger.county,us.census.tiger.county_clipped}

here you can see timespan id like 2012Q1 which can't be used to filter geoms because these are the available timespans for geom data:

obs_2017_05_04_f20ee1131c=# select timespans from observatory.obs_meta_geom group by timespans;
-[ RECORD 1 ]---------
timespans | {2011}
-[ RECORD 2 ]---------
timespans | {2010}
-[ RECORD 3 ]---------
timespans | {2013}
-[ RECORD 4 ]---------
timespans | {20150101}
-[ RECORD 5 ]---------
timespans | {2016}
-[ RECORD 6 ]---------
timespans | {2014}
-[ RECORD 7 ]---------
timespans | {2015}
-[ RECORD 8 ]---------
timespans | {2016v2}

What do you think @michellemho @stuartlynn did we misunderstand anything?

michellemho commented 7 years ago

@ethervoid, the timespan argument for OBS_GetAvailableGeometries will check whether the geometry is valid for the given timespan. For US TIGER geometries, the timespans are single years. For US measures, the timespans are 5 year (2011 - 2015) or 1 year estimates (2015 - 2015) or Quarters (2015Q1). So you need to pass a geometry timespan into the argument when using OBS_GetAvailableGeometries, like:

SELECT * FROM OBS_GetAvailableGeometries((SELECT ST_SetSRID(ST_Extent(the_geom), 4326) 
FROM (select * from lightrailstations) q), NULL, 'us.census.acs.B08201002', 
'us.census.acs.B11001001', '2015') denoms WHERE valid_numer IS TRUE AND 
(valid_denom IS TRUE OR 'us.census.acs.B11001001' IS NULL) AND valid_timespan IS TRUE 
ORDER BY numgeoms DESC
ethervoid commented 7 years ago

@michellemho Yes, geometries are not able to receive range of years but if you make a call to the OBS_GetAvailableTimespan you're going to get this:

cartodb_staging_user_3c8e3afc-8117-4edb-bc99-eb065d907e76_db=> SELECT timespan_id FROM OBS_GetAvailableTimespans((SELECT ST_SetSRID(ST_Extent(the_geom), 4326)
FROM (select * from lightrailstations) q), NULL, 'us.census.acs.B08201002',
'us.census.acs.B11001001') numer WHERE valid_numer IS TRUE AND
(valid_denom IS TRUE OR 'us.census.acs.B11001001' IS NULL);
-[ RECORD 1 ]------------
timespan_id | 2006 - 2010
-[ RECORD 2 ]------------
timespan_id | 2010 - 2014
-[ RECORD 3 ]------------
timespan_id | 2011 - 2015
-[ RECORD 4 ]------------
timespan_id | 2015 - 2015
-[ RECORD 5 ]------------
timespan_id | 2014 - 2014
-[ RECORD 6 ]------------
timespan_id | 2010 - 2010

This data is what we're using in the UI to filter geometries with timespan, so if we pass '2015 - 2015' to filter the geometries we're going to get an empty resultset because is waiting for a single year. We're not able to know what year we want to use to filter we were relying in the data returned by OBS_GetAvailableTimespan

michellemho commented 7 years ago

@ethervoid I see what you're saying. I think you'll need to use some combination of OBS_GetAvailableTimespans and OBS_GetAvailableGeometries together... There should be a way! Sorry for delay in reply, I'll investigate.

ethervoid commented 7 years ago

Great! thank you @michellemho for your help! :)

javitonino commented 7 years ago

I've been taking a deeper look at this. As @ethervoid said, we have a mismatch between the timespans in numerators/denominators and geometries, for the ACS data. A lot of the following is a repeat of the previous comments.

For numerators, we have timespans in three formats: 2015 - 2015, 2011 - 2015 and 2015Q2. All of these should be matching the geometries for 2015 as multi-year data should use the last year. Geometries have the year (2015) as a timespan.

Eventually, these rules should be encoded in the observatory somehow. My choice would be to do it in metadata, avoiding special code for this things (as the matching rules may vary by country). Anyway, this is not the case right now, as we only have one timespan per geometry level.

Back to the issue, as we only have one timespan per geometry, I think it makes little sense to filter geometries by timespan. This can be achieved just by removing AND valid_timespan IS TRUE from the original query.

ethervoid commented 7 years ago

Fully agree, we don't have to make special rules and include/modify the metadata to reflect this

michellemho commented 7 years ago

@javitonino, I'm not sure I understand when you say we only have one timespan per geometry level? I think we have multiple timespans for the geometry levels... for example, it looks like we have 2014 and 2015 for the United States tiger shapes.

If I'm understanding this correctly, the goal is to use the correct geometry from the correct year (timespan). For example, a user wants to get Population data from the 2010 - 2014 ACS Survey for US counties. Therefore, 2014 county shapes should be used in this query.

michellemho commented 7 years ago

I'm also discovering some weirdness with the timespans for geometries:

SELECT * FROM OBS_GetAvailableTimespans((SELECT ST_SetSRID(ST_Extent(the_geom), 4326) 
FROM (select * from lightrailstations) q), NULL,NULL,NULL, 'us.census.tiger.county')
WHERE valid_geom is true

I expect to only see single years (like 2015, 2014) because those are the only timespans that are valid for the geom of us.census.tiger.county... but this is the return. It includes timespans like 2006 - 2010 and 2014Q2 (not in the screenshot).

image

javitonino commented 7 years ago

Hi @michellemho and thank you for helping take a look at that. Regarding geometries and timespans, this is what I'm seeing in the database for the US:

                      geom_id                       | timespans |                       geom_name                        
----------------------------------------------------+-----------+--------------------------------------------------------
 us.census.tiger.block                              | {2015}    | US Census Blocks
 us.census.tiger.block_group                        | {2015}    | US Census Block Groups
 us.census.tiger.block_group_clipped                | {2014}    | Shoreline clipped US Census Block Groups
 us.census.tiger.cbsa                               | {2015}    | Core Based Statistical Area (CBSA)
 us.census.tiger.cbsa_clipped                       | {2014}    | Shoreline clipped Core Based Statistical Area (CBSA)
 us.census.tiger.census_tract                       | {2015}    | US Census Tracts
 us.census.tiger.census_tract_clipped               | {2014}    | Shoreline clipped US Census Tracts
 us.census.tiger.congressional_district             | {2015}    | US Congressional Districts
 us.census.tiger.congressional_district_clipped     | {2014}    | Shoreline clipped US Congressional Districts
 us.census.tiger.county                             | {2015}    | US County
 us.census.tiger.county_clipped                     | {2014}    | Shoreline clipped US County
 us.census.tiger.place                              | {2015}    | Incorporated Places
 us.census.tiger.place_clipped                      | {2014}    | Shoreline clipped Incorporated Places
 us.census.tiger.pointlm_geom                       | {2016}    | 
 us.census.tiger.prisecroads_geom                   | {2016}    | 
 us.census.tiger.puma                               | {2015}    | US Census Public Use Microdata Areas
 us.census.tiger.puma_clipped                       | {2014}    | Shoreline clipped US Census Public Use Microdata Areas
 us.census.tiger.school_district_elementary         | {2015}    | Elementary School District
 us.census.tiger.school_district_elementary_clipped | {2014}    | Shoreline clipped Elementary School District
 us.census.tiger.school_district_secondary          | {2015}    | Secondary School District
 us.census.tiger.school_district_secondary_clipped  | {2014}    | Shoreline clipped Secondary School District
 us.census.tiger.school_district_unified            | {2015}    | Unified School District
 us.census.tiger.school_district_unified_clipped    | {2014}    | Shoreline clipped Unified School District
 us.census.tiger.state                              | {2015}    | US States
 us.census.tiger.state_clipped                      | {2014}    | Shoreline clipped US States
 us.census.tiger.zcta5                              | {2015}    | US Census Zip Code Tabulation Areas
 us.census.tiger.zcta5_clipped                      | {2014}    | Shoreline clipped US Census Zip Code Tabulation Areas
 us.epa.huc.hydro_unit                              | {2015}    | Subwatershed hydrological unit
 us.ny.nyc.columns.parcel                           | {2016v2}  | 

As you can see, we have multiple geometries, but only a single timespan for each. In general, it seems we have TIGER geometries from 2015, and the water-clipped versions from 2014.

We only have one date for timespan for each version of the geometry, that's why I was suggesting to ignore timespans for now when determining what geometries are available, until we can improve the metadata.

Regarding your second comment, this looks curious. I am seeing that the timespans associated to a geometry do not always match the geometries associated to timespans. That might be why OBS_GetAvailableTimespans and OBS_GetAvailableGeometries are returning inconsistent/confusing results. I'll investigate a bit.

javitonino commented 7 years ago

This is the (simplified) query used to generate obs_meta_timespan:

SELECT numer_timespan::TEXT timespan_id,
       ARRAY_AGG(DISTINCT geom_id)::TEXT[] geoms
FROM observatory.obs_meta
GROUP BY numer_timespan;

And this is the one used for obs_meta_geom:

SELECT geom_id::TEXT,
         ARRAY_AGG(DISTINCT geom_timespan)::TEXT[] timespans
  FROM observatory.obs_meta
  GROUP BY geom_id;

The first one is used for OBS_GetAvailableTimespans and as you can see, it lists all geometries grouped by numerator timespan. It does no effort to try to filter only geometries that match the timespan of the numerator. This is what caused the strange behaviour seen by @michellemho: that function is not actually listing the timespans for the geometries, but the timespans for the numerators that work on top of those geometries.

The second one is used for OBS_GetAvailableGeometries and it list all timespans for a geometry, matching by geom_timespan. This returns less results than above, and is the cause that the first query is returning empty, since no geometry timespans match numerator timespans.

javitonino commented 7 years ago

More info (this is starting to become a data dump). We have data for US counties for 2014 and 2015:

select * from observatory.obs_column_table where column_id like 'us.census.tiger.county';
-[ RECORD 1 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
column_id | us.census.tiger.county
table_id  | us.census.tiger.sum_level_county_2014_fc3b5a4f3c
colname   | the_geom
extra     | {"stats": {"count": 3233.0, "avg": 3049001323.62662, "mode": 5146977.97199068, "stddev": 9786007602.96718, "notnull": 3233.0, "max": 382813145076.423, "min": 5146977.97199068, "median": 1624139.30889}}
-[ RECORD 2 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
column_id | us.census.tiger.county
table_id  | us.census.tiger.sum_level_county_2015_c512ae8318
colname   | the_geom
extra     | {"stats": {"count": 3233.0, "avg": 3048999599.80874, "mode": 5311308.60284487, "stddev": 9787312848.79033, "notnull": 3233.0, "max": 382779368948.197, "min": 5311308.60284487, "median": 1624015.17715}}

However, only one appears in the metadata:

select geom_id, geom_timespan from observatory.obs_meta where geom_id like 'us.census.tiger.county' group by geom_id, geom_timespan;
-[ RECORD 1 ]-+-----------------------
geom_id       | us.census.tiger.county
geom_timespan | 2015

The oldest one is excluded during the metadata generation process: https://github.com/CartoDB/bigmetadata/blob/master/tasks/carto.py#L897 https://github.com/CartoDB/bigmetadata/blob/master/tasks/carto.py#L982

So, then we get to OBS_GetAvailableGeometries. Why is it returning the county geoemtry twice, if it is only once in the metadata:

SELECT count(*) FROM cdb_observatory.OBS_GetAvailableGeometries(
  ST_SetSRID('01030000000100000005000000337EC9B8E24C5AC085EFBD53DCC54340337EC9B8E24C5AC0AB8CF50452EC434070748070212B5AC0AB8CF50452EC434070748070212B5AC085EFBD53DCC54340337EC9B8E24C5AC085EFBD53DCC54340'::geometry,4326),
  NULL,
  'us.census.acs.B08201002',
  'us.census.acs.B11001001',
  '2015 - 2015'
) denoms
WHERE 
  valid_numer IS TRUE AND
  (valid_denom IS TRUE OR 'us.census.acs.B11001001' IS NULL) AND
  geom_id = 'us.census.tiger.county';
-[ RECORD 1 ]
count | 2

The answer is _OBS_GetGeometryScores. There is a join with this function to return the scores for the geometry. But _OBS_GetGeometryScores queries the obs_column table directory (as opposed to query obs_meta_*) so it sees the two geometries with two timespans and returns both!

So:

IMHO, looking at the code, it appears the we decided to ignore timespans in metadata (and it's also the most straight-forward approach). So, to get this working, we would need to:

For the future, we should look into an strategy for geometries with different timespans, but I think that goes out of the current scope: trying to get DO UI to work.

ethervoid commented 7 years ago

Possible fix:

WITH _data as (
    SELECT * FROM OBS_GetAvailableGeometries((SELECT ST_SetSRID(ST_Extent(the_geom), 4326)
    FROM ({{ query }}) q), NULL, {{ numer_id }},
    {{ denom_id }}, {{ timespan }}) denoms WHERE valid_numer IS TRUE AND
    (valid_denom IS TRUE OR {{ denom_id }} IS NULL)
    ORDER BY numgeoms DESC
) SELECT DISTINCT on (geom_id) * FROM _data;
ethervoid commented 7 years ago

Closing because the objective of this issue was solved which was to bring a solution for the UI to get the geometries for a numerator/denominator/timespan. We don't filter by time-span. Other issues have arisen and issues have been created for all of them