Closed ethervoid closed 7 years ago
@ethervoid, the timespan argument for OBS_GetAvailableGeometries will check whether the geometry is valid for the given timespan. For US TIGER geometries, the timespans are single years. For US measures, the timespans are 5 year (2011 - 2015) or 1 year estimates (2015 - 2015) or Quarters (2015Q1). So you need to pass a geometry timespan into the argument when using OBS_GetAvailableGeometries, like:
SELECT * FROM OBS_GetAvailableGeometries((SELECT ST_SetSRID(ST_Extent(the_geom), 4326)
FROM (select * from lightrailstations) q), NULL, 'us.census.acs.B08201002',
'us.census.acs.B11001001', '2015') denoms WHERE valid_numer IS TRUE AND
(valid_denom IS TRUE OR 'us.census.acs.B11001001' IS NULL) AND valid_timespan IS TRUE
ORDER BY numgeoms DESC
@michellemho Yes, geometries are not able to receive range of years but if you make a call to the OBS_GetAvailableTimespan
you're going to get this:
cartodb_staging_user_3c8e3afc-8117-4edb-bc99-eb065d907e76_db=> SELECT timespan_id FROM OBS_GetAvailableTimespans((SELECT ST_SetSRID(ST_Extent(the_geom), 4326)
FROM (select * from lightrailstations) q), NULL, 'us.census.acs.B08201002',
'us.census.acs.B11001001') numer WHERE valid_numer IS TRUE AND
(valid_denom IS TRUE OR 'us.census.acs.B11001001' IS NULL);
-[ RECORD 1 ]------------
timespan_id | 2006 - 2010
-[ RECORD 2 ]------------
timespan_id | 2010 - 2014
-[ RECORD 3 ]------------
timespan_id | 2011 - 2015
-[ RECORD 4 ]------------
timespan_id | 2015 - 2015
-[ RECORD 5 ]------------
timespan_id | 2014 - 2014
-[ RECORD 6 ]------------
timespan_id | 2010 - 2010
This data is what we're using in the UI to filter geometries with timespan, so if we pass '2015 - 2015' to filter the geometries we're going to get an empty resultset because is waiting for a single year. We're not able to know what year we want to use to filter we were relying in the data returned by OBS_GetAvailableTimespan
@ethervoid I see what you're saying. I think you'll need to use some combination of OBS_GetAvailableTimespans and OBS_GetAvailableGeometries together... There should be a way! Sorry for delay in reply, I'll investigate.
Great! thank you @michellemho for your help! :)
I've been taking a deeper look at this. As @ethervoid said, we have a mismatch between the timespans in numerators/denominators and geometries, for the ACS data. A lot of the following is a repeat of the previous comments.
For numerators, we have timespans in three formats: 2015 - 2015
, 2011 - 2015
and 2015Q2
. All of these should be matching the geometries for 2015 as multi-year data should use the last year. Geometries have the year (2015
) as a timespan.
Eventually, these rules should be encoded in the observatory somehow. My choice would be to do it in metadata, avoiding special code for this things (as the matching rules may vary by country). Anyway, this is not the case right now, as we only have one timespan per geometry level.
Back to the issue, as we only have one timespan per geometry, I think it makes little sense to filter geometries by timespan. This can be achieved just by removing AND valid_timespan IS TRUE
from the original query.
Fully agree, we don't have to make special rules and include/modify the metadata to reflect this
@javitonino, I'm not sure I understand when you say we only have one timespan per geometry level? I think we have multiple timespans for the geometry levels... for example, it looks like we have 2014 and 2015 for the United States tiger shapes.
If I'm understanding this correctly, the goal is to use the correct geometry from the correct year (timespan). For example, a user wants to get Population data from the 2010 - 2014 ACS Survey for US counties. Therefore, 2014 county shapes should be used in this query.
I'm also discovering some weirdness with the timespans for geometries:
SELECT * FROM OBS_GetAvailableTimespans((SELECT ST_SetSRID(ST_Extent(the_geom), 4326)
FROM (select * from lightrailstations) q), NULL,NULL,NULL, 'us.census.tiger.county')
WHERE valid_geom is true
I expect to only see single years (like 2015, 2014) because those are the only timespans that are valid for the geom of us.census.tiger.county
... but this is the return. It includes timespans like 2006 - 2010
and 2014Q2
(not in the screenshot).
Hi @michellemho and thank you for helping take a look at that. Regarding geometries and timespans, this is what I'm seeing in the database for the US:
geom_id | timespans | geom_name
----------------------------------------------------+-----------+--------------------------------------------------------
us.census.tiger.block | {2015} | US Census Blocks
us.census.tiger.block_group | {2015} | US Census Block Groups
us.census.tiger.block_group_clipped | {2014} | Shoreline clipped US Census Block Groups
us.census.tiger.cbsa | {2015} | Core Based Statistical Area (CBSA)
us.census.tiger.cbsa_clipped | {2014} | Shoreline clipped Core Based Statistical Area (CBSA)
us.census.tiger.census_tract | {2015} | US Census Tracts
us.census.tiger.census_tract_clipped | {2014} | Shoreline clipped US Census Tracts
us.census.tiger.congressional_district | {2015} | US Congressional Districts
us.census.tiger.congressional_district_clipped | {2014} | Shoreline clipped US Congressional Districts
us.census.tiger.county | {2015} | US County
us.census.tiger.county_clipped | {2014} | Shoreline clipped US County
us.census.tiger.place | {2015} | Incorporated Places
us.census.tiger.place_clipped | {2014} | Shoreline clipped Incorporated Places
us.census.tiger.pointlm_geom | {2016} |
us.census.tiger.prisecroads_geom | {2016} |
us.census.tiger.puma | {2015} | US Census Public Use Microdata Areas
us.census.tiger.puma_clipped | {2014} | Shoreline clipped US Census Public Use Microdata Areas
us.census.tiger.school_district_elementary | {2015} | Elementary School District
us.census.tiger.school_district_elementary_clipped | {2014} | Shoreline clipped Elementary School District
us.census.tiger.school_district_secondary | {2015} | Secondary School District
us.census.tiger.school_district_secondary_clipped | {2014} | Shoreline clipped Secondary School District
us.census.tiger.school_district_unified | {2015} | Unified School District
us.census.tiger.school_district_unified_clipped | {2014} | Shoreline clipped Unified School District
us.census.tiger.state | {2015} | US States
us.census.tiger.state_clipped | {2014} | Shoreline clipped US States
us.census.tiger.zcta5 | {2015} | US Census Zip Code Tabulation Areas
us.census.tiger.zcta5_clipped | {2014} | Shoreline clipped US Census Zip Code Tabulation Areas
us.epa.huc.hydro_unit | {2015} | Subwatershed hydrological unit
us.ny.nyc.columns.parcel | {2016v2} |
As you can see, we have multiple geometries, but only a single timespan for each. In general, it seems we have TIGER geometries from 2015, and the water-clipped versions from 2014.
We only have one date for timespan for each version of the geometry, that's why I was suggesting to ignore timespans for now when determining what geometries are available, until we can improve the metadata.
Regarding your second comment, this looks curious. I am seeing that the timespans associated to a geometry do not always match the geometries associated to timespans. That might be why OBS_GetAvailableTimespans
and OBS_GetAvailableGeometries
are returning inconsistent/confusing results. I'll investigate a bit.
This is the (simplified) query used to generate obs_meta_timespan
:
SELECT numer_timespan::TEXT timespan_id,
ARRAY_AGG(DISTINCT geom_id)::TEXT[] geoms
FROM observatory.obs_meta
GROUP BY numer_timespan;
And this is the one used for obs_meta_geom
:
SELECT geom_id::TEXT,
ARRAY_AGG(DISTINCT geom_timespan)::TEXT[] timespans
FROM observatory.obs_meta
GROUP BY geom_id;
The first one is used for OBS_GetAvailableTimespans
and as you can see, it lists all geometries grouped by numerator timespan. It does no effort to try to filter only geometries that match the timespan of the numerator. This is what caused the strange behaviour seen by @michellemho: that function is not actually listing the timespans for the geometries, but the timespans for the numerators that work on top of those geometries.
The second one is used for OBS_GetAvailableGeometries
and it list all timespans for a geometry, matching by geom_timespan. This returns less results than above, and is the cause that the first query is returning empty, since no geometry timespans match numerator timespans.
More info (this is starting to become a data dump). We have data for US counties for 2014 and 2015:
select * from observatory.obs_column_table where column_id like 'us.census.tiger.county';
-[ RECORD 1 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
column_id | us.census.tiger.county
table_id | us.census.tiger.sum_level_county_2014_fc3b5a4f3c
colname | the_geom
extra | {"stats": {"count": 3233.0, "avg": 3049001323.62662, "mode": 5146977.97199068, "stddev": 9786007602.96718, "notnull": 3233.0, "max": 382813145076.423, "min": 5146977.97199068, "median": 1624139.30889}}
-[ RECORD 2 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
column_id | us.census.tiger.county
table_id | us.census.tiger.sum_level_county_2015_c512ae8318
colname | the_geom
extra | {"stats": {"count": 3233.0, "avg": 3048999599.80874, "mode": 5311308.60284487, "stddev": 9787312848.79033, "notnull": 3233.0, "max": 382779368948.197, "min": 5311308.60284487, "median": 1624015.17715}}
However, only one appears in the metadata:
select geom_id, geom_timespan from observatory.obs_meta where geom_id like 'us.census.tiger.county' group by geom_id, geom_timespan;
-[ RECORD 1 ]-+-----------------------
geom_id | us.census.tiger.county
geom_timespan | 2015
The oldest one is excluded during the metadata generation process: https://github.com/CartoDB/bigmetadata/blob/master/tasks/carto.py#L897 https://github.com/CartoDB/bigmetadata/blob/master/tasks/carto.py#L982
So, then we get to OBS_GetAvailableGeometries
. Why is it returning the county geoemtry twice, if it is only once in the metadata:
SELECT count(*) FROM cdb_observatory.OBS_GetAvailableGeometries(
ST_SetSRID('01030000000100000005000000337EC9B8E24C5AC085EFBD53DCC54340337EC9B8E24C5AC0AB8CF50452EC434070748070212B5AC0AB8CF50452EC434070748070212B5AC085EFBD53DCC54340337EC9B8E24C5AC085EFBD53DCC54340'::geometry,4326),
NULL,
'us.census.acs.B08201002',
'us.census.acs.B11001001',
'2015 - 2015'
) denoms
WHERE
valid_numer IS TRUE AND
(valid_denom IS TRUE OR 'us.census.acs.B11001001' IS NULL) AND
geom_id = 'us.census.tiger.county';
-[ RECORD 1 ]
count | 2
The answer is _OBS_GetGeometryScores
. There is a join with this function to return the scores for the geometry. But _OBS_GetGeometryScores
queries the obs_column
table directory (as opposed to query obs_meta_*
) so it sees the two geometries with two timespans and returns both!
So:
obs_meta
generationOBS_GetAvailableGeometries
eventually triggers both metadata sets, and gets duplicated resultsIMHO, looking at the code, it appears the we decided to ignore timespans in metadata (and it's also the most straight-forward approach). So, to get this working, we would need to:
AND valid_timespan IS TRUE
from the DO UI queryOBS_GetAvailableGeometries
to only use obs_meta_*
tables.For the future, we should look into an strategy for geometries with different timespans, but I think that goes out of the current scope: trying to get DO UI to work.
Possible fix:
WITH _data as (
SELECT * FROM OBS_GetAvailableGeometries((SELECT ST_SetSRID(ST_Extent(the_geom), 4326)
FROM ({{ query }}) q), NULL, {{ numer_id }},
{{ denom_id }}, {{ timespan }}) denoms WHERE valid_numer IS TRUE AND
(valid_denom IS TRUE OR {{ denom_id }} IS NULL)
ORDER BY numgeoms DESC
) SELECT DISTINCT on (geom_id) * FROM _data;
Closing because the objective of this issue was solved which was to bring a solution for the UI to get the geometries for a numerator/denominator/timespan. We don't filter by time-span. Other issues have arisen and issues have been created for all of them
@nobuti has found an issue with geom and the timespans, we have data that when we filter by numer and denom we get null. The used query, with this dataset:
with the following result:
checking the internal data:
but if we use that timespan id to filter the geom, we get an empty result because the timespan is stored differently. The numer:
the denom
and the geom for that numer and denom:
Looks like the geoms are stored by year so we're not able to filter by geoms using the timespan_id. But we have other cases:
here you can see timespan id like
2012Q1
which can't be used to filter geoms because these are the available timespans for geom data:What do you think @michellemho @stuartlynn did we misunderstand anything?