Open michellemho opened 7 years ago
This reduction issue should be fixed before tackling the names and descriptions (which is described in this issue https://github.com/CartoDB/bigmetadata/issues/168)
I'll need more feedback on this, but I believe users are most interested in getting total population, population breakdowns by age and gender, median income, and maybe just a handful of other variables.
John hand-curated the variables available for the United States from the American Community Survey (ACS). There are thousands of variables to choose from-- he only included a few hundred. These names and descriptions were manually written into the ETL process in the ColumnsTask of the acs.py file.
Yep, we can probably begin with a set of generic values that are available for most of the countries and leave the most specific ones aside. In the future, we might be interested in giving users the possibility to add them to their accounts only if they're interested in them.
@juanignaciosl @saleiva @stuartlynn @ethervoid @javitonino @kevin-reilly This is my first pass to come up with bare minimum variables + key variables for the Data Observatory. The main purposes are: 1) consistency across countries and 2) better names and descriptions. When we start adding in censuses, we should align as closely as possible to these variables first. I'll use this list to "prune" the existing Data Observatory (especially Australia, Canada, and Brazil). If other variables exist and are available, they should be added on an ad-hoc basis by country.
Age & Gender
Households (Families)
Housing
Income
Employment
Education
Nationality
Race and Ethnicity
Religion
Language
Commerce & Economy
Health
One thing @stuartlynn and I just spoke about was introducing a "public" tag to the data in the DO. The dataset Michelle lists above would be "public" and everything else would be private. This would allow us to ingest any data we may want for internal purposes but only publish certain sets to Builder users.
This would improve UI performance and, I think, allow us to do some of the smarter filtering we wanted to do.
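A minimal sketch of how such a "public" tag could be used to filter what is exposed to Builder users. The record layout (`id`, `name`, `tags`) and the function name `visible_measures` are assumptions made for illustration, not the actual Data Observatory schema or API:

```python
# Hypothetical in-memory catalog; the real DO metadata lives in database tables
# whose schema is not shown in this thread.
CATALOG = [
    {"id": "us.census.acs.B01003001", "name": "Total Population", "tags": ["public"]},
    {"id": "us.census.acs.B19013001",
     "name": "Median Household Income in the past 12 Months", "tags": ["public"]},
    {"id": "br.data.Pessoa07_V039",
     "name": "Daughters of 38 year-old Guardian and Spouse", "tags": []},
]


def visible_measures(measures, include_private=False):
    """Return the measures a caller may see.

    Internal callers pass include_private=True and get everything;
    everyone else only gets measures tagged "public".
    """
    if include_private:
        return list(measures)
    return [m for m in measures if "public" in m.get("tags", ())]


public_ids = [m["id"] for m in visible_measures(CATALOG)]
all_ids = [m["id"] for m in visible_measures(CATALOG, include_private=True)]
```

With this shape, ingesting a new detailed variable for internal use does not change what the UI has to list, because untagged measures are private by default.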
(cc: @noguerol)
Do we have any data or log about the actual consumption of measurements?
@saleiva AFAIK we don't have metrics about the consumption of measurements. We have to add them to the DS metrics in order to be able to query them.
As a temporary solution we could go through the named maps in Redis and gather the data from the analysis config
As a quick first step I've made a script and gathered some stats from the named maps in Redis. Here you have a CSV file with the id, name, description and number of uses of each measure in analyses in production.
The top five most used measures in analysis are (in descending order):
id | us.census.acs.B01003001
hits | 299
name | Total Population
description | The total number of all people living in a given geographic area. This is a very useful catch-all denominator when calculating rates.
id | es.ine.t1_1
hits | 229
name | Total population
id | us.census.acs.B19301001
hits | 179
name | Per Capita Income in the past 12 Months
description | Per capita income is the mean income computed for every man, woman, and child in a particular group. It is derived by dividing the total income of a particular group by the total population.
id | us.census.acs.B19013001
hits | 105
name | Median Household Income in the past 12 Months
description | Within a geographic area, the median income received by every household on a regular basis before payments for personal income taxes, social security, union dues, medicare deductions, etc. It includes income received from wages, salary, commissions, bonuses, and tips; self-employment income from own nonfarm or farm businesses, including proprietorships and partnerships; interest, dividends, net rental income, royalty income, or income from estates and trusts; Social Security or Railroad Retirement income; Supplemental Security Income (SSI); any cash public assistance or welfare payments from the state or local welfare office; retirement, survivor, or disability benefits; and any other sources of income received regularly such as Veterans' (VA) payments, unemployment and/or worker's compensation, child support, and alimony.
id | us.census.acs.B23025004
hits | 76
name | Employed Population
description | The number of civilians 16 years old and over in each geography who either (1) were "at work," that is, those who did any work at all during the reference week as paid employees, worked in their own business or profession, worked on their own farm, or worked 15 hours or more as unpaid workers on a family farm or in a family business; or (2) were "with a job but not at work," that is, those who did not work during the reference week but had jobs or businesses from which they were temporarily absent due to illness, bad weather, industrial dispute, vacation, or other personal reasons. Excluded from the employed are people whose only activity consisted of work around the house or unpaid volunteer work for religious, charitable, and similar organizations; also excluded are all institutionalized people and people on active duty in the United States Armed Forces.
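For reference, the counting step behind these numbers could look roughly like the sketch below. This is not the script that was actually run: it assumes each named map template has already been read out of Redis as a JSON string, and that a measure id appears as a string under a key called `measurement` somewhere in the analysis config. Both the key name and the sample template layout are assumptions for illustration.

```python
import json
from collections import Counter


def iter_values(node, key):
    """Yield every value stored under `key` anywhere in a nested JSON structure."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            else:
                yield from iter_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from iter_values(item, key)


def count_measure_hits(raw_templates, key="measurement"):
    """Count how often each measure id is referenced across all templates."""
    hits = Counter()
    for raw in raw_templates:
        for value in iter_values(json.loads(raw), key):
            if isinstance(value, str):
                hits[value] += 1
    return hits


# Hypothetical templates standing in for what would be fetched from Redis.
templates = [
    json.dumps({"layers": [{"analysis": {"params": {
        "measurement": "us.census.acs.B01003001",
        "source": {"params": {"measurement": "es.ine.t1_1"}},
    }}}]}),
    json.dumps({"analysis": {"params": {"measurement": "us.census.acs.B01003001"}}}),
]

hits = count_measure_hits(templates)
ranking = hits.most_common()  # (id, hits) pairs, most used first
```

Walking the whole structure recursively keeps the sketch independent of how deeply analyses are nested inside a template; joining the ids against the metadata tables for name and description would be a separate step.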
Hope it helps to know a bit more of the current status of DO. Happy weekend
// @saleiva @juanignaciosl @noguerol for awareness
Another round of data:
id,name,hits
us.census.acs.B01003001,Total Population,331
es.ine.t1_1,Total population,274
us.census.acs.B19301001,Per Capita Income in the past 12 Months,192
us.census.acs.B19013001,Median Household Income in the past 12 Months,136
us.census.acs.B23025004,Employed Population,86
Complete file is here
@ethervoid that's awesome.
Is it possible to get the full list? Those are the ones that I would guess are most used but the ones further down the list would be interesting as well.
Ah sorry, my bad. I didn't see the CSV file
I think there are too many variables in the Data Observatory. This makes searching, organizing, and navigating the measurements nearly impossible with the systems we have now (the Catalog and the Builder UI). I think we should consider reducing the number of variables to just a few hundred key variables per country to ensure quality before adding more quantity.
There are thousands of detailed variables that are unnecessary. Brazil and Australia are especially bad. For example there is a variable in Brazil for "Daughters of 38 year-old Guardian and Spouse" (br.data.Pessoa07_V039) ... this is too much detail. There are thousands more that are similar. Just look how massive this page is! http://cartodb.github.io/bigmetadata/br/age_gender.html