Reduce number of variables

michellemho commented 7 years ago

I think there are too many variables in the Data Observatory. This makes searching, organizing, and navigating the measurements nearly impossible with the systems we have now (the Catalog and the Builder UI). I think we should consider reducing the number of variables to just a few hundred key variables per country to ensure quality before adding more quantity.

There are thousands of detailed variables that are unnecessary. Brazil and Australia are especially bad. For example, there is a variable in Brazil for "Daughters of 38 year-old Guardian and Spouse" (br.data.Pessoa07_V039), which is far too much detail. There are thousands more like it. Just look at how massive this page is! http://cartodb.github.io/bigmetadata/br/age_gender.html

michellemho commented 7 years ago

This reduction issue should be fixed before tackling the names and descriptions (described in https://github.com/CartoDB/bigmetadata/issues/168).

michellemho commented 7 years ago

I'll need more feedback on this, but I believe users are most interested in getting total population, population breakdowns by age and gender, median income, and maybe just a handful of other variables.

John hand-curated the variables available for the United States from the American Community Survey (ACS). There are thousands of variables to choose from, but he only included a few hundred. These names and descriptions were manually written into the ETL process in the ColumnsTask of the acs.py file.

juanignaciosl commented 7 years ago

Yep, we can probably begin with a set of generic values that are available for most countries and leave the most specific ones aside. In the future, we might want to give users the option of adding those to their accounts only if they're interested in them.

michellemho commented 7 years ago

@juanignaciosl @saleiva @stuartlynn @ethervoid @javitonino @kevin-reilly This is my first pass at a list of bare minimum variables + key variables for the Data Observatory. The main purposes are: 1) consistency across countries and 2) better names and descriptions. When we start adding censuses, we should align as closely as possible with these variables first. I'll use this list to "prune" the existing Data Observatory (especially Australia, Canada, and Brazil). If other variables exist and are available, they should be added on an ad-hoc basis by country.

Minimum demographic variables for all countries:

Age & Gender

Total Population
Male Population
Female Population
Population by age groups (varies country to country)
Population by age groups and gender (varies country to country)

Households (Families)

Total households

Housing

Total housing units

Key demographic variables (availability varies by country)

Households (Families)

Average household size
Number of households by size
Number of people by marriage status (single, married, divorced, separated, widowed)
Number of households or families with children

Housing

Occupied housing
Owner-occupied housing
Renter-occupied housing
Vacant housing
Number of housing units by type (apartment, semi-attached, etc.)
Number of housing units by year built
Number of housing units by size (1 bedroom, 2 bedroom, etc.)

Income

Median household income
Number of people in poverty or receiving public assistance

Employment

Economically active population
Employed population
Unemployed population
Economically inactive population

Education

Number of enrolled students by level
Number of people by educational attainment

Nationality

Population by place of birth

Race and Ethnicity

Population by race and ethnicity groups

Religion

Number of people by religion

Language

Number of people by language spoken at home

Commerce & Economy

Number of businesses by industry

Health

Life expectancy
Birth rate
Death or mortality rate
Number of people with health insurance by type

kevin-reilly commented 7 years ago

One thing @stuartlynn and I just spoke about was introducing a "public" tag to the data in the DO. The dataset Michelle lists above would be "public" and everything else would be private. This would allow us to ingest any data we may want for internal purposes but only publish certain sets to Builder users.

This would improve UI performance and, I think, allow us to do some of the smarter filtering we wanted to do.
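A minimal sketch of how such a tag could work, assuming each measurement carries a boolean `public` flag that defaults to private. The `Measurement` class and `visible_to_builder` function are hypothetical names used only for illustration; they are not part of bigmetadata or the DO schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    # Hypothetical record; field names are illustrative, not the DO schema.
    id: str
    name: str
    public: bool = False  # private unless explicitly tagged as public


def visible_to_builder(measurements):
    """Return only the measurements tagged as public."""
    return [m for m in measurements if m.public]


catalog = [
    Measurement("us.census.acs.B01003001", "Total Population", public=True),
    Measurement("br.data.Pessoa07_V039",
                "Daughters of 38 year-old Guardian and Spouse"),
]

for m in visible_to_builder(catalog):
    print(m.id, m.name)
```

Defaulting to private means newly ingested variables stay internal until someone deliberately publishes them.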

(cc: @noguerol)

saleiva commented 7 years ago

Do we have any data or log about the actual consumption of measurements?

ethervoid commented 7 years ago

@saleiva AFAIK we don't have metrics about the consumption of measurements. We have to add them to the DS metrics in order to be able to query them.

As a temporary solution, we could go through the named maps in Redis and gather the data from the analysis config.
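A rough sketch of the counting step, assuming the analysis configs are already available as JSON strings and that each measure id sits under a single key somewhere in the nested config. The key name `measure_id`, the function names, and the sample configs are all made up for illustration; the real named-map layout in Redis may look different, and reading from Redis is left out.

```python
import json
from collections import Counter


def iter_measure_ids(node, key):
    """Yield every string stored under `key` anywhere in a nested config."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key and isinstance(v, str):
                yield v
            else:
                yield from iter_measure_ids(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from iter_measure_ids(item, key)


def count_measure_hits(raw_configs, key="measure_id"):
    # `measure_id` is a placeholder key; the real analysis config may differ.
    hits = Counter()
    for raw in raw_configs:
        hits.update(iter_measure_ids(json.loads(raw), key))
    return hits


# Two made-up configs standing in for values read from Redis.
sample = [
    '{"analysis": {"params": {"measure_id": "us.census.acs.B01003001"}}}',
    '{"layers": [{"params": {"measure_id": "us.census.acs.B01003001"}},'
    ' {"params": {"measure_id": "es.ine.t1_1"}}]}',
]

hits = count_measure_hits(sample)
for measure_id, n in hits.most_common():
    print(measure_id, n)
```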

ethervoid commented 7 years ago

As a stopgap, I've written a script and gathered some stats from the named maps in Redis. Here you have a CSV file with the id, name, description, and number of uses of each measure in analyses in production.

The top five most used measures in analyses are (in descending order):

id          | us.census.acs.B01003001
hits        | 299
name        | Total Population
description | The total number of all people living in a given geographic area.  This is a very useful catch-all denominator when calculating rates.

id          | es.ine.t1_1
hits        | 229
name        | Total population

id          | us.census.acs.B19301001
hits        | 179
name        | Per Capita Income in the past 12 Months
description | Per capita income is the mean income computed for every man, woman, and child in a particular group. It is derived by dividing the total income of a particular group by the total population.

id          | us.census.acs.B19013001
hits        | 105
name        | Median Household Income in the past 12 Months
description | Within a geographic area, the median income received by every household on a regular basis before payments for personal income taxes, social security, union dues, medicare deductions, etc.  It includes income received from wages, salary, commissions, bonuses, and tips; self-employment income from own nonfarm or farm businesses, including proprietorships and partnerships; interest, dividends, net rental income, royalty income, or income from estates and trusts; Social Security or Railroad Retirement income; Supplemental Security Income (SSI); any cash public assistance or welfare payments from the state or local welfare office; retirement, survivor, or disability benefits; and any other sources of income received regularly such as Veterans' (VA) payments, unemployment and/or worker's compensation, child support, and alimony.

id          | us.census.acs.B23025004
hits        | 76
name        | Employed Population
description | The number of civilians 16 years old and over in each geography who either (1) were "at work," that is, those who did any work at all during the reference week as paid employees, worked in their own business or profession, worked on their own farm, or worked 15 hours or more as unpaid workers on a family farm or in a family business; or (2) were "with a job but not at work," that is, those who did not work during the reference week but had jobs or businesses from which they were temporarily absent due to illness, bad weather, industrial dispute, vacation, or other personal reasons. Excluded from the employed are people whose only activity consisted of work around the house or unpaid volunteer work for religious, charitable, and similar organizations; also excluded are all institutionalized people and people on active duty in the United States Armed Forces.

Hope this helps give a clearer picture of the current status of the DO. Happy weekend!

// @saleiva @juanignaciosl @noguerol for awareness

ethervoid commented 6 years ago

Another round of data:

Top five:

id,name,hits
us.census.acs.B01003001,Total Population,331
es.ine.t1_1,Total population,274
us.census.acs.B19301001,Per Capita Income in the past 12 Months,192
us.census.acs.B19013001,Median Household Income in the past 12 Months,136
us.census.acs.B23025004,Employed Population,86
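For anyone wanting to compare the two rounds, here is a small sketch that parses the CSV above and subtracts the hit counts from the earlier comment. The variable names are illustrative; the numbers are copied from this thread.

```python
import csv
import io

# Second round, as posted above.
second_round = """id,name,hits
us.census.acs.B01003001,Total Population,331
es.ine.t1_1,Total population,274
us.census.acs.B19301001,Per Capita Income in the past 12 Months,192
us.census.acs.B19013001,Median Household Income in the past 12 Months,136
us.census.acs.B23025004,Employed Population,86
"""

# First round hit counts, from the earlier comment.
first_round_hits = {
    "us.census.acs.B01003001": 299,
    "es.ine.t1_1": 229,
    "us.census.acs.B19301001": 179,
    "us.census.acs.B19013001": 105,
    "us.census.acs.B23025004": 76,
}

rows = list(csv.DictReader(io.StringIO(second_round)))

# Increase in hits per measure between the two rounds.
growth = {r["id"]: int(r["hits"]) - first_round_hits[r["id"]] for r in rows}

for measure_id, delta in sorted(growth.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{measure_id}: +{delta}")
```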

Complete file is here

stuartlynn commented 6 years ago

@ethervoid that's awesome.

Is it possible to get the full list? Those are the ones I would guess are most used, but the ones further down the list would be interesting as well.

stuartlynn commented 6 years ago

Ah sorry, my bad. I didn't see the CSV file.
