CartoDB / bigmetadata

Reduce number of variables #169

Open michellemho opened 7 years ago

michellemho commented 7 years ago

I think there are too many variables in the Data Observatory. This makes searching, organizing, and navigating the measurements nearly impossible with the systems we have now (the Catalog and the Builder UI). I think we should consider reducing the number of variables to just a few hundred key variables per country to ensure quality before adding more quantity.

There are thousands of detailed variables that are unnecessary. Brazil and Australia are especially bad. For example, there is a variable in Brazil for "Daughters of 38 year-old Guardian and Spouse" (br.data.Pessoa07_V039), which is too much detail. There are thousands more that are similar. Just look at how massive this page is! http://cartodb.github.io/bigmetadata/br/age_gender.html

michellemho commented 7 years ago

This reduction issue should be fixed before tackling the names and descriptions (which is described in this issue https://github.com/CartoDB/bigmetadata/issues/168)

michellemho commented 7 years ago

I'll need more feedback on this, but I believe users are most interested in getting total population, population breakdowns by age and gender, median income, and maybe just a handful of other variables.

John hand-curated the variables available for the United States from the American Community Survey (ACS). There are thousands of variables to choose from, but he included only a few hundred. These names and descriptions were manually written into the ETL process in the ColumnsTask of the acs.py file.

juanignaciosl commented 7 years ago

Yep, we can probably begin with a set of generic values that are available for most of the countries and leave the most specific ones aside. In the future, we might want to give users the option to add the specific ones to their accounts if they're interested in them.

michellemho commented 7 years ago

@juanignaciosl @saleiva @stuartlynn @ethervoid @javitonino @kevin-reilly This is my first pass at a list of bare minimum variables + key variables for the Data Observatory. The main purposes are: 1) consistency across countries and 2) better names and descriptions. When we start adding in censuses, we should align as closely as possible to these variables first. I'll use this list to "prune" the existing Data Observatory (especially Australia, Canada, and Brazil). If other variables exist and are available, they should be added on an ad-hoc basis by country.

Minimum demographic variables for all countries:

Age & Gender

Households (Families)

Housing

Key demographic variables (availability varies by country)

Households (Families)

Housing

Income

Employment

Education

Nationality

Race and Ethnicity

Religion

Language

Commerce & Economy

Health

kevin-reilly commented 7 years ago

One thing @stuartlynn and I just spoke about was introducing a "public" tag to the data in the DO. The dataset Michelle lists above would be "public" and everything else would be private. This would allow us to ingest any data we may want for internal purposes but only publish certain sets to Builder users.

This would improve UI performance and, I think, allow us to do some of the smarter filtering we wanted to do.
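The tagging idea can be sketched roughly as follows. This is only an illustration: the `Measure` record, its `tags` field, and the `published` helper are hypothetical names, not the actual bigmetadata schema. The two ids are taken from this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measure:
    """Hypothetical metadata record; not the real bigmetadata column model."""
    id: str
    name: str
    tags: frozenset = frozenset()


def published(measures, tag="public"):
    """Return only the measures carrying the given tag (assumed to be 'public')."""
    return [m for m in measures if tag in m.tags]


catalog = [
    Measure("us.census.acs.B01003001", "Total Population", frozenset({"public"})),
    # Ingested for internal use, but left untagged so it is not published.
    Measure("br.data.Pessoa07_V039", "Daughters of 38 year-old Guardian and Spouse"),
]

visible_ids = [m.id for m in published(catalog)]
```

With this shape, everything stays ingested and only the tagged subset is handed to the Catalog and Builder.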

(cc: @noguerol)

saleiva commented 7 years ago

Do we have any data or log about the actual consumption of measurements?

ethervoid commented 7 years ago

@saleiva AFAIK we don't have metrics about the consumption of measurements. We have to add them to the DS metrics in order to be able to query them.

As a temporary solution we could go through the named maps in Redis and gather the data from the analysis config.
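A minimal sketch of that counting step, assuming each named map template is available as a JSON string and that an analysis node stores the measure id under a key called `measurement`. Both the key name and the template shape are assumptions for illustration; how the templates are read out of Redis is not shown because the key layout isn't described here.

```python
import json
from collections import Counter


def iter_measure_ids(node, key="measurement"):
    """Yield every string stored under `key` anywhere in a nested JSON value.

    `key` is a hypothetical field name; adjust it to the real analysis config.
    """
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key and isinstance(value, str):
                yield value
            else:
                yield from iter_measure_ids(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from iter_measure_ids(item, key)


def count_measure_hits(raw_templates, key="measurement"):
    """Count how often each measure id appears across the given templates."""
    hits = Counter()
    for raw in raw_templates:
        try:
            template = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip entries that are not valid JSON
        hits.update(iter_measure_ids(template, key))
    return hits
```

Calling `count_measure_hits(templates).most_common(5)` would then give a top-five list of ids with their hit counts.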

ethervoid commented 7 years ago

As a stopgap I've made a script and gathered some stats from named maps in Redis. Here you have a CSV file with the id, name, description, and number of uses of each measure in analyses in production.

The top five most used measures in analyses are (in descending order):

id          | us.census.acs.B01003001
hits        | 299
name        | Total Population
description | The total number of all people living in a given geographic area.  This is a very useful catch-all denominator when calculating rates.
id          | es.ine.t1_1
hits        | 229
name        | Total population
id          | us.census.acs.B19301001
hits        | 179
name        | Per Capita Income in the past 12 Months
description | Per capita income is the mean income computed for every man, woman, and child in a particular group. It is derived by dividing the total income of a particular group by the total population.
id          | us.census.acs.B19013001
hits        | 105
name        | Median Household Income in the past 12 Months
description | Within a geographic area, the median income received by every household on a regular basis before payments for personal income taxes, social security, union dues, medicare deductions, etc.  It includes income received from wages, salary, commissions, bonuses, and tips; self-employment income from own nonfarm or farm businesses, including proprietorships and partnerships; interest, dividends, net rental income, royalty income, or income from estates and trusts; Social Security or Railroad Retirement income; Supplemental Security Income (SSI); any cash public assistance or welfare payments from the state or local welfare office; retirement, survivor, or disability benefits; and any other sources of income received regularly such as Veterans' (VA) payments, unemployment and/or worker's compensation, child support, and alimony.
id          | us.census.acs.B23025004
hits        | 76
name        | Employed Population
description | The number of civilians 16 years old and over in each geography who either (1) were "at work," that is, those who did any work at all during the reference week as paid employees, worked in their own business or profession, worked on their own farm, or worked 15 hours or more as unpaid workers on a family farm or in a family business; or (2) were "with a job but not at work," that is, those who did not work during the reference week but had jobs or businesses from which they were temporarily absent due to illness, bad weather, industrial dispute, vacation, or other personal reasons. Excluded from the employed are people whose only activity consisted of work around the house or unpaid volunteer work for religious, charitable, and similar organizations; also excluded are all institutionalized people and people on active duty in the United States Armed Forces.

Hope it helps to know a bit more of the current status of DO. Happy weekend

// @saleiva @juanignaciosl @noguerol for awareness

ethervoid commented 6 years ago

Another round of data:

Top five:

id,name,hits
us.census.acs.B01003001,Total Population,331
es.ine.t1_1,Total population,274
us.census.acs.B19301001,Per Capita Income in the past 12 Months,192
us.census.acs.B19013001,Median Household Income in the past 12 Months,136
us.census.acs.B23025004,Employed Population,86

Complete file is here
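To compare this round with the earlier one, something like the following could work. It is only a sketch: the figures are the top-five numbers posted in this thread, and the helper names `load_hits` and `hit_deltas` are made up for illustration.

```python
import csv
import io

# Top-five hits from the earlier round, as posted in this thread.
ROUND_1_HITS = {
    "us.census.acs.B01003001": 299,
    "es.ine.t1_1": 229,
    "us.census.acs.B19301001": 179,
    "us.census.acs.B19013001": 105,
    "us.census.acs.B23025004": 76,
}

# Top-five rows from this round, in the same id,name,hits layout.
ROUND_2_CSV = """id,name,hits
us.census.acs.B01003001,Total Population,331
es.ine.t1_1,Total population,274
us.census.acs.B19301001,Per Capita Income in the past 12 Months,192
us.census.acs.B19013001,Median Household Income in the past 12 Months,136
us.census.acs.B23025004,Employed Population,86
"""


def load_hits(csv_text):
    """Parse id,name,hits rows into a {measure id: hits} mapping."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["id"]: int(row["hits"]) for row in reader}


def hit_deltas(previous, current):
    """Return the change in hits per measure id; ids missing before count from 0."""
    return {mid: current[mid] - previous.get(mid, 0) for mid in current}


deltas = hit_deltas(ROUND_1_HITS, load_hits(ROUND_2_CSV))
```

The same two helpers would apply unchanged to the complete CSV files.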

stuartlynn commented 6 years ago

@ethervoid that's awesome.

Is it possible to get the full list? Those are the ones that I would guess are most used but the ones further down the list would be interesting as well.

stuartlynn commented 6 years ago

Ah sorry, my bad. I didn't see the CSV file