Establish canonical names for data

hxu commented 10 years ago

Clean up unintelligible labels
Shorten labels that are too long

Current entities in the database are direct extracts from the census tables. Unfortunately this means we have a lot of unintelligible labels, such as "and Over" or "of Quarters" in the column field. This problem seems to be mostly confined to the columns, as these were most likely to be split over multiple rows in the original spreadsheets.

Also, some of the longer columns/tables/rows should probably be shortened.

This should probably be saved as a mapping/dict of raw values to canonical values, available for use in our import scripts.

hupili commented 10 years ago

The new of naming:

All row names and column names get an identifier. This one should be developer friendly.
Provide dict to translate identifier to natural languages.

See following examples and you'll know what I mean.

http://hupili.net/projects/hk_census/data-clean/areas/A05/table12.json

http://hupili.net/projects/hk_census/data-clean/translation.json

Navigate all: http://hupili.net/projects/hk_census/data-clean/

@hxu @2blam , please comment on new schema.

hupili commented 10 years ago

@2blam , can you extend table_meta_data.py for three languages.

2blam commented 10 years ago

Yes, for sure!

hxu commented 10 years ago

Still not quite sure this is what we need. Here are some reasons why:

Access to data is not any simpler - I'm not sure that the identifiers help here. How is the identifier any different from a cell address ("H21")? The idea is that I shouldn't have to access the original data to understand what the data means. If we don't do the normalization on the backend, then we'll have to do it on the frontend anyway, so might as well do it earlier in the data pipeline.
Data is not clean - there are still a lot of empty values, which probably correspond to empty columns on the original spreadsheets. These were probably there for presentation purposes, which we're trying to strip away.
How would this look as a flat table / CSV? Would you store the identifiers? Then the user would need to cross reference against our translation maps or against the old data.

To give you a more concrete idea of what I'm thinking of, I created a spreadsheet here.

Tab Sample CSV or DB shows what the dump of the database or the csv might look like
Tab Mapping XLS to DB shows some examples of canonical names

Here are some reasons why I think something similar to these examples is what we should aim for:

No excess data - don't have to worry about empty strings, etc. Logic for detecting these dynamically in the front end would be complicated
User friend is also developer friendly - wouldn't you prefer to type ?row=40-44&column=male instead of having to remember ?row=h15_40&column=l6_male?
Useful out of the box - If I can download a CSV like in the first tab, then I can immediately throw that into a pivot table and work with it.

Inevitably, we will need to hard code a lot of these mappings. I don't think there is an automated way to do this clean-up. If we don't do it in the backend, then our frontend code will need to do the conversion, and that will make it more complicated.

Happy to discuss more on this, but I feel pretty strongly about rigorously cleaning the data as early as possible in our pipeline. Also happy to help hard code these, since that will be slightly time consuming.

hupili commented 10 years ago

I think we are on the same direction. The spreadsheet you show is a join of our Datapoint and translation.

For Datapoint table (in form of CSV or NDB), I prefer to store "identifiers" there. Or else, we need to have three set of "canonical names" in my understanding. Also, it's better to pass row= h85_clerical than row=Clerical support workers in URLs. I know we can not remember h85_clerical, but those only happen in the background. Users of our app select items from a drop list, which is already translated based on their languages. The same for App developers. After cleaning the translation table and making option lists, those should be the reference, not the original XLS. As for identifier name, I'm not sure what is a good solution. Cell name alone e.g. h85 suffices. The leading word is put there just to reduce possibility of typo.

With all the materials ready, we can provide 4 versions of single CSV. One is exactly Datapoint in our NDB. Another three are Datapoint joined with three translations (including region, district and area names later). Actually, we can also store 4 versions in NDB if people prefer natural language in the API.

There are some cleaning up to do, but the only big problem left is whether the current translations are canonical enough. translation_fix.py provides the mechanism for manually change the names. Maybe you can show some examples, e.g. Loan Repayment (HK$) to Loan Repayment, and we can discuss.

hxu commented 10 years ago

So you are saying:

Store identifiers in DB
Make downloadable datasets (CSV) a join of the DB and the translation map

I think the only part we disagree is what to use as the identifier. In my view, I had wanted a more readable identifier (you would still have to convert to more friendly presentation on the front end). In your view, you prefer something short and programmatic.

To clarify the names, lets call them these:

Raw name - what is found in the original database
Identifier - what is found in our database
Canonical name - what we present in the front end
Translation map - mapping from identifier to canonical name

What if we had something like a slugify function? See Django for an example. We take our canonical name and slugify it to create the identifier? The identifier would be more human readable, but also be more machine-friendly?

clacanzo commented 10 years ago

I would like to try and have my hand at this issue, if this is ok for you guys. I could try to connect what you have described as "Raw Names" and create "Canonical Names" for the Front End, leaving the Identifier to you guys, whatever is more convenient. Could anybody please direct me to the link to the original database? Shall I also include in the list the extra ones for the public facilities? I have browsed around here to find all the spreadsheets but could not find any……

2blam commented 10 years ago

@hupili table_meta_data.py was extended for three languages. Please check backend branch. BTW, I did not remove the entry about "name" as I scare the some program depends on it.

hupili commented 10 years ago

@hxu It's all clear now with this name clarification. Under this language:

Identifier, subject to further discussion of what to use.
current translation map contains information to map from identifier to Canonical name, and Canonical name there is the same as Raw name (with line-breaking issues fixed in translation_fix.py).

If we agree on this setup, @clacanzo may work on translation map, providing a list of Raw name to Canonical name pairs.

hxu commented 10 years ago

@clacanzo the easiest way to do this would probably be to refer to one of the original spreadsheets, such as this one. Look for any table headings, row names, or column names that can be shortened somehow (see this spreadsheet for examples).

If you can write JSON, then you can write it in the format of this file, or you can just put it into a spreadsheet. The spreadsheet should have a column to indicate which cell the label is from, and then a column for the English, Simplified, and Traditional (leave the last two blank if you cannot read Chinese). You can then put the spreadsheet into our shared folder.

hupili commented 10 years ago

For the identifier, slug was considered. I'm afraid it is too long, so just keep a prefix (currently one leading word). Since one word is not unique, so cell name is also prefixed.

I'm also OK with slug. Anyone else want to comment?

hupili commented 10 years ago

@clacanzo I just put the JSON translation to a Google Spreadsheet for your convenience.

BTW, I suggest we all use Google Spreadsheet directly for demo purpose. Those xlsx files do not render very well on Google Drive. People can also comment on the cells when in doubt.

clacanzo commented 10 years ago

OK @hupili thank you, this is very helpful. Is the list comprehensive? I see there are no indicators/names for all the CAs and geographical areas…. don't you need to create canonical names for those? also, there are a lot of "none"…. is there any way for me to see what the indicator points at in the page from the census website (http://www.census2011.gov.hk/en/district-profiles.html)? I actually already started working on spreadsheets collating all the info from the website, but if your Google Spreadsheet is comprehensive, this will simplify the job enormously!

clacanzo commented 10 years ago

this is what I have done so far on the geographical raw names from the website's spreadsheet…. https://docs.google.com/spreadsheet/ccc?key=0AnMgWbxp_0cVdHBZSVg5Mkw2eElaeVFjZkVTV0xIVVE#gid=0 I am starting with English and then adding Chinese, to see if you like the format. I was planning once I have a "master sorting" file done, to then list all references alphabetically and assign unique canonical names to all data. Please let me know if this is useless!!!

hupili commented 10 years ago

@clacanzo , the google spreadsheet is a full enumeration of all possible column names and row names. You can just ignore the "none" ones. Also, you don't have to put canonical name for everyone. It should be enough to just put down those you want to change (canonicalize).

As to geo-naming issues, we are handling in #5 and it turns out we can actually programmatically generate all the identifiers and canonical names.

In order to locate where each row corresponds in original spreadsheet, consider following example:

0   a69_students    S   学生
1   a69_students    E   Students
2   a69_students    T   學生

You can find S/E/T versions in cell "A69" whose contents are 学生, Students, and 學生, respectively.

@hxu , just recall one use case when I prefix cell name yesterday. The above is one. We need a way to refer to original table position sometimes. Canonical names can be different from raw names, so you may not be able to search in the sheet. Then identifier comes into play. Slug alone may also cause trouble if you want to check back the raw data to see what happens.

hupili commented 10 years ago

@clacanzo , just see your google doc. As noted above, #5 is on it and we actually have complete official naming data (overlooked at on hackathon). So the canonical naming we are discussing regards to table columns and rows, which may contain long and less readable strings. Sorry not making it clear.

clacanzo commented 10 years ago

thanks @hupili. All understood. If the a69 is the cell number in the spreadsheet, then it is easy. Leave it with me. I will stop working on the geographic names then! thanks for help

clacanzo commented 10 years ago

Question: if there is a heading that in the raw data goes over two cells (e.g. A121 and A122) do you want me to reduce it to one cell only or keep it in two cells? The second cell is usually not associated to any value data, so I think it's useless and I am presently leaving it blank, moving all meaning to the first cell. Is this correct?

hupili commented 10 years ago

Ah, you just identified a line-breaking for rows.

This example:

由夫婦、其中至少一個父或母親及其未婚                177
子女所組成                              <nothing here>

It will generate two data points:

dp1: row=由夫婦、其中至少一個父或母親及其未婚, value=177
dp2: row=子女所組成, value=none

dp2 will be filtered out right away in the second stage. So we only have row=由夫婦、其中至少一個父或母親及其未婚, value=177 left. You may want to change raw name 由夫婦、其中至少一個父或母親及其未婚 to canonical 由夫婦、其中至少一個父或母親及其未婚子女所組成. As for raw name 子女所組成, just leave it untouched and it shouldn't appear later.

clacanzo commented 10 years ago

ok thanks, @hupili I am leaving all raw data untouched, just indicating the canonicals!

clacanzo commented 10 years ago

done!

gazetteerhk / census_explorer

Establish canonical names for data #2