gazetteerhk / census_explorer

Explore Hong Kong's neighborhoods through visualizations of census data
http://gazetteer.hk
MIT License
42 stars 12 forks

Streamline the data extraction scripts #3

Closed hxu closed 10 years ago

hxu commented 10 years ago

Currently the data extraction scripts follow these steps:

  • Download individual XLS from the Census website (constituency_area_data.py)
  • Convert XLS to JSON files (extract_data_from_xls_to_json.py)
  • Upload individual JSON files to the database server (upload-all.sh)

Ideally, the extraction process should probably look like this:

  • Download individual XLS
  • Combine and normalize the XLS into a single CSV/XLS -- this can be provided as a download for anyone who wants the full dataset
  • Upload the single CSV/XLS to the database server

I think it is important that the normalization of the data (see #2) be part of the process; otherwise the data we provide is still not as user-friendly as it could be.
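The middle step of that ideal pipeline -- combining the per-area workbooks into one normalized CSV -- might be sketched roughly like this. Everything here (function names, the field list, the assumption that each XLS has already been parsed into row dicts by something like xlrd) is illustrative, not the repo's actual code:

```python
# Hypothetical sketch of the proposed "combine and normalize" step.
# Assumes each downloaded XLS has already been parsed into a list of
# row dicts; field names are invented for illustration.
import csv
import io

FIELDS = ["area", "table", "row", "column", "value"]

def combine_and_normalize(parsed_workbooks):
    """Flatten per-area workbooks into one list of normalized rows."""
    combined = []
    for area, rows in parsed_workbooks.items():
        for row in rows:
            combined.append({
                "area": area,
                "table": row["table"].strip(),
                "row": row["row"].strip(),
                "column": row["column"].strip(),
                "value": row["value"],
            })
    return combined

def write_single_csv(rows):
    """Write the combined rows to one CSV string, ready for download."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

The single output file is then both the download artifact for non-programmers and the input to the upload step.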

2blam commented 10 years ago

I have a suggestion. Instead of updating the current extraction scripts, is it possible to create scripts for generating the JSON/XLS/CSV files, since the current DB now contains all the data points?

hxu commented 10 years ago

We could, but I think we need something that works end to end from scratch -- that is, if at some point we lose the database, our scripts should still be able to work.

I think it would be cleaner to refine the raw XLS to database conversion instead of modifying the existing database.

hxu commented 10 years ago

@hupili @2blam are you planning on working on this? I'd be happy to take this and #2 if you want to work on something else.

2blam commented 10 years ago

I am open. It would be great if I could work with @hupili on this part again.

I am also interested in #2 and #9 (Integrate public facility data).

For #2, I can help prepare the mapping JSON; just let me know what format you want.
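To make the format question concrete, here is one possible shape for the identifier-to-presentation mapping JSON discussed in #2. The keys, labels, and units below are invented examples, not the project's actual canonical names:

```python
# Illustrative sketch of a mapping-JSON format: each canonical
# identifier maps to the presentation strings the frontend would show.
import json

mapping = {
    "median_monthly_income": {
        "label": "Median Monthly Income (HK$)",
        "unit": "HK$",
    },
    "pop_total": {
        "label": "Total Population",
        "unit": "persons",
    },
}

# Stable key order keeps diffs small when the file lives in the repo.
mapping_json = json.dumps(mapping, indent=2, sort_keys=True)
```

Keeping one flat dict keyed by canonical identifier would let the frontend look up labels in O(1) and let reviewers eyeball the wordings in a single file.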

For #9, I can help prepare the data. But I don't know how to load the data into the database on GAE, so I might need help with that part.

hupili commented 10 years ago

I can work on the 2nd step, i.e. from raw xls to combined CSV + canonical names + identifier-to-presentation mappings. The single CSV will correspond to the new data point model (https://github.com/hxu/hk_census_explorer/issues/11#issuecomment-33460913).
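The canonical-names part of this step might work along these lines: collapse the variations the raw spreadsheets use for the same indicator into one stable identifier. The normalization rules here are illustrative guesses, not the actual ones agreed in #2:

```python
# Hypothetical canonical-name normalization for the combined CSV:
# lowercase, drop trailing footnote markers like "(1)", and turn
# runs of non-alphanumerics into underscores, so the same indicator
# always yields the same identifier.
import re

def canonicalize(raw_name):
    name = raw_name.strip().lower()
    name = re.sub(r"\(\d+\)$", "", name)     # drop trailing footnote refs
    name = re.sub(r"[^a-z0-9]+", "_", name)  # non-alphanumerics -> underscores
    return name.strip("_")
```

The identifier-to-presentation mapping then records, for each canonical identifier, the human-readable wording to display.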

hupili commented 10 years ago

For the 3rd step, do you mean to refactor the current upload mechanism, or to use GAE's bulk upload feature?

One question is whether the bulk upload is implemented in a different way than a direct put(). If it is just a web UI in front, it does not buy us much. We can directly deploy with the CSV and use a mechanism similar to tasks/populate_ca. The CSV is estimated to be 5-6 MB.
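A populate-task approach in that spirit could be as simple as reading the deployed CSV and writing entities in batches, so a 5-6 MB file stays within per-request limits. In this sketch, `save_batch` stands in for whatever put()/put_multi() call the datastore layer provides; none of this is the repo's actual task code:

```python
# Rough sketch of a chunked CSV upload, in the spirit of
# tasks/populate_ca: stream rows and flush them in fixed-size batches.
import csv
import io

BATCH_SIZE = 500

def upload_csv(csv_text, save_batch):
    """Parse csv_text and hand rows to save_batch in chunks.

    Returns the total number of rows uploaded.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    batch, uploaded = [], 0
    for row in reader:
        batch.append(row)
        if len(batch) >= BATCH_SIZE:
            save_batch(batch)
            uploaded += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        save_batch(batch)
        uploaded += len(batch)
    return uploaded
```

Injecting `save_batch` keeps the chunking logic testable locally, independent of whichever GAE API ends up doing the writes.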

hxu commented 10 years ago

OK -- the idea behind the raw XLS to combined CSV is that the CSV should be usable for someone who is not a programmer -- I should be able to just download the CSV and start playing with it in Excel or something similar.

For canonical names, can we commit them without repopulating the database? We should get a few eyes on the wordings before we deploy, as this is also a bit of a UX issue.

On new DB model -- sounds good to me.

On upload method -- I have no opinion. If we can do it like tasks.populate_constituency_areas then that may be better, since it would be consistent. If we cannot send the CSV up with the app package, then maybe we can use Cloud Storage or just Github and have the server fetch from there.

Let's focus on #2 and #3, and save #9 for a bit later.

hxu commented 10 years ago

@2blam sorry, I can only assign one person to this, but feel free to work with @hupili if you two can coordinate.

2blam commented 10 years ago

No problem! @hupili, just let me know if there is anything I can help with.

hupili commented 10 years ago

@hxu the online DB will not be touched; it needs to stay up to support gh-pages. I suppose a good time is the next offline meetup, where we can merge things, fix any problems resulting from the distributed modifications, and get everything up.