gazetteerhk / census_explorer

Explore Hong Kong's neighborhoods through visualizations of census data
http://gazetteer.hk
MIT License

Integrate public facility data #9

Closed hxu closed 10 years ago

hxu commented 10 years ago

Standardize and integrate the HK public facility dataset

hxu commented 10 years ago

Looks like this is scriptable. The captcha can be bypassed by directly GETing http://www1.map.gov.hk/gih3/PSI.do?action=downloadFile&filename=csv/BC.csv. The CSV name is embedded in the onclick attribute of the dataset listing page on the map.

Other headers on the request:

GET /gih3/PSI.do?action=downloadFile&filename=csv/AIDED_PRS.csv&authCode=XZIMY9&unit=opt9&purpose=opt8 HTTP/1.1
Host: www1.map.gov.hk
Connection: keep-alive
Cache-Control: no-cache
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Pragma: no-cache
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.46 Safari/537.36
Referer: http://www1.map.gov.hk/gih3/view/index.jsp
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Cookie: JSESSIONID=A2E78CDBC014E49259D3C3900C439D88; kp.egis.gov.hk=rd1o00000000000000000000ffff0a58890co80
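
A minimal Python sketch of the direct download (assuming the endpoint accepts a plain GET with just a Referer and User-Agent, and that the requests library is available):

import requests

# Sketch: fetch one facility CSV directly, skipping the captcha page.
# The filename (e.g. csv/BC.csv) comes from the onclick attribute on the
# dataset listing page.
BASE_URL = "http://www1.map.gov.hk/gih3/PSI.do"
params = {"action": "downloadFile", "filename": "csv/BC.csv"}
headers = {
    "Referer": "http://www1.map.gov.hk/gih3/view/index.jsp",
    "User-Agent": "Mozilla/5.0",
}

resp = requests.get(BASE_URL, params=params, headers=headers)
resp.raise_for_status()
with open("BC.csv", "wb") as f:
    f.write(resp.content)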
clacanzo commented 10 years ago

How would you integrate this data? Just showing the number of facilities per CA, together with all the other stats available from the census? Or do you actually want to show each facility on a map? The tasks are very different, as are the results…

Also, there are a lot of sub-groups (categories) in the public facilities. Doesn't this create a nightmare when integrating the data into the database? Should we think about re-grouping them into fewer sections, or just consider the more relevant ones?

hxu commented 10 years ago

The use case is to be able to calculate statistics like the number of facilities of each type per constituency area.

It is actually a bit more complicated, since people travel to other areas to use these services, so we probably want to do something like calculating the number of facilities within a certain distance from a CA.

Once we have these statistics, we can look for correlations between educational attainment, family income, etc.

You are correct that there are a few classes of public facilities that we may want to focus on. This would be up to whoever takes on this task to assess (e.g., we probably don't care about "Vehicle Examination Centers" for now).
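
A rough sketch of the distance-based count (assuming WGS84 coordinates for facilities and a pre-computed CA centroid; the radius is an arbitrary placeholder):

from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in km between two WGS84 points.
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def facilities_within(ca_centroid, facilities, radius_km=1.0):
    # Count facilities within radius_km of a CA centroid.
    # ca_centroid: (lat, lon); facilities: iterable of (lat, lon) pairs.
    lat0, lon0 = ca_centroid
    return sum(1 for lat, lon in facilities
               if haversine_km(lat0, lon0, lat, lon) <= radius_km)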

2blam commented 10 years ago

@hxu I can help to prepare the CSV data and convert the coordinates to WGS 1984

2blam commented 10 years ago

I uploaded the public facilities data (original CSV files; WGS84 GeoJSON files) to Google Drive. Please check.

hxu commented 10 years ago

Nice, looks good. Are you going to keep working on this? If so, we should probably talk about the strategy for collecting these into a single table or something.

Thanks!


2blam commented 10 years ago

Yes, let's see if I can help. For your information, each CSV file has a different number of columns. It seems that columns A to J are consistent across the 79 CSV files. I need to write a script to verify that; a quick sketch is below.
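
Something like this would do the check (the directory name and encoding are assumptions):

import csv
import glob

# Sketch: verify that the first ten header columns (A..J) agree across all CSVs.
reference = None
for path in sorted(glob.glob("csv_files/*.csv")):   # assumed directory
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f))[:10]
    if reference is None:
        reference = header
    elif header != reference:
        print(path, "differs:", header)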

hxu commented 10 years ago

Here is the script for downloading the files: https://github.com/2blam/HK-Geo-referenced-Public-Facility-Data. It uses OCR to pass the captcha and takes approximately 1 hour to run.

hupili commented 10 years ago

:+1: @2blam, I didn't know the download scripts had already been released for a while. I have the following proposal for merging the data.

The idea is to emulate another "table" in the original sense. Then we can seamlessly merge it with existing datapoints and start pivoting using the current backend API.

I suggest outputting the following table:

area    table      row            column    value
-------------------------------------------------
a01     facility   has_library    value     1
a02     facility   has_library    value     0
...
a01     facility   num_wifi       value     10
a02     facility   num_wifi       value     12
...
a01     facility   num_park       value     10
a02     facility   num_park       value     12
...

The "column" column is just a dummy placeholder. As to type of facility, we can distinguish by "row" or by "table". Both should be equally convenient to application. The exact location of a facility does not matter too much. Our unit geo-location is area. So we can pre-process them in to numbers for each area. As to region and district columns of datapoints, we can easily merge later.

I think @2blam can continue to do the pre-processing in 2blam/HK-Geo-referenced-Public-Facility-Data, since this is a different data set. I can submodule the repo after it is finished and merge those data points.

What do you think?

2blam commented 10 years ago

The public facility data was combined into a single Excel file yesterday. Also, the data was converted into GeoJSON format. You can have a look at the backend_pub_facility branch.

Regarding the proposed table, the pre-processed information can save query time (shortening response time) and enables another way to visualize the information (a pie chart?). I think it is worth trying.

If we go ahead with this, then I will try to write a script to determine whether a public facility falls into a particular constituency area.

hxu commented 10 years ago

@hupili Are you saying that instead of storing each individual facility, we store only the aggregate counts? In other words, given a CSV of all of the facilities:

For each area:
    For each facility type:
        Count the number of facilities of this type in this area
        Create a single row in the database where the "row" value is the facility type

I think this is OK for our immediate use. Eventually we may want to provide a facilities API that preserves as much of the original data as possible, much as we did for the census data.
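
To make the aggregation concrete, a hedged sketch (it assumes each facility record carries an area code in CACODE and a category in ENGLISH CATEGORY; the input filename is hypothetical):

import csv
from collections import Counter

# Count facilities per (constituency area, facility type).
counts = Counter()
with open("all_facilities.csv", newline="", encoding="utf-8") as f:
    for rec in csv.DictReader(f):
        counts[(rec["CACODE"], rec["ENGLISH CATEGORY"])] += 1

# Emit rows in the emulated "table" shape proposed above.
rows = [
    (area, "facility", "num_" + category.lower().replace(" ", "_"), "value", n)
    for (area, category), n in sorted(counts.items())
]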

2blam commented 10 years ago

I uploaded all_pub_facility_with_CACODE.json to Google Drive. Please check.

Here is the list of properties for each public facility in the JSON file: ENGLISH ADDRESS, TELEPHONE, FAX NUMBER, OPENING HOURS, 中文地址 (Chinese address), ENGLISH NAME, EMAIL ADDRESS, 中文名稱 (Chinese name), WEBSITE, ENGLISH CATEGORY, 中文類別 (Chinese category), CACODE

I need some time to tidy up my script and will share it in the backend_pub_facility branch later. For your information, I adopted [2] to determine whether a point is located in a particular polygon (see the sketch after the references).

References:

[1] http://www.census2011.gov.hk/pdf/maps/Map_KC.pdf
[2] http://www.ecse.rpi.edu/Homepages/wrf/Research/Short_Notes/pnpoly.html
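
For reference, a direct Python port of the pnpoly ray-casting test in [2] looks roughly like this:

def point_in_polygon(x, y, poly):
    # Ray-casting (pnpoly) test: returns True if (x, y) lies inside poly,
    # given as a list of (x, y) vertex tuples.
    inside = False
    j = len(poly) - 1
    for i in range(len(poly)):
        xi, yi = poly[i]
        xj, yj = poly[j]
        if ((yi > y) != (yj > y)) and (x < (xj - xi) * (y - yi) / (yj - yi) + xi):
            inside = not inside
        j = i
    return inside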

hupili commented 10 years ago

@2blam, it looks great. Since there are only 9K data points, which is small compared with the original table, we can keep them all and use the current API to aggregate them (adding a count aggregator). I can manage the conversion from your GeoJSON to the DB.

2blam commented 10 years ago

Two scripts (processPubFacilityCSV.py and addCACODEProperty.py) were added in the backend_pub_facility branch. I found that row 187 of LIBRARY_LCSD_20131213.csv has a problem: the English name of that entry contains an extra \t and a duplicated name. I fixed this error and zipped all the CSV files again as csv_files_err_fixed.zip. You can find this zip file in Google Drive (hk_pub_facility/csv/).

hupili commented 10 years ago

@2blam, is it possible to streamline all the operations from download to final generation, like what data_preparation.py does? For those that need a manual fix, we can use a strategy similar to translation_fix.py.

2blam commented 10 years ago

One of the major problems is the step of using QGIS to convert the CSVs (HK Grid coordinates) to GeoJSON (GPS coordinates). At the moment, this step needs to be done manually. If we can find a command to do this conversion, then we could create an automated script from download to final generation.
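
As an aside, the reprojection itself can be done in Python; a sketch, assuming the pyproj library is available:

from pyproj import Transformer

# EPSG:2326 is the Hong Kong 1980 Grid; EPSG:4326 is WGS84.
transformer = Transformer.from_crs("EPSG:2326", "EPSG:4326", always_xy=True)

def hk1980_to_wgs84(easting, northing):
    # Returns (longitude, latitude) for an HK1980 grid coordinate pair.
    return transformer.transform(easting, northing)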


hxu commented 10 years ago

This shell command will do it, but you need GDAL/OGR installed:

ogr2ogr -f "GeoJSON" OUTPUT_FILE INPUT_FILE -t_srs EPSG:4326 -s_srs EPSG:2326

EPSG:2326 is the Hong Kong 1980 Grid projection, and EPSG:4326 is WGS84.

Oops, I just realized it's only for shapefiles. I think I remember seeing a similar command somewhere for CSVs, though; let me check.

hxu commented 10 years ago

Apparently ogr2ogr can be used to read CSV files, but you need to generate a VRT file and reformat the coordinates in the CSV to well-known text. See this question (http://gis.stackexchange.com/questions/24947/how-can-i-convert-a-csv-file-of-wkt-data-to-a-shape-file-using-ogr2ogr) and the docs it links to: [1] http://www.gdal.org/ogr/drv_vrt.html, [2] http://www.gdal.org/ogr/drv_csv.html

2blam commented 10 years ago

Wow, thanks for the reference. Will have a look at this.


2blam commented 10 years ago

This command works: ogr2ogr -f GeoJSON OUTPUT_FILENAME -s_srs "EPSG:2326" PATH_TO_VRT_FILE -t_srs "EPSG:4326"

Will update the scripts accordingly.
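
For the record, the remaining manual piece (writing the VRT) is also scriptable; a hedged sketch, assuming the CSVs store the grid coordinates in EASTING/NORTHING columns:

import os
import subprocess

# Minimal VRT wrapper for one CSV. Per the OGR VRT driver, the layer name
# must match the CSV's basename; the coordinate column names are assumptions.
VRT_TEMPLATE = """<OGRVRTDataSource>
  <OGRVRTLayer name="{name}">
    <SrcDataSource>{csv_path}</SrcDataSource>
    <GeometryType>wkbPoint</GeometryType>
    <GeometryField encoding="PointFromColumns" x="EASTING" y="NORTHING"/>
  </OGRVRTLayer>
</OGRVRTDataSource>
"""

def csv_to_geojson(csv_path, out_path):
    name = os.path.splitext(os.path.basename(csv_path))[0]
    vrt_path = name + ".vrt"
    with open(vrt_path, "w") as f:
        f.write(VRT_TEMPLATE.format(name=name, csv_path=csv_path))
    # Same ogr2ogr invocation as above, pointed at the generated VRT.
    subprocess.check_call([
        "ogr2ogr", "-f", "GeoJSON", out_path,
        "-s_srs", "EPSG:2326", vrt_path, "-t_srs", "EPSG:4326",
    ])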

2blam commented 10 years ago

I combined the conversion scripts into a single Python file. Please check the backend_pub_facility branch. Thanks.

hxu commented 10 years ago

OK, so I just reviewed the code here, and I think the strategy I will take is to pre-summarize to get per constituency area counts of various facility types. So the new "table" (in the sense of the census tables in our source data) will look like the one @hupili proposed above, with per-area counts as the values.

@hupili proposed storing the point coordinates as the value, but I don't think we get much benefit from storing the point coordinates without also capturing the metadata (facility name, etc.), since we'd likely want to show those when mapping. Storing just the counts is a shortcut that requires basically no changes to the frontend, which means we can get this feature out the door before the end of the week.

I will write the integration code so that if the facilities data is available, it'll be added to the generation of the various data files.
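
A hypothetical sketch of that conditional hook (all helper names here are placeholders, not actual functions in the repo):

import os

FACILITY_COUNTS = "data/facility_counts.csv"  # assumed path

def generate_data_files():
    tables = load_census_tables()        # placeholder for the existing loader
    if os.path.exists(FACILITY_COUNTS):
        # Fold the facility counts in as one more "table".
        tables["facility"] = load_facility_counts(FACILITY_COUNTS)  # placeholder
    write_data_files(tables)             # placeholder for the existing writer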