Closed hxu closed 10 years ago
Looks like this is scriptable. The captcha can be bypassed by directly GETting http://www1.map.gov.hk/gih3/PSI.do?action=downloadFile&filename=csv/BC.csv. The CSV name is embedded in the onclick attribute of the dataset listing page on the map.
Other headers on the request:
GET /gih3/PSI.do?action=downloadFile&filename=csv/AIDED_PRS.csv&authCode=XZIMY9&unit=opt9&purpose=opt8 HTTP/1.1
Host: www1.map.gov.hk
Connection: keep-alive
Cache-Control: no-cache
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Pragma: no-cache
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.46 Safari/537.36
Referer: http://www1.map.gov.hk/gih3/view/index.jsp
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Cookie: JSESSIONID=A2E78CDBC014E49259D3C3900C439D88; kp.egis.gov.hk=rd1o00000000000000000000ffff0a58890co80
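A minimal sketch of such a scripted download, using only the Python standard library. The endpoint and parameters are those observed above; treating authCode as optional is an assumption (the server may not validate it for direct GETs):

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

BASE_URL = "http://www1.map.gov.hk/gih3/PSI.do"

def build_download_url(csv_name, auth_code=None):
    """Build the direct-download URL for one facility CSV, e.g. 'csv/BC.csv'.

    auth_code is optional here on the assumption that the server does not
    actually check it for direct GETs.
    """
    params = {"action": "downloadFile", "filename": csv_name}
    if auth_code:
        params["authCode"] = auth_code
    # safe="/" keeps the slash in "csv/BC.csv" unescaped, as in the observed URL
    return BASE_URL + "?" + urlencode(params, safe="/")

def download_csv(csv_name, out_path):
    # Network call: fetch the CSV and save it locally.
    urlretrieve(build_download_url(csv_name), out_path)
```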
How would you integrate this data? Just showing the number of facilities per CA, together with all the other stats available from the census? Or do you actually want to show each facility on a map? The tasks are very different, as the results are…
Also, there are a lot of sub-groups (= categories) in the public facilities. Doesn't this create a nightmare when integrating the data into the database? Should we think about re-grouping them into fewer sections, or just consider the more relevant ones?
The use case is to be able to calculate things like the following statistics:
It is actually a bit more complicated, since people travel to other areas to use these services, so we probably want to do something like calculating the number of facilities within a certain distance from a CA.
Once we have these statistics, we can look for correlations between educational attainment, family income, etc.
You are correct that there are a few classes of public facilities that we may want to focus on. This would be up to whoever takes on this task to assess (e.g. we probably don't care about "Vehicle Examination Centers" for now).
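The "facilities within a certain distance of a CA" statistic could be sketched like this, assuming WGS84 coordinates and hypothetical helper names (real CA geometries would come from the census boundary data, here reduced to a centroid):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two WGS84 points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def facilities_near(centroid, facilities, radius_km):
    """Count facilities within radius_km of a CA centroid.

    centroid: (lat, lon); facilities: iterable of (lat, lon) points.
    """
    lat0, lon0 = centroid
    return sum(1 for lat, lon in facilities
               if haversine_km(lat0, lon0, lat, lon) <= radius_km)
```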
@hxu I can help to prepare the CSV data and convert the coordinates to WGS 1984.
I uploaded the public facilities data (original csv files; WGS84 GeoJSON files) to google drive. Please check.
Nice, looks good. Are you going to keep working on this? If so, we should probably talk about the strategy for collecting these into a single table or something.
Thanks!
On Mon, Feb 3, 2014 at 12:12 PM, 2blam notifications@github.com wrote:
Yes, let me see if I can help. For your information, each csv file has a different number of columns. It seems that columns A to J are consistent among the 79 csv files; I need to write a script to verify whether that's true.
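A quick sketch of that verification, with a hypothetical helper that reports how many leading columns all files share (a result of 10 would confirm columns A to J are consistent):

```python
import csv

def shared_leading_columns(headers):
    """Given one header row per csv file, return how many leading columns
    are identical across all of them (10 would mean A..J are consistent)."""
    if not headers:
        return 0
    shared = 0
    for i in range(min(len(h) for h in headers)):
        if len({h[i] for h in headers}) == 1:
            shared += 1
        else:
            break
    return shared

def read_header(path):
    # Read just the first row of a csv file.
    with open(path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f))
```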
Here is the script for downloading the files: https://github.com/2blam/HK-Geo-referenced-Public-Facility-Data. It uses OCR to bypass the captcha and takes approximately 1 hour to run.
:+1: @2blam , I didn't know the download scripts had already been released for a while. I have the following proposal for merging the data.
The idea is to emulate another "table" in the original sense. Then we can seamlessly merge it with the existing datapoints and start pivoting using the current backend API.
I suggest outputting the following table:
area  table     row          column  value
------------------------------------------
a01   facility  has_library  value   1
a02   facility  has_library  value   0
...
a01   facility  num_wifi     value   10
a02   facility  num_wifi     value   12
...
a01   facility  num_park     value   10
a02   facility  num_park     value   12
...
The "column" column is just a dummy placeholder. As to the type of facility, we can distinguish by "row" or by "table"; both should be equally convenient for the application. The exact location of a facility does not matter too much: our unit of geo-location is the area, so we can pre-process the facilities into per-area counts. As for the region and district columns of datapoints, we can easily merge those later.
I think @2blam can continue to do the pre-processing on 2blam/HK-Geo-referenced-Public-Facility-Data , since this is a different data set. I can submodule the repo after it is finished and merge those data points.
What do you think?
The public facility data was combined into a single Excel file yesterday, and the data was also converted into GeoJSON format. You can have a look at the backend_pub_facility branch.
Regarding the proposed table: the pre-processed information can save query time (shorter response times) and allows another approach to visualizing the information (a pie chart?), so I think it is worth trying.
If we go ahead with this, I will try to write a script to determine whether a public facility falls within a particular constituency area.
@hupili Are you saying that instead of storing each individual facility, we store only the aggregate counts? In other words, given a CSV of all of the facilities:
For each area:
For each facility type:
Count the number of facilities of this type in this area
Create a single row in database where "row" value is the facility type
I think this is OK for our immediate use. Eventually, though, we may want to provide a facilities API, much as we did for the census data, that preserves as much of the original data as possible.
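The counting loop above could be sketched like this. The property names CACODE and ENGLISH CATEGORY are the ones used in the merged GeoJSON elsewhere in this thread; the output row shape follows the proposed long-format table:

```python
from collections import Counter

def aggregate_facilities(facilities):
    """facilities: iterable of property dicts, each with at least a
    'CACODE' (constituency area) and an 'ENGLISH CATEGORY' key.

    Returns long-format rows: (area, table, row, column, value),
    one per (area, facility type) pair, where value is the count.
    """
    counts = Counter((f["CACODE"], f["ENGLISH CATEGORY"]) for f in facilities)
    return [(area, "facility", category, "value", n)
            for (area, category), n in sorted(counts.items())]
```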
I uploaded all_pub_facility_with_CACODE.json to google drive. Please check.
The JSON file contains the following properties for each public facility: ENGLISH ADDRESS, TELEPHONE, FAX NUMBER, OPENING HOURS, 中文地址 (Chinese address), ENGLISH NAME, EMAIL ADDRESS, 中文名稱 (Chinese name), WEBSITE, ENGLISH CATEGORY, 中文類別 (Chinese category), CACODE
I need some time to tidy up my script and will share it in the backend_pub_facility branch later. For your information, I adopted [2] to determine whether a point is located in a particular polygon.
Note:
Reference [1] - http://www.census2011.gov.hk/pdf/maps/Map_KC.pdf [2] - http://www.ecse.rpi.edu/Homepages/wrf/Research/Short_Notes/pnpoly.html
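For reference, the pnpoly algorithm from [2] is short enough to sketch directly in Python (even-odd ray crossing; points exactly on an edge are ambiguous, which is usually acceptable for assigning facilities to constituency areas):

```python
def point_in_polygon(x, y, poly):
    """pnpoly-style even-odd test: True if (x, y) lies inside poly,
    a list of (x, y) vertices. Works for non-convex polygons too."""
    inside = False
    j = len(poly) - 1
    for i in range(len(poly)):
        xi, yi = poly[i]
        xj, yj = poly[j]
        # Toggle when a horizontal ray from the point crosses edge (j, i)
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside
```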
@2blam , it looks great. Since there are only 9K data points, which is small compared with the original table, we can keep them all and use the current API to aggregate them (adding a count aggregator). I can manage the conversion from your GeoJSON to the DB.
Two scripts (processPubFacilityCSV.py and addCACODEProperty.py) were added in the backend_pub_facility branch. I found that LIBRARY_LCSD_20131213.csv row 187 has a problem: the English name of that entry contains an extra \t and a duplicated name. I fixed this error and zipped all the csv files again as csv_files_err_fixed.zip. You can find this zip file on google drive (hk_pub_facility/csv/).
@2blam , is it possible to streamline all the operations from download to final generation, like what data_preparation.py does? For those that need a manual fix, we can use a similar strategy as translation_fix.py.
One of the major problems is the step of using QGIS to convert the csv (HK Grid coordinates) to GeoJSON (GPS coordinates). At the moment this step needs to be done manually. If we can find a command to do this conversion, then we could create a fully automated script from download to final generation.
On Wed, Feb 12, 2014 at 10:30 AM, HU, Pili notifications@github.com wrote:
This shell command will do it, but you need GDAL/OGR installed:
ogr2ogr -f "GeoJSON" OUTPUT_FILE INPUT_FILE -t_srs EPSG:4326 -s_srs EPSG:2326
EPSG:2326 is HK1980 projection, and EPSG:4326 is WGS84
Oops, I just realized it's only for shapefiles. I think I remember seeing a similar command somewhere for CSVs, though, let me check.
Apparently ogr2ogr can be used to read CSV files, but you need to generate a VRT file and reformat the coordinates in the CSV to well-known text. See this question (http://gis.stackexchange.com/questions/24947/how-can-i-convert-a-csv-file-of-wkt-data-to-a-shape-file-using-ogr2ogr) and the docs it links to: http://www.gdal.org/ogr/drv_vrt.html and http://www.gdal.org/ogr/drv_csv.html
Wow, thanks for the reference. Will have a look at this.
This command works: ogr2ogr -f GeoJSON OUTPUT_FILENAME -s_srs "EPSG:2326" PATH_TO_VRT_FILE -t_srs "EPSG:4326"
Will update the scripts accordingly.
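To fold that step into an automated pipeline, one option is a small wrapper around the working ogr2ogr invocation above (a sketch; the function names are hypothetical and GDAL/OGR must be installed and on the PATH):

```python
import subprocess

def ogr2ogr_command(vrt_path, out_path):
    """Assemble the HK1980 (EPSG:2326) -> WGS84 (EPSG:4326) conversion
    command, mirroring the invocation confirmed above."""
    return ["ogr2ogr", "-f", "GeoJSON", out_path,
            "-s_srs", "EPSG:2326", vrt_path, "-t_srs", "EPSG:4326"]

def convert_to_geojson(vrt_path, out_path):
    # Requires GDAL/OGR installed; raises CalledProcessError on failure.
    subprocess.check_call(ogr2ogr_command(vrt_path, out_path))
```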
I combined the conversion scripts into a single Python file. Please check the backend_pub_facility branch. Thanks.
OK, so I just reviewed the code here, and I think the strategy I will take is to pre-summarize to get per-constituency-area counts of the various facility types. So the new "table" (in the sense of the census tables in our source data) will hold one count row per area and facility type, as proposed above.
@hupili proposed storing the point coordinates as the value, but I don't think we get much benefit from storing the coordinates without also capturing the metadata (facility name, etc.), since we'd likely want to show those when mapping. Storing just the counts is a shortcut that requires basically no changes to the frontend, so we can get this feature out the door before the end of the week.
I will write the integration code so that if the facilities data is available, it'll be added to the generation of the various data files.
Standardize and integrate the HK public facility dataset