gazetteerhk / census_explorer

Explore Hong Kong's neighborhoods through visualizations of census data
http://gazetteer.hk
MIT License

Integrate public facility dataset #53

Closed hxu closed 10 years ago

hxu commented 10 years ago

The scripts to integrate the public facilities dataset are ready to go.

@hupili, can you run this on the backend after merging into master?

First, you need to move the geojson file that @2blam put on the share drive (hk_pub_facility/geojson/all_pub_facility_with_CACODE.json) into the scripts/data folder and rename it to pub_facility_cacode.geo.json.

Then open a Python shell in scripts and import public_facilities. Run public_facilities.main(), and it will append the necessary datapoints and translations to all the files.
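For reference, the two manual steps above could be scripted roughly as follows. This is only a sketch: `stage_geojson` and `integrate` are hypothetical helper names that do not exist in the repo, the source path assumes you have a local copy of the share-drive file, and the file is copied rather than moved so the original stays in place. Only `public_facilities.main()` comes from the scripts in this PR.

```python
import shutil
from pathlib import Path

# File name that the public_facilities script expects in scripts/data
TARGET_NAME = "pub_facility_cacode.geo.json"


def stage_geojson(src, scripts_dir="."):
    """Copy the shared GeoJSON into <scripts_dir>/data under the expected name.

    Hypothetical helper, not part of the repo.
    """
    dest = Path(scripts_dir) / "data" / TARGET_NAME
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dest)
    return dest


def integrate(src, scripts_dir="."):
    """Stage the file, then run the integration script from this PR.

    Must be run from inside the scripts folder so the import resolves.
    """
    stage_geojson(src, scripts_dir)
    import public_facilities  # module added by this PR

    public_facilities.main()
```

From a Python shell in scripts, this would be called as `integrate("/path/to/all_pub_facility_with_CACODE.json")`, with the path adjusted to wherever the share-drive file was downloaded.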

fixes #9

hupili commented 10 years ago

I'll process this one tonight. As for the raw data, is it possible to run our pipeline instead of downloading from GDrive? I saw that @2blam already has some scripts, but I have not tried them.

Anyway, this pipeline integration can be given lower priority.

hxu commented 10 years ago

It is possible, but I'd like to refactor the pipeline before integrating it.

When @2blam showed it to me at our last work day, he said that it took a couple of hours to finish. The current method uses character recognition to break each captcha, which is slow but quite clever. This code currently resides in @2blam's repo.

I think there is a way around this: requesting the file URLs directly. The captcha appears to block only the form submission that GETs the file, not a direct request for the file itself. This should speed up the download dramatically, putting it on par with the main pipeline.
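A rough sketch of what the direct-request download might look like, assuming the file URLs can be enumerated up front and that the server really does serve them without the captcha (not yet verified). The names `fetch_bytes` and `download_direct` are made up for illustration and are not taken from @2blam's scripts; the real URL pattern would have to come from there.

```python
import urllib.request
from pathlib import Path


def fetch_bytes(url, timeout=30):
    """Fetch one URL with a plain GET, bypassing the captcha-protected form."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_direct(urls, dest_dir, fetch=fetch_bytes):
    """Save each URL into dest_dir, skipping files that already exist.

    `fetch` is injectable so the loop can be exercised without network access.
    Returns the list of paths written during this call.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # Use the last path segment of the URL as the local file name
        name = url.rstrip("/").rsplit("/", 1)[-1]
        target = dest_dir / name
        if target.exists():
            continue  # already downloaded on a previous run
        target.write_bytes(fetch(url))
        saved.append(target)
    return saved
```

Skipping existing files means an interrupted run can be restarted without fetching everything again, which matters if the full set still takes a while.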

I haven't taken a close look at the middle part, but we should at least refactor the download part to get the run time down.

So should I go ahead and merge this into master, or will you do it when you run the script?