Zaczero / osm-countries-geojson

🗺️ OpenStreetMap Countries GeoJSON — updated daily!
https://osm-countries-geojson.monicz.dev
GNU Affero General Public License v3.0

A few suggestions #1

Open codemasher opened 3 months ago

codemasher commented 3 months ago

Hey, I just found this awesome library and I'd like to suggest some things for improvement:

I can help you with any of these tasks if necessary.

Cheers!

Zaczero commented 3 months ago

Wow! Those are some exceptionally good suggestions. :star:

I have a question:

I guess you had the first option in mind, but I am worried that packages will be ~50MB in size (gz compressed). Perhaps that's fine?

About your 2nd point (and 4th partially): I don't want to be GitHub-dependent, and serving files from my own server is my preference. At some point I'll remove the /geojson folder from this Git repository (see readme).

Using releases sounds like a smart option, especially if we integrate GitHub+NPM+Packagist into a single workflow. Releases would be done monthly, but my server would continue to serve updates daily.

I'll definitely use your suggestions to improve this project. Please give me some time as I am currently focusing on openstreetmap-ng.

codemasher commented 3 months ago

Thank you for the quick reply!

When publishing to npm/packagist [...]

As for npm/packagist releases: I think one resolution is initially good enough for these (e.g. 001, which clocks in at about 15MB). JS may struggle to load huge JSON files, especially in browsers, while PHP doesn't bat an eye loading 100MB.

Thinking about it, this might require some extra measures as both of these services work differently when packing the files:

Additionally, a little helper library for each supported language would be great to ease implementation, maybe along with a lookup table (country code => feature id) so that single countries can be accessed easily without iterating the whole features array over and over. From what I understand, countries are fetched asynchronously and may appear in random order [?], so the lookup table might need to be refreshed on each package build.
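Just to illustrate the lookup table idea, a minimal sketch (the property name, file name and layout are assumptions on my part, not the actual package contents):

// hypothetical build step: map 3-letter country codes to feature indices,
// assuming each feature carries its ISO code in properties
const geojson = require('./geojson/001.json'); // assumed file name

const lookup = {};
geojson.features.forEach((feature, i) => {
  const code = feature.properties['ISO3166-1:alpha3']; // assumed property name
  if (code) lookup[code] = i;
});

// regenerated on every package build, so it stays valid even if the
// feature order changes between builds
module.exports = lookup;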

About your 2nd point (and 4th partially) [...]

I meant that you should remove the gz and br compressed files from this repository, as binaries only create bloat in the git history - build artifacts are the better place for those instead. You could even upload the files to your own server during the build. See it as a backup in case your own server is down for whatever reason :)

Using releases sounds like a smart option

Absolutely! Packagist specifically is easy to integrate as it uses GitHub tags/releases and automatically updates via a webhook. NPM is a bit nasty, especially when you use 2FA - there are a bunch of auto-publish actions out there, but I haven't used any of them yet. NPM disregards GitHub tags and uses its own versioning (the version has to be specified in package.json), so the workflow would be: update the NPM version number -> publish to NPM -> create a GH tag/release during publish -> Packagist auto-updates.

As for PHP/JS helper libraries and integration boilerplate I'd gladly submit PRs if you don't mind.

Oh, one thing maybe to do beforehand would be to move the Python files out of the repository root, as it can get messy with all the package-related files in there. I'd put them in a separate tools directory, and language-helper-related files in separate directories such as js, php and python or something.

Zaczero commented 3 months ago

I don't think publishing a single format is a good choice. Some developers may prefer to use smaller files to accelerate their computation. Others may prefer high-quality results. In the npm case I think we could include all JSON files, and a JavaScript bundler will usually include only the imported ones. There are also plenty of JavaScript apps that run server-side where loading big files is fine.

e.g.,

// only the files that are actually imported end up in the bundle
const data1 = require('package/data1.json');
const data2 = require('package/data2.json');
console.log(data1);
console.log(data2);

I don't think the packages should include any additional utility functions; GeoJSON is quite a basic format, so filtering it is simple. This would require additional maintenance for very little benefit to the project.

Agreed, the repository will probably need a small reorganization if we are going to support npm/packagist packaging.

codemasher commented 3 months ago

I don't think publishing a single format is a good choice

To be honest, I'm not sure what the best course of action would be here; I was just thinking of possible bloat. Sure, in JS, when you bundle a library, it will only include the required files. In PHP on the other hand, all files will be downloaded and hog possibly unnecessary space on the server.

I don't think the packages should include any additional utility functions

For PHP at least, it should include a (namespaced) loader so that the files can be opened easily; otherwise one would have to find them in the vendor directory manually, which isn't good practice. The JS equivalent could be an index.js with module exports, for example.
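For the JS side, a rough sketch of what I mean (file names and export shape are just assumptions):

// hypothetical index.js exposing the bundled GeoJSON files as named loaders,
// so consumers never have to reach into node_modules/vendor paths directly
module.exports = {
  load001: () => require('./geojson/001.json'),
  load010: () => require('./geojson/010.json'),
  load100: () => require('./geojson/100.json'),
};

A consumer would then only ever import the package by name and call a loader, without caring where the files actually live inside it.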

Speaking of file size, I noticed that the files contain a LOT of metadata - it's perhaps the most comprehensive set of country metadata I've seen in a while (which is amazing!). In fact, there is so much metadata that it warrants its own library (which is a whole new can of worms I don't want to open right now).

Zaczero commented 3 months ago

It's fine if packages include basic loader methods. I just don't want to include methods for filtering data or doing anything not related to loading it.

In PHP on the other hand, all files will be downloaded and hog possibly unnecessary space on the server.

When talking about disk usage, we could serve PHP files that contain gz-compressed strings and call gzdecode on import. However, there is no widely compatible JS equivalent (only newer versions support GZ decompression methods).
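For what it's worth, a Node-only sketch of that idea (it works server-side with the built-in zlib, but not in older browsers, which is the compatibility concern):

// hypothetical Node.js loader: ship 001.json.gz and decompress once at require() time
const fs = require('fs');
const path = require('path');
const zlib = require('zlib');

// file name and layout are assumptions
const buffer = fs.readFileSync(path.join(__dirname, 'geojson', '001.json.gz'));
module.exports = JSON.parse(zlib.gunzipSync(buffer).toString('utf8'));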

Speaking of filesize, I noticed that the files contain a LOT of metadata

This metadata comes directly from OpenStreetMap. I believe most of the file size is due to encoding the polygons in the GeoJSON format. There are other, more efficient storage methods. Perhaps we could publish packages that store data in such a format and then convert it into GeoJSON on load? This should work in pretty much any language, including JS and PHP.

codemasher commented 3 months ago

When talking about disk usage, we could serve PHP files that contain gz-compressed strings and call gzdecode on import.

Let's better not - that's not good practice either and could raise suspicion of malicious code, aside from re-introducing the git history bloat issue. Pure JSON files are perfectly fine in PHP. It's what, like 175MB of data? Thinking about it again, anyone who includes such a library is probably aware of the bloat it brings, so there's that.

Perhaps we could publish packages that store data in this format

I was thinking in a different direction: the country metadata (which makes up ~2MB per file) is useful on its own, outside of the GeoJSON format. It could be edited a bit so that it becomes more accessible (see the gist I linked) and can be included in other projects. The Feature/Properties/tags data blob has only little use in libraries like Leaflet, which usually only render the name field by default. The moment one wants to make use of the extra data, they would have to write custom code that accesses the fields, in which case they could just as well load the same data from a different library instead. So, for example, the properties here would only contain the name field with the localized name (default value) and one custom field with the 3-letter ISO country code as an identifier, by which the extra data can be loaded.
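To make that split concrete, here is a rough sketch of what I have in mind (all names, the metadata package and its shape are made up for illustration):

// hypothetical stripped-down feature: geometry plus just enough to identify the country
const feature = {
  type: 'Feature',
  properties: { name: 'Deutschland', code: 'DEU' }, // localized name + ISO 3166-1 alpha-3
  geometry: { /* MultiPolygon as before */ },
};

// the full OSM tag blob would live in a separate metadata package,
// keyed by the same ISO code (package name is an assumption)
const metadata = require('osm-countries-metadata');
console.log(metadata[feature.properties.code]); // e.g. population, flag, wikidata tags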

Zaczero commented 3 months ago

I think it's a good idea to make metadata its own dedicated package.

About npm/php, I want to avoid having users download and store massive files when it could be done better. The polyline encoding will notably reduce disk usage (at the small cost of increased load time). I have seen some apps using this encoding, so it seems reliable.

codemasher commented 3 months ago

About npm/php, I want to avoid having users download and store massive files when it could be done better. The polyline encoding will notably reduce disk usage (at the small cost of increased load time).

Ok so I've looked into the polyline encoder and I think there are several issues with it:

The thing is: no matter what percentage of the file size you save, these files remain HUGE due to their nature, which is fine (anything over a few kilobytes is huge for a script library). Just add a file size warning in bold at the top of the documentation and be done with it :)

One way to reduce the data pushed to a client would be what I suggested earlier: add a lookup table and a selector function so that it's possible to return only a selection of countries.
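Something along these lines, reusing the hypothetical lookup table from my earlier comment (names and file layout are still just placeholders):

const geojson = require('./geojson/001.json'); // assumed file name
const lookup = require('./lookup');            // hypothetical code -> feature index table

// return a FeatureCollection containing only the requested countries
function selectCountries(codes) {
  return {
    type: 'FeatureCollection',
    features: codes
      .map(code => geojson.features[lookup[code]])
      .filter(Boolean), // silently drop unknown codes
  };
}

// e.g. ship only the DACH countries to the client
const subset = selectCountries(['DEU', 'AUT', 'CHE']);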

Here is a similar package btw. - they provide a separate package for each resolution, which is a bit messy to maintain but allows dependents to pick exactly the size for their purpose.

Zaczero commented 3 months ago

It adds a dependency/additional code which needs to be maintained

Not really - the decoder is very short and practically never changes. That's why you see nobody maintaining it; it's setup-and-forget. We would obviously include the implementation in the package, so there are no external dependencies.
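For reference, a generic sketch of the decoding side in JavaScript (this is the textbook Google polyline algorithm, not code from this project):

// decode a polyline string into [lng, lat] pairs as used in GeoJSON coordinates;
// precision is the number of decimal places used when encoding
function decodePolyline(encoded, precision = 5) {
  const factor = Math.pow(10, precision);
  const coords = [];
  let index = 0, lat = 0, lng = 0;

  // each coordinate is stored as a delta to the previous one
  while (index < encoded.length) {
    for (const axis of ['lat', 'lng']) {
      let result = 0, shift = 0, byte;
      do {
        byte = encoded.charCodeAt(index++) - 63; // 5 data bits per character
        result |= (byte & 0x1f) << shift;
        shift += 5;
      } while (byte >= 0x20);                    // continuation bit set
      const delta = result & 1 ? ~(result >> 1) : result >> 1;
      if (axis === 'lat') lat += delta; else lng += delta;
    }
    coords.push([lng / factor, lat / factor]);
  }
  return coords;
}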

as it is a proprietary algorithm

There is nothing proprietary about it. It's free and open source software.

that consume GeoJSON incompatible

It would be decoded into GeoJSON on package import. GitHub, since it cannot execute code, would keep using the .geojson files. We are talking about packages specifically here.

The algorithm is lossy

The precision is configurable; we can set it arbitrarily. OpenStreetMap has a database precision of 7 decimal places, and it's not possible to go higher than that.

The compression rate may vary

Quickly testing this method on the 14.3MB file with maximum precision (7), I got a 6.4MB file. In the npm case, when additional compression (I tested gzip) is performed by the browser/webserver, 14.3MB becomes 5.4MB and 6.4MB becomes 3.5MB.
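If you want to reproduce that kind of comparison yourself, a quick Node one-off along these lines should do (file name is a placeholder):

// rough sketch: compare raw vs gzipped size of a GeoJSON file
const fs = require('fs');
const zlib = require('zlib');

const raw = fs.readFileSync('001.geojson'); // placeholder file name
const gzipped = zlib.gzipSync(raw);
console.log((raw.length / 1e6).toFixed(1) + ' MB raw');
console.log((gzipped.length / 1e6).toFixed(1) + ' MB gzipped');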

codemasher commented 3 months ago

That's why you see nobody maintaining it; it's setup-and-forget.

I think we both know that's not how it works. PHP, for example, is currently evolving so fast and removing so much decades-old tech debt that it has broken countless libraries within the past 4 minor versions just by adding deprecation warnings (been there...).

There is nothing proprietary about it. It's free and open source software.

By proprietary I mean it's a vendor-specific thing and not part of the GeoJSON spec, which in turn would make the JSON files dependent on the helper libraries, which I think should be avoided.

The precision is configurable, we can set it arbitrarily

The algorithm's precision is fixed to 5 (Google supports 9 or 10, I think). And of course we want the highest precision OSM offers, which in turn results in a large file. That's fine - it's what was requested. The question shouldn't be "how can we reduce file size?" but "what's the best way to distribute the files of any resolution?".

Zaczero commented 3 months ago

By proprietary I mean it's a vendor-specific thing and not part of the GeoJSON spec, which in turn would make the JSON files dependent on the helper libraries, which I think should be avoided.

I don't think you got the idea. The goal is to have the formatX->GeoJSON conversion happen during the loading process of the npm/PHP package. Then we can use whatever formatX we want to save on disk space and download size.
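Concretely, the import-time conversion could look roughly like this (a sketch only; the stored layout and file names are assumptions, and decodePolyline is the helper sketched above):

// hypothetical package entry point: geometries are stored as polyline strings
// and decoded into GeoJSON coordinates once, when the package is first required
const encoded = require('./encoded/001.json'); // assumed file layout
const decodePolyline = require('./decode');    // decoder from the sketch above

module.exports = {
  type: 'FeatureCollection',
  features: encoded.features.map(f => ({
    type: 'Feature',
    properties: f.properties,
    geometry: {
      type: 'MultiPolygon',
      coordinates: f.rings.map(polygon =>
        polygon.map(ring => decodePolyline(ring, 7))), // OSM precision: 7 decimal places
    },
  })),
};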

The algorithm's precision is fixed to 5

No, it's not.

"what's the best way to distribute the files of any resolution?"

This project is hard-capped by the resolution of the OSM database. Storing any precision higher than 7 will not yield better results.

codemasher commented 3 months ago

This project is hard-capped by the resolution of the OSM database. Storing any precision higher than 7 will not yield better results.

I didn't say that. I said it should provide the maximum possible resolution.

Here, the Google documentation literally says that the polyline encoding algorithm is fixed to 5 decimal places, and why that is:

"Given a maximum longitude of +/- 180 degrees to a precision of 5 decimal places (180.00000 to -180.00000), this results in the need for a 32 bit signed binary integer value."

The goal is to have the formatX->GeoJSON conversion made during the loading process of the npm/PHP package.

Again, making it dependent on a helper library is not good - it would limit usage to the available helpers. That's something another library could do instead, which in turn would ease distribution.

With all that said, I could provide you with a proof-of-concept, as further discussion about file size is probably going nowhere.