datasets / awesome-data

Curated list of quality open datasets
https://datahub.io/collections
797 stars 101 forks source link

Sub-country boundaries (vector) - Natural Earth Admin 1 #168

Open rufuspollock opened 8 years ago

rufuspollock commented 8 years ago

Geo Package "Admin 1 – States, Provinces"

Questions:

@lexman would you be interested in taking this one on?

lexman commented 8 years ago

Looks like something I could do. I'll have a look, but not before a few days.

rufuspollock commented 8 years ago

@lexman great - thanks and no rush!

lexman commented 8 years ago

I started looking at the data. It states clearly on the site (http://www.naturalearthdata.com/downloads/10m-cultural-vectors/10m-admin-1-states-provinces/) that this data is still in beta.

Is this one package all in one file, one package multiple files or one package per country?

I see two use cases : some could need the first administrative division of their own country. Some could need these for the whole world, or part of it. It seems easier to filter useful data than to aggregate several file. For example, the cartographic tool org has some optons for filtering data and none for aggregation of files. As for aggregating several datapackages it does not look frictionless. So I would go for one package, one file. Should the need for individual country arise, we can add all the countries. With in their own file inside the datapackage ?

I started a test project at https://github.com/lexman/geo-admin1

About resolution, we have to start with the 10m set because the other resolutions are not complete yet. They only cover Us and Canada : http://www.naturalearthdata.com/downloads/50m-cultural-vectors/50m-admin-1-states-provinces/

Problem is : the geojson file is 58 MB... I could not even push it on github ! I'll figure out something to reduce the size.

As for the naming... I leave that for the next comment.

rufuspollock commented 8 years ago

@lexman ok - remember we could break per country and start with 50m and just do US and Canada to start with.

In terms of size we could consider git lfs but i don't like that approach that much ...

lexman commented 8 years ago

I cut the floating part of the coordinates to 6 (about 10 cm on earth) which reduces the file to 30 MB, and it should not impact the topology of the areas. If someone has checked out the repo, I had to force change the history of the repository.

Don't worry about the source (50m or 10m), with tuttle it's reaaly easy to change later.

lexman commented 8 years ago

About the naming, my first though was : we'll make the best reference data, so we don't have to remind in the name of the package that the data comes from Natural Earth. I'm not so sure anymore.

This notion of admin1 is very difficult. If you see the test package on the viewer http://data.okfn.org/tools/view?url=https%3A%2F%2Fgithub.com%2Flexman%2Fgeo-admin1, you can see that the United Kindom is split in very small parts inside (smaller than England, Scottland, Wales and Northern Ireland), whereas the United States are split in states. I really don't know whether we should aim to put in the same level states and provinces.

For the moment, I think I will rename my test package to geo-ne-admin1 and clearly state that we transparantly expose what NE considers as admin1.

lexman commented 8 years ago

@rgrp , you said that

Is there a way to get some of this moving back upstream to natural earth - i know there was interaction with NE folks in #38

What are you thinking of ?

Also, as a french, I looked at the data and there are a bit of confusion beetwen departement and region.

The admin1 from Natural Earth is obviously still in beta.

lexman commented 8 years ago

@rgrp , @pdehaye Even if quality isn't perfect, do you want to add it or wait until it is better ?

rufuspollock commented 8 years ago

@lexman

/cc @pdehaye @jalbertbowden any thoughts?

jalbertbowden commented 8 years ago

looks good enough to me. clearly you all have been doing much more and are more aware but i don't see anything that makes it unusable. for viewing, perhaps create topojson version also? i know there are differences between the two files, but i don't see that as a reason to discredit topojson as a format for viewing/rendering, especially if one is aware that of the differences between geojson/topojson. converted admin1.geojson to .topojson and difference in size is telling: geo is 30.6mb, topo is 4.6. here it is in a gist for comparison. https://gist.github.com/jalbertbowden/e5a75241fb07c63f2762

lexman commented 8 years ago

@rgrp I've added a geo-ne-admin1-us repository with US at low scale. I perfectly understand the need for this dataset, however I'm not totaly satisfied for two reasons :

Anyway, it is ready to be published if needed :

rufuspollock commented 8 years ago

@lexman

One minor points:

Otherwise this looks great and we should get this in. @pdehaye are you happy to manage the formalities with @lexman ?

pdehaye commented 8 years ago

@rgrp sure, can manage

pdehaye commented 8 years ago

@lexman this looks good, but I need you to transfer ownership of the dataset to the "datasets" organisation. Let me know (here) when you have done that.

lexman commented 8 years ago

I'm back !

Following @rgrp I've put back the id column to be faithfull with NE. But I also added a state_code column, because this two letter code is the frictionless well known code used by every one, easy for joins.

And I've droped the preparation section. It will be nicer for the viewer, instead of a dead link.

@pdehaye I'm giving ownership in a minute...

lexman commented 8 years ago

Here are the final links after transfert

geo-ne-admin1 :

geo-ne-admin1-us :

@pdehaye I've created a pull-request in case you agree to add to add the datasets :)

jalbertbowden commented 8 years ago

realize i'm doing nothing but playing team spirit from afar, but great job! very, very, nicely done!

lexman commented 8 years ago

Team spirit is good... Thanks @jalbertbowden !

lexman commented 8 years ago

@pdehaye, @rgrp has merged the changes in the reference csv files and i've reloaded the list, but the new datapackage isn't a aimable on the core datapackage list. Do you have any idea ? Have I missed something ?

pdehaye commented 8 years ago

@lexman certainly part of the problem is that the core CSV was badly formatted (note: we also need to add it to the txt file). I fixed that now at https://github.com/datasets/registry/commit/a1018fd2f72de16b8a9b7e6c7b52d0369eee0964 But it still does not reload and I absolutely don't know why. I asked @rgrp to have a look.