datasets / awesome-data

Curated list of quality open datasets
https://datahub.io/collections

Population Reference Bureau #264

Open rufuspollock opened 5 years ago

rufuspollock commented 5 years ago

http://prb.org - detailed demographic and health data for US and internationally.

TODO: work out whether the data is open and where it can be obtained in bulk.

sglavoie commented 4 years ago

On the page https://www.prb.org/international/ I extracted the URLs where data is available:

I believe that merging all those CSV files could make for an interesting and more complete dataset as the data is clean and up to date (2019).

The same can be done for a number of other indicators without requiring much change to the Python script linked below, namely the ones specific to the United States:

For now, I made a simple script in my spare time to download the data and trim the unneeded lines in the CSV files (headers) and it can be found here: https://github.com/sglavoie/awesome-data-prb
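As a rough illustration of the "trim the unneeded lines" step (not the code from the linked repo), here is a minimal sketch. It assumes the downloaded files start with a few non-tabular lines, such as a title or a source note, before the real header row. The function name `trim_preamble`, the `min_columns` threshold, and the sample text are all invented for the example.

```python
import csv
import io


def trim_preamble(raw_text: str, min_columns: int = 2) -> str:
    """Drop leading lines until the first row with at least `min_columns` non-empty cells.

    Assumption: preamble lines (titles, notes, blanks) have fewer filled cells
    than the real header row. This is a heuristic, not PRB's documented layout.
    """
    rows = list(csv.reader(io.StringIO(raw_text)))
    kept = []
    for index, row in enumerate(rows):
        if sum(1 for cell in row if cell.strip()) >= min_columns:
            kept = rows[index:]  # keep the header row and everything after it
            break
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(kept)
    return out.getvalue()


# Made-up sample: two preamble lines and a blank line, then the table.
sample = "Some indicator title\nSource: example\n\nName,Year,Value\nA,2019,1\n"
print(trim_preamble(sample))
```

A fixed "skip the first N lines" rule would be simpler, but the cell-count heuristic tolerates files whose preamble length differs.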

It's useful for quickly downloading and viewing the data, but it does not process the data in any way: I couldn't find the license, so I don't know for sure whether I should continue in that direction. I am leaving this comment as a reference in case it speeds things up for anyone who wants to poke at it later.

rufuspollock commented 4 years ago

@sglavoie have you found any info on the PRB's licensing of this data?

Also do you have any estimate on a) how big the data is b) how much time it would take to collect this?

sglavoie commented 4 years ago

> licensing of this data

I am in the process of finding out and will let you know as soon as possible.

> a) how big the data is

With regard to the international data, this would be a very small dataset containing 21 CSV files, totalling 154.7 KiB (158,367 bytes). The CSV files on the page https://www.prb.org/international/ all have five columns and 235 rows of useful data.

The US data is a more sizeable dataset: 18 files, totalling 13.5 MiB (14,151,769 bytes). The files vary widely in size. The smallest is 3.4 KiB (3,526 bytes) and contains 5 columns × 105 rows. The largest is 2.2 MiB (2,358,702 bytes) and contains 5 columns × 60,687 rows.
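The KiB and MiB figures above follow from the byte counts using binary units (1 KiB = 1,024 bytes, 1 MiB = 1,024² bytes). A small sketch of that arithmetic, with illustrative helper names:

```python
def to_kib(n_bytes: int) -> float:
    """Convert bytes to kibibytes, rounded to one decimal place."""
    return round(n_bytes / 1024, 1)


def to_mib(n_bytes: int) -> float:
    """Convert bytes to mebibytes, rounded to one decimal place."""
    return round(n_bytes / 1024**2, 1)


# Byte counts quoted in the comment above.
print(to_kib(158_367))     # international data, total
print(to_mib(14_151_769))  # US data, total
print(to_kib(3_526))       # smallest US file
print(to_mib(2_358_702))   # largest US file
```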

> b) how much time it would take to collect this?

I already stored all the data with minimal processing (only removing superfluous headers). Curating it should be quick as the data is already clean: it would only be a matter of knowing if each individual CSV file should be published as a separate dataset or if you think we could produce more insight by merging them together in a given way.
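One possible way of "merging them together", sketched under the assumption that the files share the same columns: stack them into one long table and tag each row with the indicator it came from. The function name, column names, and sample rows below are invented for illustration and are not taken from the PRB files.

```python
import csv
import io


def stack_indicators(files: dict[str, str]) -> list[dict[str, str]]:
    """Concatenate CSV texts that share the same columns.

    `files` maps an indicator name to the CSV text for that indicator.
    Each output row gains an "indicator" key naming its source file.
    """
    merged = []
    for indicator, text in files.items():
        for row in csv.DictReader(io.StringIO(text)):
            merged.append({"indicator": indicator, **row})
    return merged


# Made-up sample data with hypothetical column names.
files = {
    "births": "Name,Year,Value\nA,2019,10\nB,2019,12\n",
    "deaths": "Name,Year,Value\nA,2019,7\n",
}
rows = stack_indicators(files)
print(len(rows), rows[0])
```

Keeping one file per indicator stays closest to the source, while the stacked form makes it easier to compare indicators for the same place and year.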

rufuspollock commented 4 years ago

@sglavoie ok let's start curating this - you can do this as well as emojis 😄 (still check the license info though)

sglavoie commented 4 years ago

Perfect! I'm on it! I couldn't get through when I called earlier today, but I've sent them inquiries by other means and will keep calling until this matter is resolved :smile:.

sglavoie commented 4 years ago

About the data

Done! It's available as CSV files with a validated datapackage.json. A minimal repo is available here containing the script used to make this possible. (Update: The data is also now available in the repo.) For convenience, I temporarily added the cleaned data to my Google Drive: it can be found at this link.
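For readers unfamiliar with datapackage.json, here is a minimal sketch of how such a descriptor could be assembled with the standard library. It is not the descriptor from the repo: the package name, file path, and field list are placeholders, and the sketch does not perform the validation step mentioned above.

```python
import json
from pathlib import Path


def build_descriptor(package_name: str, csv_paths: list[str],
                     fields: list[tuple[str, str]]) -> dict:
    """Build a minimal Data Package descriptor for CSV files sharing one schema."""
    schema = {"fields": [{"name": name, "type": ftype} for name, ftype in fields]}
    resources = [
        {
            # Resource names are derived from the file name (hypothetical convention).
            "name": Path(path).stem.lower().replace("_", "-"),
            "path": path,
            "format": "csv",
            "schema": schema,
        }
        for path in csv_paths
    ]
    return {"name": package_name, "resources": resources}


# Placeholder inputs; the real column names and paths may differ.
descriptor = build_descriptor(
    "example-package",
    ["data/some_indicator.csv"],
    [("name", "string"), ("year", "integer"), ("value", "number")],
)
print(json.dumps(descriptor, indent=2))
```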

About the license

Sorry, this is still messy. The following are all 25 results that come up on Google when searching for site:prb.org "license". As far as I'm aware, there is no clear information on their website stating how their public data can be reused. From files stored on their website, I was able to glean the following:

The suggested citation, if you quote from this publication, is [...] For permission to reproduce portions from the Population Bulletin, write to PRB, Attn: Permissions, or permissions@prb.org.

rufuspollock commented 4 years ago

@sglavoie I would add the data to the repo rather than Google Drive. I also suggest naming the repo after the data package name.

Re the license: I would not focus on the license for text publications, as that is completely different. In general, data like this, which is clearly factual, is effectively public domain (at least in the US), so I think we can say that, with a suitable disclaimer that this is our opinion. You could also email them asking them to clarify the public domain status / licensing of their data.

sglavoie commented 4 years ago

Thank you for your feedback!

rufuspollock commented 4 years ago

@sglavoie can we move this item to datasets github org now ...

sglavoie commented 4 years ago

@rufuspollock, the script works really well in the testing I have done. However, there are two possible issues I am aware of:

rufuspollock commented 4 years ago

@sglavoie can you avoid pandas perchance? And use e.g. dataflows instead ...

Click is fine as a wrapper and is a lightweight dependency ...

sglavoie commented 4 years ago

@rufuspollock, sure!

Click isn't much of a bother from the user's perspective either: I can just remove the need for user input in one place and everything else stays exactly the same. That will be quick!

As for pandas, I can certainly make the necessary changes; it's only one function in the script. I was working on a high-priority project with a deadline tomorrow and didn't have time to get back to this issue earlier, but I will keep it in mind and will do it as soon as you give me the green light :wink:.

sglavoie commented 4 years ago

@rufuspollock, this should now be good to go! :slightly_smiling_face:

The new version of this script is available in the same repo as before: https://github.com/sglavoie/population_reference_bureau

rufuspollock commented 4 years ago

@sglavoie great - final point is that I would use - rather than _ in repo names (and in the package name). And then can you move this over to the datasets org here.
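As an illustration of that naming convention, here is a small sketch that turns an underscore-separated name into a hyphenated, lowercase one. The helper name `to_repo_name` is hypothetical, and collapsing every non-alphanumeric run into a single hyphen is an assumption made for the example, not a rule stated in this thread.

```python
import re


def to_repo_name(name: str) -> str:
    """Lowercase a name and replace runs of non-alphanumeric characters with '-'."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


print(to_repo_name("population_reference_bureau"))
```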

sglavoie commented 4 years ago

@rufuspollock, I've made the necessary changes and updated the data to reflect those changes.

I cannot transfer ownership of the repo to the "datasets" organization (I would need to be a member to do so), so I went ahead and transferred it to you instead, with the best of intentions. :crossed_fingers: