freeCodeCamp / open-data

TEST: Suggesting the use of external databases: datadotworld #6

Closed evaristoc closed 6 years ago

evaristoc commented 6 years ago

This is an exploration of external databases for some of the datasets, following a discussion started at https://github.com/freeCodeCamp/2017-new-coder-survey/issues/7 by @pdurbin.

A demo exercise is being built with my personal data. So far:

import datadotworld

dataset = datadotworld.load_dataset('https://data.world/ectest123/survey-2016')  # notice the name in the URL

I renamed the project to "Amphibians", but the name in the URL was not updated!

# The API seems to read one file per project and several per dataset; the URL does not distinguish between the two, so the owner has to know which one it is.
dataset = datadotworld.load_dataset('https://data.world/ectest123/testdatasets')

dataset.describe() # to get a description of the "dataset", which is actually the project

output was:

{'title': 'TestDatasets', 'resources': [{'path': 'data/bouwprojecten.csv', 'name': 'bouwprojecten', 'format': 'csv'}, {'format': 'pkl', 'path': 'original/allamphibians.pkl', 'name': 'original/allamphibians.pkl', 'mediatype': 'application/octet-stream', 'bytes': 269171}, {'format': 'csv', 'path': 'original/bouwprojecten.csv', 'name': 'original/bouwprojecten.csv', 'mediatype': 'text/csv', 'bytes': 143452}, {'format': 'zip', 'path': 'original/bouwprojecten.zip', 'name': 'original/bouwprojecten.zip', 'mediatype': 'application/zip', 'bytes': 18608}], 'name': 'ectest123_testdatasets', 'homepage': 'https://data.world/ectest123/testdatasets'}
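The description is a plain dictionary, so the files in the project can be inspected with ordinary Python. A minimal sketch, reusing the output above as a literal; the helper name `paths_by_format` is only illustrative and is not part of datadotworld:

```python
from collections import defaultdict

# Output of dataset.describe() from above, pasted as a literal.
description = {
    'title': 'TestDatasets',
    'resources': [
        {'path': 'data/bouwprojecten.csv', 'name': 'bouwprojecten', 'format': 'csv'},
        {'format': 'pkl', 'path': 'original/allamphibians.pkl',
         'name': 'original/allamphibians.pkl',
         'mediatype': 'application/octet-stream', 'bytes': 269171},
        {'format': 'csv', 'path': 'original/bouwprojecten.csv',
         'name': 'original/bouwprojecten.csv',
         'mediatype': 'text/csv', 'bytes': 143452},
        {'format': 'zip', 'path': 'original/bouwprojecten.zip',
         'name': 'original/bouwprojecten.zip',
         'mediatype': 'application/zip', 'bytes': 18608},
    ],
    'name': 'ectest123_testdatasets',
    'homepage': 'https://data.world/ectest123/testdatasets',
}


def paths_by_format(desc):
    """Group resource paths by their 'format' field (illustrative helper)."""
    grouped = defaultdict(list)
    for resource in desc['resources']:
        grouped[resource['format']].append(resource['path'])
    return dict(grouped)


formats = paths_by_format(description)
for fmt, paths in sorted(formats.items()):
    print(fmt, paths)
```

Grouping by format makes it easy to see which files are the uploaded originals (under `original/`) and which one the platform exposes as a table (`data/bouwprojecten.csv`).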

# listing only raw_data because all pickled files are binary
for f in [dataset.dataframes, dataset.tables, dataset.raw_data]:
    print(f)

output was:

{'bouwprojecten': LazyLoadedValue()}

{'bouwprojecten': LazyLoadedValue()}

{'original/bouwprojecten.zip': LazyLoadedValue(), 'bouwprojecten': LazyLoadedValue(), 'original/allamphibians.pkl': LazyLoadedValue(), 'original/bouwprojecten.csv': LazyLoadedValue()}

working on the pickle file

import pickle

unpickled = pickle.loads(dataset.raw_data['original/allamphibians.pkl'])  # raw_data gives bytes, so use pickle.loads, not pickle.load

unpickled is my file!
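Why `loads` and not `load`: `pickle.loads` takes a bytes object, while `pickle.load` expects a file-like object. A self-contained sketch of the difference, with a made-up record standing in for the downloaded bytes:

```python
import io
import pickle

# Stand-in for dataset.raw_data['original/allamphibians.pkl'].
# The record is made up for illustration; any pickled bytes behave the same.
raw_bytes = pickle.dumps({'species': 'example', 'count': 3})

# pickle.loads works directly on a bytes object.
from_bytes = pickle.loads(raw_bytes)

# pickle.load needs a file-like object, so the bytes must be wrapped first.
from_file = pickle.load(io.BytesIO(raw_bytes))

print(from_bytes == from_file)
```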

working on the zipfile

check the following references:

--- https://stackoverflow.com/questions/9887174/how-to-convert-bytearray-into-a-zip-file

--- https://docs.python.org/3/library/io.html

--- http://code.activestate.com/recipes/52265-read-data-from-zip-files/

import zipfile
import io

f = io.BytesIO(dataset.raw_data['original/bouwprojecten.zip'])
uzf = zipfile.ZipFile(f, "r")
uzf.namelist()

output => ['bouwprojecten.csv']
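The same pattern can be tried without a data.world account. A minimal sketch that builds a small archive in memory as a stand-in for the downloaded bytes (the CSV content is made up) and then goes one step further by reading the member file:

```python
import csv
import io
import zipfile

# Build a small zip archive in memory as a stand-in for
# dataset.raw_data['original/bouwprojecten.zip']; the CSV content is made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('bouwprojecten.csv', 'id,name\n1,alpha\n2,beta\n')
zip_bytes = buffer.getvalue()

# Wrap the bytes, open the archive, then read the first member as CSV.
with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as uzf:
    names = uzf.namelist()
    with uzf.open(names[0]) as member:
        # Members are opened in binary mode, so decode before parsing.
        text = io.TextIOWrapper(member, encoding='utf-8', newline='')
        rows = list(csv.reader(text))

print(names)
print(rows)
```

The `utf-8` encoding is an assumption; the real file may need a different one.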


* Using Spark and some big data capabilities; the platform offers some features to explore and manipulate datasets, including a Workspace
* Loaded *.csv files should be comma-separated to be easily used by the datadotworld platform capabilities; there are other simple restrictions but they won't affect the file if extracted
* There is a free course on DataCamp (https://campus.datacamp.com/courses/intro-to-dataworld-in-python/) that shows how to use the datadotworld API for Python in combination with the `pandas` library
* The API can be queried with SQL
* Example of working with Python AND GitHub with datadotworld ---> https://www.dataquest.io/blog/datadotworld-python-tutorial/
* Other capabilities, using SQL and the UI https://data.world/jonloyens/an-intro-to-dataworld-dataset
* Example of projects with Gov organizations (2016) http://www.esa.doc.gov/under-secretary-blog/dataworld-bring-valuable-commerce-datasets-social-network-data-people
* Help is scattered; especially for the API capabilities, there are not many examples to be found - NOTE: this is not a critical issue, as users would mostly use the API to load and download data
* It has some presence in Medium with its own publication (https://meta.data.world/) as well as in some Data Science related articles
* Only up to 1GB allowed per dataset section (probably using Databricks or similar in the background?)
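Regarding the comma-separated requirement in the list above: a file that uses another delimiter can be rewritten before upload. A minimal sketch with the standard `csv` module; the semicolon-delimited sample and the function name `to_comma_separated` are assumptions for illustration, not taken from the datasets discussed here:

```python
import csv
import io


def to_comma_separated(text, delimiter=';'):
    """Rewrite delimited text as comma-separated CSV (illustrative helper)."""
    reader = csv.reader(io.StringIO(text, newline=''), delimiter=delimiter)
    out = io.StringIO(newline='')
    writer = csv.writer(out, lineterminator='\n')
    writer.writerows(reader)
    return out.getvalue()


# Made-up sample; the quoted field contains a comma on purpose.
sample = 'id;name;note\n1;alpha;"a, b"\n'
converted = to_comma_separated(sample)
print(converted)
```

Going through `csv.reader` and `csv.writer` rather than a plain string replace keeps fields that already contain commas correctly quoted.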

For more information about datadotworld and similar platforms, check the following list: https://docs.google.com/spreadsheets/d/1KptHzDHIdB3s1v5m1mMwphcwXhOVWdkRYdjEWW1dqrE/edit#gid=355072175

evaristoc commented 6 years ago

This link might be useful: https://stackoverflow.com/questions/30469575/how-to-pickle-and-unpickle-to-portable-string-in-python-3

Relevant... https://stackoverflow.com/questions/9887174/how-to-convert-bytearray-into-a-zip-file

evaristoc commented 6 years ago

@QuincyLarson Some web analytics facts about data.world: