A public project with a pickle-based dataset was created
The API works per account: a single token manages the whole account, although projects can be either public or private. Is there only one token for read/write, and apparently the same one for admin? (not checked)
The token is configured in a simple hidden folder in the home directory and is required to communicate with datadotworld. Would only those registered in datadotworld get a token? (not checked)
The exercise used a `pickle` file - ~the Python package apparently doesn't provide commands for formats that lack `read` or `readline` methods; handling pickle files would require more elaborate code (e.g. https://pypkg.com/pypi/vecshare/f/vecshare/signatures.py); pickle is very Python-specific and shouldn't be used, but this means files loaded in other formats, such as compressed ones, might not be easily extracted~
* Using Spark and some big data capabilities; the platform offers some features to explore and manipulate datasets, including a Workspace
* Loaded *.csv files should be comma-separated to be easily used by the datadotworld platform capabilities; there are other simple restrictions but they won't affect the file if extracted
* There is a free course on DataCamp (https://campus.datacamp.com/courses/intro-to-dataworld-in-python/) that shows how to use the datadotworld API for Python in combination with the `pandas` library
* The API can be queried with SQL
* Example of working with Python AND GitHub with datadotworld ---> https://www.dataquest.io/blog/datadotworld-python-tutorial/
* Other capabilities, using SQL and the UI https://data.world/jonloyens/an-intro-to-dataworld-dataset
* Example of projects with Gov organizations (2016) http://www.esa.doc.gov/under-secretary-blog/dataworld-bring-valuable-commerce-datasets-social-network-data-people
* Help is scattered; for API capabilities in particular there are not many examples to be found - NOTE: this is not a relevant aspect, as users would mostly use the API to load and download data
* It has some presence in Medium with its own publication (https://meta.data.world/) as well as in some Data Science related articles
* Only up to 1GB allowed per dataset section (probably using Databricks or similar in the background?)
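As a rough illustration of the SQL point in the list above, here is a minimal sketch. The helper names `build_query` and `run_query` are hypothetical; the call to `datadotworld.query` and its `.dataframe` attribute reflect my understanding of the Python package and should be checked against its documentation. The table name `bouwprojecten` and the dataset key `ectest123/testdatasets` are taken from the demo project described in these notes.

```python
def build_query(table, limit=5):
    """Return a simple SELECT statement for one table (the table name is not escaped)."""
    return "SELECT * FROM {} LIMIT {}".format(table, int(limit))


def run_query(dataset_key, sql):
    """Run SQL against a data.world dataset.

    Assumes the datadotworld package is installed and a token is configured.
    """
    import datadotworld  # imported lazily so build_query works without the package
    return datadotworld.query(dataset_key, sql).dataframe


sql = build_query("bouwprojecten")
print(sql)
# df = run_query("ectest123/testdatasets", sql)  # needs network access and a token
```

The network call is left commented out so the snippet runs offline; only the query string is built.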
For more information about datadotworld and similar check the following list: https://docs.google.com/spreadsheets/d/1KptHzDHIdB3s1v5m1mMwphcwXhOVWdkRYdjEWW1dqrE/edit#gid=355072175
When I searched organically for a dataset on a topic without mentioning data.world, data.world would appear in a simple Google search if I added words like "open data" to the query.
The internal dataset search in data.world lists at most 100 results, no matter how many datasets match the query (this is something I personally dislike...); however, if we could "sell" the GitHub repo correctly, the 100-only list shouldn't affect discoverability.
They have an API with SDKs in Python and R only. I prepared something in Python for the demo; I am preparing the R code either alone or with someone else. We are missing a Node.js SDK though. It could be made for them... ;)
This is an exploration of external databases for some of the datasets, following a discussion started at https://github.com/freeCodeCamp/2017-new-coder-survey/issues/7 by @pdurbin.
A demo exercise is being built with my personal data. So far:
The `pip` package was updated very recently: https://pypi.python.org/pypi/datadotworld
It was found later that the following script would unpickle the pickled file:
import pickle
import datadotworld
dataset = datadotworld.load_dataset('https://data.world/ectest123/survey-2016') #notice the name in the url: I changed the name of the project to "Amphibians" but it was not updated in the url !!
dataset = datadotworld.load_dataset('https://data.world/ectest123/testdatasets') #the API seems to read one file per project and several for datasets; there is no distinction between both in the url, the owner must know
dataset.describe() # to get a description of the "dataset", which is actually the project
output was:
{'title': 'TestDatasets',
 'resources': [{'path': 'data/bouwprojecten.csv', 'name': 'bouwprojecten', 'format': 'csv'},
               {'format': 'pkl', 'path': 'original/allamphibians.pkl', 'name': 'original/allamphibians.pkl', 'mediatype': 'application/octet-stream', 'bytes': 269171},
               {'format': 'csv', 'path': 'original/bouwprojecten.csv', 'name': 'original/bouwprojecten.csv', 'mediatype': 'text/csv', 'bytes': 143452},
               {'format': 'zip', 'path': 'original/bouwprojecten.zip', 'name': 'original/bouwprojecten.zip', 'mediatype': 'application/zip', 'bytes': 18608}],
 'name': 'ectest123_testdatasets',
 'homepage': 'https://data.world/ectest123/testdatasets'}
for f in [dataset.dataframes, dataset.tables, dataset.raw_data]: #listing only raw_data because all pickled files are binary
    print(f)
output was:
{'bouwprojecten': LazyLoadedValue()}
{'bouwprojecten': LazyLoadedValue()}
{'original/bouwprojecten.zip': LazyLoadedValue(), 'bouwprojecten': LazyLoadedValue(), 'original/allamphibians.pkl': LazyLoadedValue(), 'original/bouwprojecten.csv': LazyLoadedValue()}
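To make the relationship between the `describe()` output and the `raw_data` keys easier to see, here is a small sketch that groups the resource names by format. The resource list is copied from the output shown earlier (trimmed to the fields used); `names_by_format` is a hypothetical helper, not part of the datadotworld package.

```python
# Resource entries copied from the describe() output, keeping only three fields.
resources = [
    {"path": "data/bouwprojecten.csv", "name": "bouwprojecten", "format": "csv"},
    {"path": "original/allamphibians.pkl", "name": "original/allamphibians.pkl", "format": "pkl"},
    {"path": "original/bouwprojecten.csv", "name": "original/bouwprojecten.csv", "format": "csv"},
    {"path": "original/bouwprojecten.zip", "name": "original/bouwprojecten.zip", "format": "zip"},
]


def names_by_format(entries):
    """Group resource names by their declared format."""
    grouped = {}
    for entry in entries:
        grouped.setdefault(entry["format"], []).append(entry["name"])
    return grouped


by_format = names_by_format(resources)
# The pickle and zip files are binary, so they are only reachable through raw_data.
binary_names = by_format["pkl"] + by_format["zip"]
print(binary_names)
```

The names collected here match the keys printed for `dataset.raw_data`, which is why the pickle and zip files are read from that view in the rest of the script.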
working on the pickle file
unpickled = pickle.loads(dataset.raw_data['original/allamphibians.pkl']) #use the `loads` method, not the `load` method
unpickled is my file!
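The difference between `loads` and `load` can be checked offline with a minimal sketch. The dictionary below is made-up sample data standing in for the bytes held in `dataset.raw_data['original/allamphibians.pkl']`.

```python
import io
import pickle

# Made-up sample object; its pickled bytes stand in for the raw_data entry.
original = {"species": ["Rana temporaria", "Bufo bufo"], "count": 2}
raw_bytes = pickle.dumps(original)

# pickle.loads accepts the bytes object directly.
from_loads = pickle.loads(raw_bytes)

# pickle.load expects an object with read/readline, so the bytes are wrapped first.
from_load = pickle.load(io.BytesIO(raw_bytes))

print(from_loads == original, from_load == original)
```

Either route gives back the same object; `loads` simply skips the wrapping step. As with any pickle, this should only be done with files from a trusted source.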
working on the zipfile
check the following references:
--- https://stackoverflow.com/questions/9887174/how-to-convert-bytearray-into-a-zip-file
--- https://docs.python.org/3/library/io.html
--- http://code.activestate.com/recipes/52265-read-data-from-zip-files/
import zipfile
import io
f = io.BytesIO(dataset.raw_data['original/bouwprojecten.zip'])
uzf = zipfile.ZipFile(f, "r")
uzf.namelist()
output => ['bouwprojecten.csv']
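Going one step further than `namelist()`, the following sketch reads the CSV inside the archive without writing anything to disk. It builds a tiny zip in memory as a stand-in for `dataset.raw_data['original/bouwprojecten.zip']`; the CSV contents and column names are invented for illustration.

```python
import csv
import io
import zipfile

# Build a small zip in memory; these bytes stand in for the raw_data entry.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("bouwprojecten.csv", "id,name\n1,alpha\n2,beta\n")
zip_bytes = buffer.getvalue()

# Open the archive from bytes and parse the first member as CSV.
with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as uzf:
    names = uzf.namelist()
    with uzf.open(names[0]) as member:
        text = io.TextIOWrapper(member, encoding="utf-8", newline="")
        rows = list(csv.DictReader(text))

print(names)
print(rows)
```

The encoding is assumed to be UTF-8 here; a real file may need a different one.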