This is a collaborative repository for contributing data to Data Commons.
If you are looking to use the data in Data Commons, please visit our API documentation.
Data Commons is an Open Knowledge Graph that provides a unified view across multiple public data sets and statistics. We've bootstrapped the graph with lots of data from US Census, CDC, NOAA, etc., and through collaborations with the New York Botanical Garden, Opportunity Insights, and more. However, Data Commons is meant to be for the community, by the community. We're excited to work with you to make public data accessible to everyone.
To see the extent of data we have today, browse the graph.
We welcome contributions to the graph! To get started, take a look at the resources in the docs directory and the list of pending imports.
Apache 2.0
Every data import involves some or all of the following: obtaining the source data, cleaning the data, and converting the data into one of Meta Content Framework (MCF), JSON-LD, or RDFa format. We ask that you check in all scripts used in this process, so that others can reproduce and continue your work.
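For a sense of what the MCF target format looks like, here is a minimal sketch of a single instance node. The identifier and values are chosen purely for illustration; see the resources in the docs directory for the authoritative format reference:

```
Node: dcid:geoId/06
typeOf: dcs:State
name: "California"
```

Each MCF node is a block of `property: value` lines introduced by a `Node:` line; an import typically emits many such blocks, one per entity or observation.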
Source data must meet the licensing policy requirements.
Scripts should go under the top-level `scripts/` directory, depending on the provenance and dataset. See the example for more detail.
We provide some utility libraries under the top-level `util/` directory. For example, this includes maps to and from common geographic identifiers.
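As an illustrative sketch of the kind of lookup these utilities provide (the names and exact contents here are invented for this example, not the actual `util/` API):

```python
# Illustrative only: a tiny two-way map between US state FIPS codes and
# USPS abbreviations, the kind of geographic-identifier lookup util/ offers.
FIPS_TO_USPS = {
    "06": "CA",  # California
    "36": "NY",  # New York
    "48": "TX",  # Texas
}
# Invert the map for lookups in the other direction.
USPS_TO_FIPS = {usps: fips for fips, usps in FIPS_TO_USPS.items()}

print(USPS_TO_FIPS["CA"])  # -> 06
```

Keeping both directions of the mapping in one place helps import scripts normalize whichever identifier a source dataset happens to use.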
Install Git LFS.

Fork this repo - follow the GitHub guide to forking a repo.

In your local clone, add this repo as the upstream remote and verify the remotes:

```shell
git remote add upstream https://github.com/datacommonsorg/data.git
git remote -v
```

The output should look like this:

```shell
origin    https://github.com/YOUR-GITHUB-USERNAME/data.git (fetch)
origin    https://github.com/YOUR-GITHUB-USERNAME/data.git (push)
upstream  https://github.com/datacommonsorg/data.git (fetch)
upstream  https://github.com/datacommonsorg/data.git (push)
```
Please ask to join the datacommons-developers Google group. For example, membership in this group provides access to debug logs of pre-submit tests that run for your Pull Request.
Contribute your changes by creating pull requests from your fork of this repo. Learn more in this step-by-step guide.
In summary, the steps in the development workflow are:
```shell
git checkout master
git pull upstream master
git checkout -b new_branch_name
# Make some code change
git add .
git commit -m "commit message"
git push -u origin new_branch_name
```
Then in your forked repo, you can send a Pull Request. Wait for approval of the Pull Request and merge the change.
If this is your first time contributing to a Google Open Source project, you may need to follow the steps in contributing.md.
Code style guidelines ease understanding and maintaining code. Automated checks enforce some of the guidelines.
Ensure prerequisites are installed
Install requirements and set up a virtual environment to isolate Python development in this repo.
```shell
python3 -m venv .env
source .env/bin/activate
pip3 install -r requirements_all.txt
```
Scripts should be accompanied by tests using the `unittest` framework, and named with a `_test.py` suffix.
A common test pattern is to drive your main processing function through some sample input files (e.g., with a few rows of the real csv/xls/etc.) and compare the produced output files (e.g., cleaned csv, mcf, tmcf) against expected ones. An example test following this pattern is here.
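The golden-file pattern described above can be sketched as follows. The `process_csv` function and the file contents are invented for illustration; a real import would import its processing function from the script's module and compare against a checked-in expected file:

```python
import os
import tempfile
import unittest


def process_csv(input_path: str, output_path: str) -> None:
    """Hypothetical cleaning step: uppercases the CSV header row."""
    with open(input_path) as f:
        lines = f.read().splitlines()
    cleaned = [lines[0].upper()] + lines[1:]
    with open(output_path, "w") as f:
        f.write("\n".join(cleaned) + "\n")


class ProcessCsvTest(unittest.TestCase):

    def test_golden_output(self):
        with tempfile.TemporaryDirectory() as tmp:
            input_path = os.path.join(tmp, "input.csv")
            output_path = os.path.join(tmp, "output.csv")
            # A few rows standing in for a small sample of the real source.
            with open(input_path, "w") as f:
                f.write("name,count\nfoo,1\n")
            process_csv(input_path, output_path)
            with open(output_path) as f:
                got = f.read()
            # In a real test, read this expected value from a checked-in
            # golden file next to the test instead of inlining it.
            self.assertEqual(got, "NAME,COUNT\nfoo,1\n")
```

There is no `unittest.main()` guard here because presubmit discovers and runs the test via `unittest discover`, as described below.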
IMPORTANT: Please ensure that there is an `__init__.py` file in the directory of your import scripts, and in every parent directory up to `scripts/`. This is necessary for the `unittest` framework to automatically discover and run your tests as part of presubmit.

NOTE: In the presence of `__init__.py`, you will need to adjust the way you import modules and run tests, as below.
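As a concrete (and entirely hypothetical) example, an import script living at `scripts/example_provenance/example_dataset/` would carry `__init__.py` files like this:

```
scripts/
  example_provenance/
    __init__.py
    example_dataset/
      __init__.py
      preprocess.py
      preprocess_test.py
```

The directory and file names here are placeholders; only the placement of the `__init__.py` files follows the rule above.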
You should import modules in your test with a dotted prefix, i.e. the full package path to the module.
Instead of running your test as `python3 foo_test.py`, run it as:

```shell
python3 -m unittest discover -v -s ../ -p "foo_test.py"
```
Consider creating a generic alias like this:

```shell
alias dc-data-py-test='python3 -m unittest discover -v -s ../ -p "*_test.py"'
```

Then, you can run your tests as:

```shell
dc-data-py-test
```
Python dependencies should be added to the `requirements_all.txt` file in the top-level folder. No other `requirements.txt` files are allowed.

Consider automating coding to satisfy some of these requirements.
Python formatting follows the Google style (yapf's `--style google`).

To run the tools via the command line (both installed after the setup steps above):
```shell
# Update (--in-place) all files
./run_tests.sh -f

# Produce differences between the current code and reformatted code. Empty
# output indicates correctly formatted code.
./run_tests.sh -l
```
To run a unit test, use a command like:

```shell
python3 -m unittest discover -v -s util/ -p "*_test.py"
```
The `discover` option searches (`-s`) the `util/` directory for files with filenames ending in `_test.py`. It considers all these files to be unit tests to be run. Output is verbose (`-v`).
We provide a utility to run all unit tests in a folder easily (e.g. `util/`):

```shell
./run_tests.sh -p util/
```

Or to run all tests and checks:

```shell
./run_tests.sh -a
```
NOTE: Please ensure that all tests are runnable via the test script, e.g. module paths should be relative to the root of the repo.
Occasionally, one has to disable style checking or formatting for particular lines.

To disable pylint for a particular line or block, use syntax like:

```python
# pylint: disable=line-too-long,unbalanced-tuple-unpacking
```
To disable yapf for some lines:

```python
# yapf: disable
... code ...
# yapf: enable
```
To lint a Go file, e.g. `foo.go`, use:

```shell
golangci-lint run foo.go
```

Files named with a `_test.go` suffix are considered tests. They are executed using `go test`.

For general questions or issues about importing data into Data Commons, please open an issue on our issues page. For all other questions, please share feedback on this form.
Note - This is not an officially supported Google product.