gdcc / pyDataverse

Python module for Dataverse Software (dataverse.org).
http://pydataverse.readthedocs.io/
MIT License

How to read datafile as pandas, as well as original format #80

Closed: kuriwaki closed this issue 3 years ago

kuriwaki commented 3 years ago

The datafest (i.e. pre-v0.3.0) code has us importing datafiles as follows. However, this code currently gives me metadata about the file, not the 3000 x 300 tabular dataset it is supposed to be (https://doi.org/10.7910/DVN/HIDLTK).

Can you show (1) how to import this as a pandas dataframe, and (2) whether it's possible to set an option such as format = original in get_datafile to download the original file rather than the ingested version? For example, the file in question was originally a CSV, but became a TSV when it was ingested into Dataverse.

I wasn't sure how this worked after reading the get_datafile page in Docs/Reference. If there's another place in the Docs I should look, a pointer would be helpful too.

import io
import pandas as pd
from pyDataverse.api import NativeApi

doi = "doi:10.7910/DVN/HIDLTK"
base_url = "https://dataverse.harvard.edu"

api = NativeApi(base_url)
resp = api.get_dataset(doi)
datafiles = resp.json()["data"]["latestVersion"]["files"]

# confirm file
print(datafiles[8]["dataFile"]["file_name"])
# 'us_county_confirmed_cases.tab'
print(datafiles[8]["dataFile"]["id"])
# 4360740

# datafile
datafile_id = "4360740"
resp = api.get_datafile(datafile_id)

# try to read as csv
data = io.StringIO(str(resp.content, 'utf-8'))
us_states_cases = pd.read_csv(data, sep='\t') # any option to get the data so to read as csv?
print(us_states_cases.head(10)) # this gives a long line of metadata, not a clean dataframe

skasberger commented 3 years ago

@kuriwaki: here is the updated version, which should work.

import io
import pandas as pd
from pyDataverse.api import NativeApi
from pyDataverse.api import DataAccessApi

doi = "doi:10.7910/DVN/HIDLTK"
base_url = "https://dataverse.harvard.edu"

n_api = NativeApi(base_url)
resp = n_api.get_dataset(doi)
datafiles = resp.json()["data"]["latestVersion"]["files"]

# confirm file
print(datafiles[8]["dataFile"]["filename"])
# 'us_county_confirmed_cases.tab'
print(datafiles[8]["dataFile"]["id"])
# 4360740

# datafile
datafile_id = datafiles[8]["dataFile"]["id"]
da_api = DataAccessApi(base_url)
resp = da_api.get_datafile(datafile_id)

# read the ingested, tab-separated file into a DataFrame
data = io.StringIO(str(resp.content, 'utf-8'))
us_states_cases = pd.read_csv(data, sep='\t') # the ingested version is a TSV, hence sep='\t'
print(us_states_cases.head(10)) # first 10 rows of the table
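
Indexing with datafiles[8] depends on the order of files in the dataset, which may change between versions. A minimal sketch of looking the file up by name instead: find_datafile_id is a hypothetical helper (not part of pyDataverse), and sample_files is a made-up stand-in for the list returned by resp.json()["data"]["latestVersion"]["files"], shaped like the entries used above.

```python
def find_datafile_id(files, filename):
    """Return the id of the first datafile whose filename matches, else None."""
    for entry in files:
        datafile = entry.get("dataFile", {})
        if datafile.get("filename") == filename:
            return datafile.get("id")
    return None


# Illustrative stand-in for the "files" list of a dataset version.
# The first entry is invented; the second reuses the id and name from this thread.
sample_files = [
    {"dataFile": {"id": 1, "filename": "readme.txt"}},
    {"dataFile": {"id": 4360740, "filename": "us_county_confirmed_cases.tab"}},
]

print(find_datafile_id(sample_files, "us_county_confirmed_cases.tab"))  # 4360740
```

With the real response, the call would be find_datafile_id(datafiles, "us_county_confirmed_cases.tab"), and the result can be passed to get_datafile.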

The problem was that downloading the datafile itself requires the DataAccessApi rather than the NativeApi. I also changed the dataFile key from file_name to filename.
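
For the second part of the question (the original upload rather than the ingested version), the Dataverse Data Access API accepts a format=original query parameter on the datafile endpoint. Whether get_datafile exposes this as a keyword argument depends on the installed pyDataverse version, so the sketch below only builds the HTTP URL. datafile_url is a hypothetical helper, not part of pyDataverse, and it assumes the file is publicly downloadable without an API token.

```python
from urllib.parse import urlencode


def datafile_url(base_url, file_id, original=False):
    """Build a Data Access API download URL for one datafile (hypothetical helper)."""
    url = f"{base_url.rstrip('/')}/api/access/datafile/{file_id}"
    if original:
        # Ask for the file as uploaded instead of the ingested tabular version.
        url += "?" + urlencode({"format": "original"})
    return url


url = datafile_url("https://dataverse.harvard.edu", 4360740, original=True)
print(url)
# https://dataverse.harvard.edu/api/access/datafile/4360740?format=original

# With network access, pandas can read the URL directly (not run here):
# import pandas as pd
# df = pd.read_csv(url)  # the original upload was a CSV, so the default sep="," applies
```

Leaving original=False gives the ingested version, which should be read with sep='\t' as in the code above.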

kuriwaki commented 3 years ago

Thanks! This example worked.