IQSS / dataverse

Open source research data repository software
http://dataverse.org

Feature Request/Idea: Subsetting by variable values #8371

Open sbaltzmit opened 2 years ago

sbaltzmit commented 2 years ago

Overview of the Feature Request: It would be very useful for users to be able to subset datasets by the values of specific variables before downloading data, as for example described in the TwoRavens manual here (under "Subset"): https://guides.dataverse.org/en/5.6/user/data-exploration/tworavens.html

What kind of user is the feature intended for? Any user downloading data.

What inspired the request? I help administrate a fairly large repository of election result data (https://dataverse.harvard.edu/dataverse/medsl). Precinct-level results of American local, regional, and federal elections can easily reach 3 GB of text in one file, which makes the data very slow to access (a user who is looking for the results of just one statewide election still needs to download a multi-GB file with results from every race in the country, load that into software for subsetting it, and then perform the subsetting). Users can reduce the download burden by selecting specific variables, but they still have to download all of the rows. We have thought about splitting our data up and providing a variety of datasets for people to choose from, but it is very hard to anticipate exactly what data people might be after. We could for example upload one file for each state, but then users who want (say) the results of a presidential election would need to download all 50. Or we could supply each major race individually, but then users who want (say) the results of a given year's senate elections would need to download every file. I imagine there are analogous problems for large datasets about other subjects. It would be great if people could make those decisions for themselves while downloading the data.

What existing behavior do you want changed? Only added functionality; the existing subset tool is very good.

Any brand new behavior do you want to add to Dataverse? The option to subset by variable value when downloading data.

Any related open or closed issues to this feature request? Not to my knowledge.

pdurbin commented 2 years ago

@sbaltzmit hi! I'm confused by "the existing subset tool is very good." Are you talking about Two Ravens? Data Explorer?

Dataverse supports subsetting via the API (see "subset" at https://guides.dataverse.org/en/5.9/api/dataaccess.html#parameters ), but we removed the built-in web interface for it in pull request #6098 because it was broken and because a newer alternative, an external tool called Data Explorer, supports subsetting (using the API). I'll give an example below with your file at https://dataverse.harvard.edu/file.xhtml?version=5.0&fileId=4300300

launch Data Explorer

[Screenshot, 2022-02-04 11:16:40 AM]

pick a couple variables for the subset (state and year)

[Screenshot, 2022-02-04 11:17:11 AM]

peek at the downloaded subset

$ head ~/Downloads/1976-2020-senate-subset.tab 
state   year
"ARIZONA"   1976
"ARIZONA"   1976
"ARIZONA"   1976
"ARIZONA"   1976
"ARIZONA"   1976
"CALIFORNIA"    1976
"CALIFORNIA"    1976
"CALIFORNIA"    1976
"CALIFORNIA"    1976

Will Data Explorer meet your needs? If not, are you able to use the API? Thanks for using Dataverse!
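
For the API route, here is a minimal sketch of building a column-subset request URL, assuming the `format=subset` and `variables` parameters described in the Data Access API guide linked above. The helper name `subset_url` is made up for illustration, and the variable IDs `123` and `127` are placeholders, not the real database IDs of "state" and "year" in this file.

```python
from urllib.parse import urlencode


def subset_url(base_url, file_id, variable_ids):
    """Build a Data Access API URL that requests only the given variables (columns).

    Assumption: variables are identified by numeric database IDs and passed
    as a comma-separated list, as described in the Data Access API guide.
    """
    query = urlencode(
        {
            "format": "subset",
            "variables": ",".join(str(v) for v in variable_ids),
        },
        safe=",",  # keep the commas readable instead of percent-encoding them
    )
    return f"{base_url}/api/access/datafile/{file_id}?{query}"


# 4300300 is the fileId from the file page URL above; 123 and 127 are placeholder IDs.
print(subset_url("https://dataverse.harvard.edu", 4300300, [123, 127]))
```

Note that this only selects columns; every row for those columns still comes back.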

sbaltzmit commented 2 years ago

Thanks for the reply.

It appears that what you are downloading in those screenshots is every value of the variables "state" and "year". The feature I was requesting is not to download just a few variables, but rather to download all the rows that correspond to certain variable values.

So, you got two columns: every value of "state", and every value of "year". That's subsetting the variables that you're downloading, but not subsetting by values of those variables. Very commonly our users will want to download, say, every row of the dataset corresponding to the state value "Arizona", and that does not seem to be supported by Dataverse.
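
To make the distinction concrete, here is a rough sketch of the row-wise filtering that users currently have to do themselves after downloading the whole file. It is only an illustration of the requested behavior, not anything Dataverse provides: the function name `rows_matching` is made up, and the inline sample reuses a few rows from the output shown earlier in this thread.

```python
import csv
import io


def rows_matching(lines, column, value):
    """Yield the header row, then only the rows whose `column` equals `value`.

    `lines` is any iterable of tab-delimited text lines (for example an open
    file), so the data is streamed rather than loaded into memory at once.
    """
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    idx = header.index(column)
    yield header
    for row in reader:
        if row[idx] == value:
            yield row


# Small inline sample in the same shape as the downloaded .tab file.
sample = io.StringIO(
    'state\tyear\n'
    '"ARIZONA"\t1976\n'
    '"ARIZONA"\t1976\n'
    '"CALIFORNIA"\t1976\n'
)

for row in rows_matching(sample, "state", "ARIZONA"):
    print("\t".join(row))
```

The csv reader strips the surrounding double quotes, so the comparison is against `ARIZONA` rather than `"ARIZONA"`. The point of the feature request is that this filter would run on the server, before the multi-GB download, instead of on the user's machine afterwards.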

pdurbin commented 2 years ago

@sbaltzmit thanks, makes sense, especially the "Arizona" example.

Would it be of value to you to have an API for subsetting by value? Or would you and your users also need a GUI?

sbaltzmit commented 2 years ago

If it were doable even just in the API, that would be fantastic. I'm sure a GUI would be that much more usable for some of our users, but if there were a way to do it in the API, then we could help them understand how to use it. Thanks again for the responses and assistance.

pdurbin commented 2 years ago

@sbaltzmit sure. Are you interested in making a pull request?