IQSS / dataverse

Open source research data repository software
http://dataverse.org
Other
878 stars 490 forks source link

Dataverse ingest error for variables with mix of labelled and unlabelled values #4676

Closed stevenmce closed 6 years ago

stevenmce commented 6 years ago

When ingesting SPSS or Stata data which includes variables that contains both labelled and unlabelled values, the ingest process assigns values with no label as N/A. For example, a 10 point scale (1 = Not at all, 10 = Very much, no value labels assigned for responses 2 to 9) would result in an ingested version which contains only the values 1, 10 and NA. This results in the UNF data being different between the original and ingested versions.

We have attempted to provide a demonstration of the issue with simulated data at: https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.5072/FK2/ZYGTWM

djbrooke commented 6 years ago

Thanks @stevenmce, we'll take a look at this.

landreev commented 6 years ago

To clarify, this is specifically about exporting ingested files as RData, correct? (it's not mentioned above, but looking at the linked sample dataset suggests that it is...)

Off the top of my head, there definitely is a known problem with calculating UNFs of categorical variables (or "factors" in the language of R), when converting datafiles between Stata/SPSS and R; it is caused by the fact that R handles its factors in a way that's fundamentally different from other formats. (see http://guides.dataverse.org/en/4.8.6/user/tabulardataingest/rdata.html#r-factors for some discussion of the controversy).

But this specific case - empty labels of distinct categories all becoming NA factor values - is unambiguously wrong; and we should come up with a way to handle it better.

We'll discuss this and follow up.

landreev commented 6 years ago

Just to reiterate, I am fairly positive that this is not an ingest problem, as the original description suggests. I still need to review the ingested files in your sample dataset more carefully (thanks btw, for providing all the examples and information) to confirm. But it looks to me like the ingest process does the right thing. For ex., it ingests the variable column v6b in the stata file as a numeric categorical vector; where the numeric values 1 and 5 have assigned labels; but the other values happen to have none. It's the conversion to R that seems to be the problem. It looks like we just never thought about this situation - what to do when there are missing labels like this. My first guess, we should just be using "2", "3", "4" etc. for the values that are missing the descriptive labels... But please let us investigate/think about this some more.

djbrooke commented 6 years ago

@stevenmce - we talked about this in Backlog Grooming today and we'll need some more information and investigation before we estimate. We'll have another opportunity to discuss next Wednesday.

pdurbin commented 6 years ago

@stevenmce emphasized the importance of this bug during last week's community meeting so I assigned it to myself and plan to bring it up during backlog grooming in the context of pull request #4708, which is currently in flight.

djbrooke commented 6 years ago

@pdurbin let's keep it separate if possible - that one is already big enough :)

@landreev, you had mentioned some further investigation above. Can you work with @pdurbin to make sure the efforts are coordinated and can we please try to get some specific questions to @stevenmce and the ADA team before we groom this?

landreev commented 6 years ago

@pdurbin @djbrooke I believe there are multiple reasons why this should be treated separately from the PR 4708. One is that the sample ingested Stata file they provide (https://demo.dataverse.org/file.xhtml?fileId=28929&version=RELEASED&version=.0) is an "old-school", pre-v.13 Stata. So it's not even handled by the ingest plugin @benjamin-martinez and @oscardssmith have been working on. They (the people who submitted the report) do mention that they haven't had a particularly good luck with Stata 13 - meaning, most of their Stata 13 files just don't ingest, period - and that most likely will be much improved by the fixes to the Stata 13+ ingest plugin. But that was not specific to the value label bug this issue is mainly about. But, most importantly - as I tried to explain in the comment above, this does not appear to be a Stata ingest; or ingest in general. Rather, it's a problem with converting the data to RData format, once ingested. We don't seem to be doing anything wrong with how we store these value labels in the database. What we are doing wrong, is we somehow fail to pass the list of these labels to R, when we export to R, in a way that does not confuse R into thinking that no label on the list = missing value.

landreev commented 6 years ago

The TL;DR version: This is an issue outside of the Stata 13 ingest. If we need to prioritize it, because it's important to a partner, let's do that.

landreev commented 6 years ago

(@pdurbin - thanks for bringing this up during standup! I did miss these last comments in this issue from yesterday...)

pdurbin commented 6 years ago

@benjamin-martinez thanks for investigating this bug with me today and thank you @landreev for the summary here and in person after tech hours. To re-iterate, the bug seems to on export to RData format.

By the way, Ben and I weren't sure what value Dataverse is adding by creating its own RData derivative of an original RData file but whatever. 😄

djbrooke commented 6 years ago

@landreev @pdurbin @benjamin-martinez

Do you feel that there's enough here to get an estimate on the fix for this and pull it into a sprint? Or do we need more info from ADA? I'd like to bring it into the sprint today if we have enough info.

pdurbin commented 6 years ago

@benjamin-martinez here is the code we were looking at:

src/main/webapp/file-download-button-fragment.xhtml

<p:commandLink rendered="#{!downloadPopupRequired}"
               process="@this"
               disabled="#{(fileMetadata.dataFile.ingestInProgress  or lockedFromDownload) ? 'disabled' : ''}" 
               actionListener="#{fileDownloadService.startFileDownload(guestbookResponse, fileMetadata, 'RData')}">
    #{bundle['file.downloadBtn.format.rdata']}
</p:commandLink>
<p:commandLink rendered="#{downloadPopupRequired}"
               process="@this"
               disabled="#{(fileMetadata.dataFile.ingestInProgress  or lockedFromDownload) ? 'disabled' : ''}" 
               action="#{guestbookResponseService.modifyDatafileAndFormat(guestbookResponse, fileMetadata, 'RData' )}"
               update="@widgetVar(downloadPopup)"
               oncomplete="PF('downloadPopup').show();handleResizeDialog('downloadPopup');">
    #{bundle['file.downloadBtn.format.rdata']}
</p:commandLink>

Those methods are in these "service beans":

Ultimately, the redirect sends the user to this API bean: src/main/java/edu/harvard/iq/dataverse/api/Access.java

http://guides.dataverse.org/en/4.9/api/dataaccess.html shows "formats" such as "original" and "RData".

Heres's a screenshot from Firefox:

screen shot 2018-06-20 at 4 29 11 pm

From what @landreev was saying at the meeting this afternoon, R code is generated on the fly by Dataverse and sent to Rserve, which is mentioned here: http://guides.dataverse.org/en/4.9/installation/r-rapache-tworavens.html

landreev commented 6 years ago

@benjamin-martinez @oscardssmith The class that talks to R in order to produce the R data frame (an .RData file) is RemoteDataFrameService.java. It uses a few other classes in edu/harvard/iq/dataverse/rserve, and some code in edu/harvard/iq/dataverse/dataaccess/DataConverter.java. There is also some fixed R code in the Dataverse (edu/harvard/iq/dataverse/rserve/scripts/dataverse_r_functions.R) that is passed to R in real time. The communication to R is done using Rserve protocol. The Rserve server can run on your local systems, or you can use remote Rserve running on one of our test boxes - for example, on dvn-build.hmdc.harvard.edu.

It's based largely on some much older code written by another developer during an earlier stage of the project. Looking at it, it's pretty complicated stuff. I'm wondering if this issue would be better suited for somebody from the full-time developers team. So do not hesitate to ask for help and; or even jump on something else if you need to.

The map of all the categorical values for the datafile is passed to the RemoteDataFrameService as sro.getValueTable(). Then we turn it into a map on the R side called VALTABLE; and that's what R will use to change the vectors made from the values in the tab files into "factors" - R's version of categorical variables.

One solution for the issue at hand is to add some R code instructing it to assume that the VALTABLE may not contain the labels for some values found in the tabular vector. And not to assume N/A, but to just use the string value of the element as the label (i.e., if the value in the numeric vector is 1, but there is no label in VALTABLE corresponding to 1, juse use "1" for the label).

There is an alternative solution though, that would not require any extra R code, but can be done solely on the Dataverse side and in Java: When we create the valueTable in sro (RJobRequest.java) - this happens in DataConverter.getValueTableForRequestedVariables(...)) - we can, in addition to going through all the category labels in DataVariable.getCategories(), subset the vector for this variable from the tab file, got through the values, and if necessary, create extra labels, for all the unique values that are not in the dataVariable.getCategories()... (Oscar, you should have a rough idea of how to subset individual vectors, from looking at the code that calculates UNFs and summary stats...)

The drawbacks of this method - having to read and subset the tabular file on the application side. (this of course would have to be done for categorical variables only, not for every vector - but still!). And R will need to read the tabular file anyway - so that would be the reason to solve it by adding something on the R side... But that comes with having to figure out what goes on in that R code.

oscardssmith commented 6 years ago

@stevenmce given that in R categorical columns can not have data not assigned to a category, what path do you want us to take her?

  1. For each unlabeled value, produce a new label for that value which is str(value)
  2. Prevent these arguably incomplete files from being downloaded in R
  3. Prevent these arguably incomplete files from being ingested
pdurbin commented 6 years ago

@izahn suggests using a labeller built into haven: https://cran.r-project.org/web/packages/haven/vignettes/semantics.html

izahn commented 6 years ago

See https://haven.tidyverse.org/articles/semantics.html for tools designed to preserve metadata from other statistics packages in R.

pdurbin commented 6 years ago

I'm not sure if this will be helpful or not but I found https://cran.r-project.org/package=DDIwR and the PDF says, "This package provides various functions to read DDI based metadata documentation, and write dedicated setup files for R, SPSS, Stata and SAS to read an associated .csv file containing the raw data, apply labels for variables and values and also deal with the treatment of missing values."

djbrooke commented 6 years ago

There was a suggestion at standup to wait until the Stata update issue (#2301) was merged to develop (and this code is updated from develop) to CR this, as the diff would be cleaner. This is OK with me, but @oscardssmith and @scolapasta were going to have a quick discussion to see if there was some way that made sense to move it forward before then.

pdurbin commented 6 years ago

the diff would be cleaner

For the clean version of the diff: https://github.com/IQSS/dataverse/compare/2301-stata...4676-mixed-labels-in-R

djbrooke commented 6 years ago

Thanks @pdurbin.

@oscardssmith and @scolapasta I'm going to remove you both since this is what we were after - a way to review the code without a confusing diff. Anyone with the cycles to do so should review this. If I missed something, reassign yourselves.

matthew-a-dunlap commented 6 years ago

I took myself off review on this issue, as I don't think my review will mean much without a lot of getting up to speed and I really shouldn't have taken it on. This is a good one for @landreev when he is back.

landreev commented 6 years ago

What is the deal with the "isLabled" filed in the DataVariable entity? If I build from this branch, but try to deploy using an existing database, ingest stops working - because my "datavariable" table does not have the column "islabled"; so our normal procedure is to provide a db upgrade script... I've checked it in - scripts/database/upgrades/upgrade_v4.9.1_to_v4.9.2.sql:

ALTER TABLE datavariable ADD COLUMN ISLABLED BOOLEAN;

But, could you please add a comment explaining what this field is actually for... Because, on closer inspection it doesn't look like it's being used for any practical purposes (and we appear to be setting it to TRUE for every variable, in the old and new Stata plugins at least)... I may be missing something of course.

landreev commented 6 years ago

I like the idea of using this R package (haven) for directly converting the saved stata and spss files into RData frames. But there are some issues that need to be addressed. But there are some issues that need to be addressed. First of all, this package needs to be installed on the R server that the dataverse instance is configured to talk to. (It's not one of the standard packages that come with R preinstalled). For the existing Dataverse installations (such as our prod. one), we'll need to add something to the release notes instructing them to install the package. For new installations, we'll need to add haven to the list of required third party R packages that they need to install in order to run RData ingests, conversions and (optionally) TwoRavens. This overlaps with the issue #4753 (and other recent issues and PRs that dealt with R 3.5). So this can be addressed there, or maybe it deserves an issue of its own - but it needs to be addressed.

oscardssmith commented 6 years ago

The idea behind this is (if I've implemented it right) is that with this change, we are considering stata and spss lacked data as a distinct concept from R factors. that said, neither it, nor iscategorical are ever used with either of these types, so I'd be happy to nix this change.

landreev commented 6 years ago

EDIT: Retracting the comment below - I simply misunderstood what was going on in that code (see my "if I'm reading the code correctly" there). This is NOT an issue - the old pathway is preserved, and is actually used when the preserved original happen not to be SPSS or Stata.

Another issues is with the tab files that are not ingested from either Stata or SPSS. If I'm reading the code correctly, it appears that if the saved original is a dta, sav or por file, we transfer the file to rserve and call a direct haven read_* method on it; but it will just fail is the original format is Excel or csv. It seems like for these cases it would be wise to keep the old way of generating the RData file - by sending the tab file and the variable metadata to rserve. The variable metadata info for tab files ingested from csv and excel should be relatively minimal (for example, there should never be categorical variable in csv and excel ingests!) - so the limitations in that old R code should not be relevant (correct?)

landreev commented 6 years ago

There is also the case specific to our production (and maybe to one or two particularly old external installations, such as Odum): we have a limited number of ingested tabular files for which we don't have the saved originals/don't have the original format preserved at all. The files in question were grandfathered from something called "VDC" - the dinosaur ancestor of the Dataverse application. Some of these old files happen to be very very important (for example, many data files from Gary King's datasets are part of this subset). So we don't want these R conversions to just start failing for these files. (granted, for most or all of Gary's files, the RData files are already cached; so one way to address this would be to keep them in place, and only wipe the cached RDatas for the files that were specifically affected by the missing label issue. Also, we were recently discussing working with Murray project to restore the original for these old tabular files. So this may get resolved outside of this issue - but still, this is an issue).

landreev commented 6 years ago

A couple more things: Generally, it makes perfect sense, to want to use a piece of software written in R, by R programmers, and specifically for the purpose of converting foreign formats to RData. Rather than writing and maintaining code in R ourselves, which is not our strongest point. Whoever, there's one potential issue with this approach: the above is only true for the tabular ingest as it is implemented now. That is, when we convert proprietary formats (stata, spss, etc.) into our internal format ONCE, with no way to modify it later on. But we are already working on providing APIs for modifying existing tabular metadata. So this creates a possibility of a use case where the owner has improved the descriptive tabular metadata - with better variable names, labels, etc. - which is reflected in the DDI, json exports, and/or in TwoRavens etc. Yet if they try “download as RData”, users will see the original, “poor” metadata from the saved original.

I don’t know if we want/need to discuss this further. This PR definitely addresses the current issue; and this will not be a problem for the foreseeable future. But we need to assume that we will likely have to go back to learning how to properly convert our (Dataverse) tabular metadata to R data frames from scratch in the future.

In this PR however, we do need it explained somewhere in the documentation, that this is the way we convert to RData - from the saved original, using a third party R library.

landreev commented 6 years ago

Re: documentation: there's a section in the API guide, "Basic File Access" that mentions download as R: RData | Tabular data as an R Data frame (generated; unless the “original” file was in R); this is probably a good place to explain how these RData files are generated.

landreev commented 6 years ago

I requested some changes in the PR. (want to address/rename the new "isLabled" thing in DataVariable)

pdurbin commented 6 years ago

I discussed this issue with @oscardssmith and @landreev this morning and have a pretty good sense of the work that's left, messing with the new unused boolean. There's a file floating around called "BeerTastingTestData.dta" that can be used for testing. One needs to install the haven R package or use an Rserve with it (dvn-build soon, probably).

pdurbin commented 6 years ago

I made the requested changes in 03c8051 and cleared them with @oscardssmith and then @landreev . I'm attaching the file I used to ensure that the new "factor" column is being populated with a "true" when the RData file contains factors: test_factor.RData.txt

Back to code review.

landreev commented 6 years ago

@kcondon For QA, please use Rserve on dvn-build.hmdc.harvard.edu.