IQSS / dataverse

Open source research data repository software
http://dataverse.org

Ingest: Update Stata 13 ("New Stata") ingest plugin; add support for v.14 and v.15; fix bugs in handling of v.13 format #2301

Closed kcondon closed 6 years ago

kcondon commented 9 years ago

This issue is for updating the ingest plugin for the 2nd-gen. Stata format ("Stata 13" or "dta 117"), DTA117FileReader.java. See below for a primer on the Stata formats and ingest plugins, copy-and-pasted from the "spike" issue #4408. The tasks involved are 1) adding support for versions 14 and 15 (both are minor variants of this 2nd-gen. format); and 2) fixing bugs in the handling of the v.13 format proper that result in some dta 117 files not being ingested as tabular data. 1) can be achieved by consulting the format documentation on the Stata site and modifying the plugin accordingly; the updated plugin can then be tested on all the uningested dta 118 and dta 119 files currently in our prod. holdings. For 2), we can consult all the IngestReport entries saved since 4.0, find all the Stata 13/dta 117 ingest failures, and debug the plugin until they pass ingest (or until we verify that the failing files are in some way corrupt and are indeed uningestable).

From #4408:

We currently maintain 2 Stata ingest plugins: one for Stata 13 (Stata's internal format "dta 117") and one for the older versions. The v.13 format was re-engineered completely from scratch. It's very different from the older formats, which is why it warranted a new, separately maintained piece of ingest code.

Having quickly reviewed the format documentation, the good news is that the newer formats appear to be merely extensions of Stata 13, not new developments. So we don't seem to need a new ingest plugin; rather, we should be able to simply teach the current "new Stata" ingest to understand the latest flavors of the format.

There have been two format variants since v.13:

Stata 14 ("dta 118")
Stata 15 ("dta 119")

A very large portion of the v.14 format specification document appears to be 1:1 identical to the v.13 spec. I'm seeing some minor differences (for example, in the later version the number of observations is encoded as an 8-byte integer, where in v.13 it was 4 bytes). It'll take more careful work to identify all such differences, but it seems manageable.
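To illustrate the kind of difference involved, here is a minimal sketch (not the actual DTA117FileReader code) of reading the observation-count field, assuming little-endian byte order as declared in the file header:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: per the Stata format docs, the field holding the
// number of observations is 4 bytes in dta 117 (Stata 13) but 8 bytes in
// dta 118/119 (Stata 14/15). Little-endian order is assumed here; real
// dta files declare their byte order in the <byteorder> header field.
public class StataObsCount {
    public static long readObsCount(byte[] field, int dtaVersion) {
        ByteBuffer buf = ByteBuffer.wrap(field);
        buf.order(ByteOrder.LITTLE_ENDIAN);
        if (dtaVersion >= 118) {
            return buf.getLong();              // 8-byte count in dta 118/119
        }
        return buf.getInt() & 0xFFFFFFFFL;     // 4-byte count in dta 117
    }
}
```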

The v.15 format is explicitly advertised as exactly the same as v.14, with a single exception: the newer format allows for more than 32K variables.

The conclusion: it appears possible to add support for v.14 and v.15 by extending and improving the existing code. We should definitely add support for both at the same time (since v.15 is a minor extension of v.14).

tlchristian commented 8 years ago

Just wanted to check in on the status of this issue. Our concern is that without Stata 14 support, Dataverse isn't generating the .tab derivative for preservation purposes. Once Stata 14 is supported, will the system be able to create the .tab subsettable file?

djbrooke commented 6 years ago

@sbarbosadataverse - you said you wanted to discuss?

donsizemore commented 6 years ago

At the risk of butting in... would you want to add ingest support for Stata 15 to this issue as well (if it isn't supported already)?

djbrooke commented 6 years ago

Good idea. Maybe? @sbarbosadataverse and I will discuss early next week (not sure what she has in mind) and follow up here. Have a good Thanksgiving, @donsizemore!

pdurbin commented 6 years ago

We discussed this issue in our backlog grooming meeting today and created #4408 as a spike for more investigation.

@tlchristian I'm not sure I understand your question. Support for newer versions of Stata would work the same as the versions Dataverse supports now, I assume.

tlchristian commented 6 years ago

@pdurbin I think I wanted to be sure that newer Stata versions would work the same way in Dataverse as older versions do. As of now, we've had issues with Dataverse generating the .tab derivative, and we want to be sure that still happens for Stata 14 and future versions.

landreev commented 6 years ago

Modified the description at the top of the issue, added some info.

landreev commented 6 years ago

Links to the format documentation:

v.13 ("dta 117"): https://www.stata.com/help.cgi?dta_117
v.14 ("dta 118"): https://www.stata.com/help.cgi?dta
v.15 ("dta 119"): https://www.stata.com/help.cgi?dta_119

pdurbin commented 6 years ago

In 079a59a I pushed a new DTA118FileReader which is mostly a copy and paste of the DTA117FileReader class with tweaks here and there for the expected number of bytes to read for the newer 118 format (Stata 14). Thank you, @landreev for the links to the format descriptions above!

One particularly tricky issue has been the offset in the readValueLabels method. At least for the file I'm testing, the offset is not zero for one of the variables, so I put in a terrible hack specific to that file for now:

if ("matching".equals(label_table_name)) {
    // temporary hack: hardcoded label offsets for this one test file
    if (i == 0) {
        label_offset = 0;
        label_end = 12;
    } else if (i == 1) {
        label_offset = 12;
        label_end = 25;
    }
}

Obviously, this needs to be removed but with this hack in place and the various other changes in that commit, I'm able to get a Stata 14 file to ingest. The file I'm testing with is completely arbitrary. It's just one I found in Harvard Dataverse when I searched from "Stata 14": https://dataverse.harvard.edu/file.xhtml?fileId=3140457

Here's how that Stata 14 file looks at https://dev1.dataverse.org/dataset.xhtml?persistentId=doi:10.5072/FK2/LBOJTF

[screenshot: the ingested Stata 14 file displayed as tabular data]

I'm now testing another arbitrary Stata 14 file from https://dataverse.harvard.edu/file.xhtml?fileId=2775556

@tlchristian and others, do you have any particular Stata 14 or 15 files you want me to test? If they're public, please link to them here. Thanks!

matthew-a-dunlap commented 6 years ago

I've started digging into the strL (long string) support in our parser. It seems the format has changed a fair bit between 13 and 14. I started comparing the specs in the two files, and the 14 spec is over twice as long as the 13 spec. I'll keep reading and continue on the actual parsing tomorrow.
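One concrete 13-vs-14 difference, per the public Stata specs: an strL cell in the data section is an 8-byte (v, o) reference into the <strls> section, packed as v:4 bytes, o:4 bytes in dta 117 but as v:2 bytes, o:6 bytes in dta 118. A sketch of decoding that reference (this is an illustration, not the plugin's actual code; little-endian order is assumed):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of decoding an strL (v, o) reference, where v is the
// variable index and o the observation index. The packing changed between
// dta 117 (4+4 bytes) and dta 118 (2+6 bytes).
public class StrlRef {
    public final long v;
    public final long o;

    private StrlRef(long v, long o) { this.v = v; this.o = o; }

    public static StrlRef decode(byte[] eightBytes, int dtaVersion) {
        ByteBuffer buf = ByteBuffer.wrap(eightBytes).order(ByteOrder.LITTLE_ENDIAN);
        if (dtaVersion == 117) {
            long v = buf.getInt() & 0xFFFFFFFFL;
            long o = buf.getInt() & 0xFFFFFFFFL;
            return new StrlRef(v, o);
        }
        // dta 118/119: v is 2 bytes, o is the remaining 6 bytes
        long v = buf.getShort() & 0xFFFFL;
        long o = 0;
        for (int i = 0; i < 6; i++) {
            o |= (eightBytes[2 + i] & 0xFFL) << (8 * i);
        }
        return new StrlRef(v, o);
    }
}
```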

FWIW, at the start of our Stata 13 STRL parser method, there is a comment stating

    /*
     * STRLs:
     * (simply skipping these, for now)
     */

I'm pretty sure this is a stale comment and that the parser is actually doing this processing in full. Something to dig into more tomorrow.

landreev commented 6 years ago

@matthew-a-dunlap Looks like a stale comment, yes. I probably forgot to remove it once the STRL-reading code was created.

pdurbin commented 6 years ago

https://services.dataverse.harvard.edu/miniverse/qc/bad-ingest-list via #3675 is turning out to be a good place to look for files that failed ingest. Thanks, @raprasad! Here's a screenshot:

[screenshot: the bad-ingest file list]

landreev commented 6 years ago

Please let me know if you want me to batch-select all the uningested Stata 13 files, and, separately, the 14-15 files in prod., and upload them somewhere local for you. Would that help at all?

Aside from that, it looks like there's a large enough crew working on this issue - so I may focus on something else instead while I have some free cycles. But, once again, do let me know if you have any specific questions.

matthew-a-dunlap commented 6 years ago

@landreev If it's not a lot of work, that would be pretty great; it would help Ben and the rest of us test the broken files.

I think at this point Phil and I are wrapping our heads around this, and there are enough of us. Now that I have a grip on the format, I think strLs work and it was just some minor byte-count tweaks. We may need some help, though, as we try to handle some of the outstanding failed cases.

landreev commented 6 years ago

And I'm still willing to look into that specific issue (the "negative offset" in the value labels; apparently that's a problem in v.13 as well?).

OK, I just took a quick look at the "value labels" section of the Stata 13 doc (and it does appear to be the same in 14-15 too, right?), and I honestly don't remember reading this back when I was implementing the v.13 plugin:

    It is not the out-of-order values of off[] that cause problems; it is
    out-of-order values of val[].  In terms of table construction, we find it
    easier to keep the table sorted as it grows.  This way one can use a
    binary search routine to find the appropriate position in val[] quickly.

I'm wondering if this wasn't advertised originally?

pdurbin commented 6 years ago

@landreev @matthew-a-dunlap thanks for the updates, and thanks to @benjamin-martinez for being willing to jump in. We need to converge on some sort of definition of done before we move pull request #4708 to code review and QA. I'm adding a to-do list below, but let's talk about it and edit it as needed.

To do list:

landreev commented 6 years ago

OK, I collected a whole bunch of uningested v.13 and v.14 Stata files on dvn-build.hmdc.harvard.edu, in /usr/local/dvn-admin/stata (stata-13 and stata-14 subdirs). There are 150 v.13 files and 900+ v.14 files, 1GB+ total. This is NOT all the uningested Stata files - just the smaller ones, to keep the total size manageable. There are a few hundred more files, but the total size quickly rises to tens of gigs.

landreev commented 6 years ago

(let me know if you want me to put this stuff up somewhere http-accessible)

landreev commented 6 years ago

Reading this part again:

    In terms of table construction, we find it
    easier to keep the table sorted as it grows.  This way one can use a
    binary search routine to find the appropriate position in val[] quickly.

I don't think this is something we need to worry about (especially the binary search part). They appear to be talking about maintaining a large Stata file that's growing in real time; with the user modifying and/or adding labels. And we are only dealing with this file as a static object, with the tables already finalized and stored for us; so we only need to read them once, and we can do it sequentially.

But there's obviously some trickiness with the offsets, etc., some weird cases that the original v13 plugin did not implement. So I'd like to help figure that part out.

landreev commented 6 years ago

The controversy with the value label offsets has a simple explanation. The current implementation assumes that the table of offsets is sorted. It looks like it's sorted most of the times, but it's not guaranteed to be. So, for example, in this file (15ad2b227a2-8c28fdf17537.dta in the test batch) this variable has 3 value labels:

Not sure
Yes, mistake
No, not mistake

but the offsets appear to be in the following order: 22, 0, 9. So the solution is to maintain a second, sorted copy of the table, and to indeed perform a binary search on it, in order to determine where the current label ends. Will check in a fix shortly.
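The approach described above can be sketched as follows (a minimal illustration, not the committed fix; it assumes the offsets in off[] are distinct, as each label starts at a different position in the text blob):

```java
import java.util.Arrays;

// Sketch of the sorted-copy-plus-binary-search idea: the label starting at
// off[i] ends where the next-larger offset begins, or at the end of the
// label-text blob if off[i] is the largest offset.
public class LabelOffsets {
    public static int labelEnd(int[] off, int i, int txtLen) {
        int[] sorted = off.clone();
        Arrays.sort(sorted);
        int pos = Arrays.binarySearch(sorted, off[i]);
        return (pos + 1 < sorted.length) ? sorted[pos + 1] : txtLen;
    }
}
```

With the example from this file (labels "Not sure", "Yes, mistake", "No, not mistake" and offsets 22, 0, 9), each label's end is found correctly even though the table is unsorted.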

This fix should cover at least some of the "DataReader.readBytes called to read zero or negative number of bytes." failures in the spreadsheet.

wbuchanan commented 6 years ago

I could be wrong, but aren’t the labels terminated with a binary zero? If so, why not just stream the bytes until the binary zero is encountered and then move to the next offset?

landreev commented 6 years ago

@wbuchanan I actually thought about mentioning that as a possible solution. But then I started thinking: are we really guaranteed that there will never be a zero byte in the middle of a value label? And sorting a table, then running a quick search on it (and only doing that if we have to - only if the supplied table is not sorted already), really takes only 5 extra lines of code.

wbuchanan commented 6 years ago

The string "0" could most definitely occur in a value label, but I would seriously doubt anyone would go through the trouble of figuring out how to code a binary 0 into a value label. Binary zeros are also how Stata stores string data and strLs in the files.

landreev commented 6 years ago

No way to copy-and-paste something with a binary zero into a label in Stata? You are most probably right. Although I've already written a sorted-list solution, and relying on zero bytes would not be insanely simpler. If you really want to be able to "stream" (as in, read one label at a time), then you need to run a reverse search to determine which defined value goes with the label at the current offset. So a better solution would be to read the remaining <lbl> section bytes into memory, then go through the offsets and look for the next zero byte - same amount of code, I'm guessing.
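For comparison, the zero-byte approach discussed here could look something like this (a hypothetical sketch, assuming no label contains an embedded binary zero, as @wbuchanan suggests):

```java
import java.nio.charset.StandardCharsets;

// Sketch of the alternative: read the label-text blob into memory once,
// then for each offset scan forward to the next zero byte to find where
// that label ends. UTF-8 decoding is an assumption here.
public class LabelScan {
    public static String labelAt(byte[] blob, int offset) {
        int end = offset;
        while (end < blob.length && blob[end] != 0) {
            end++;
        }
        return new String(blob, offset, end - offset, StandardCharsets.UTF_8);
    }
}
```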

But, your solution doesn't need an extra array.

landreev commented 6 years ago

OK, I checked in a quick fix for the unsorted value-label offset table. It could be implemented differently, as suggested above, but it works. I'll do more testing to make sure the labels are actually ingested correctly. (And when we are done, we will need to rerun the updated v.13 plugin on all the v.13 dta files in prod. that had been successfully ingested too, to confirm that it still produces the same results.)

For now I'm back to working on the BARI project though.

landreev commented 6 years ago

This is seriously embarrassing stuff, though - that we had this bug in the reader for so long. My only consolation is that it was actually blowing up/giving up when encountering unsorted offset tables; ingesting the file but mixing up the value labels out of order would be way worse. Same for the buffer-edge bug in the checkTag() method. Just reading the comments in the class, I do remember now that my assumption was that it was essentially a beta, and that we would go back to working on the plugin in a few months max. It's kind of sad we didn't get to it until now.

matthew-a-dunlap commented 6 years ago

Running @landreev's fix on the errored Stata 13 files shows all the "DataReader.readBytes called to read zero or negative number of bytes" files now passing! See the updated info in the drive doc.

matthew-a-dunlap commented 6 years ago

I ported the unsorted-values fix to Stata 14 and ran our bulk tests (before and after). More files now work, but a few are now broken, trying to call readBytes on 0 bytes. Not sure why we aren't seeing this in 13. I'll investigate more.

matthew-a-dunlap commented 6 years ago

I pushed a small fix for when we were attempting to read zero bytes from the value_label_table and then triggering our own read_bytes error handling. The value_label_table can be 0 bytes long in both Stata 13 and 14, though I was only able to find examples of it in 14. The fix was pushed to both parsers.
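The shape of that guard can be sketched like this (a hypothetical stand-in, not the actual plugin code; readTable here plays the role of the plugin's DataReader.readBytes call site):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch: a value_label_table section can legitimately be zero bytes long,
// so return an empty buffer instead of asking the reader for zero bytes
// and tripping its own "zero or negative number of bytes" error handling.
public class ZeroLenGuard {
    public static byte[] readTable(InputStream in, int tableLength) throws IOException {
        if (tableLength <= 0) {
            return new byte[0]; // empty table: nothing to read
        }
        byte[] buf = new byte[tableLength];
        int read = 0;
        while (read < tableLength) {
            int n = in.read(buf, read, tableLength - read);
            if (n < 0) throw new IOException("unexpected end of stream");
            read += n;
        }
        return buf;
    }
}
```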

pdurbin commented 6 years ago

@oscardssmith and @benjamin-martinez have been working away on pull request #4708 and the checklist at https://github.com/IQSS/dataverse/issues/2301#issuecomment-393909691, and we should continue to check off items on the list there. But one thing I emphasized to them is that we need to switch over from testing with JUnit to testing in the UI, because there is some UI-specific code that needs to be exercised.

landreev commented 6 years ago

I pushed my commit, removing and/or rewriting some old comments and many TODOs that are no longer relevant. I left a couple of TODOs in place for the future; one of them I will double-check on now - the missing values for strings vs. empty strings. Aside from that, there's one thing I have a question about. @oscardssmith, could you please check on the following, at line 1173:

category_values[i] = reader.readUInt();

The use of readUInt() here suggests we can't read a negative category value. Could you try to create a Stata 13+ file with a numeric vector containing some negative values, and then assign value labels to them? Is that possible? And are we going to read the values incorrectly on ingest?

landreev commented 6 years ago

@kcondon This is ready for QA. I'll point you to a whole pile of Stata 14 files (that were uploaded but not ingested in prod.) and a smaller pile of Stata 13 files that the previous version of the plugin failed to ingest.