complete typedbytes implementation

piccolbo commented 12 years ago

Currently type codes 4, 5, 9 and 10 are left unimplemented, for lack of time and because they don't have an obvious match in R. 4 corresponds to long integers, 5 to single-precision floats, 9 is a different type of vector that is terminated by a 255 marker instead of having a length header, and 10 is a map. I think 4 could be implemented using package int64; 5 could be treated like numeric, wasting some space; 9 is going to return a list, so it's just a matter of parsing it differently. 10 is the most interesting. Given its very general definition (like typedbytes vectors and lists it is polymorphic: each key and value carry their own type information), I think there is no option but a list of lists (pairs), maybe using the keyval pair as a model. Named vectors are also poorly supported right now, as names are just dropped; 10 could be used to support that case better, but that would conflict with its more general definition.
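For readers who want to see the wire format being discussed, here is a minimal sketch of a reader for the type codes named above. It is written in Python only to keep it self-contained and is not rmr's implementation; the names `read_value` and `_read_body` are made up for illustration, and only a subset of codes is handled. It assumes the layout of Hadoop streaming typedbytes: a one-byte type code, big-endian payloads, and 4-byte length headers.

```python
import io
import struct

MARKER = 255  # terminates a type-9 list


def _read_int32(stream):
    return struct.unpack(">i", stream.read(4))[0]


def _read_body(code, stream):
    """Decode the payload that follows an already-consumed type code."""
    if code == 3:    # 32-bit integer
        return _read_int32(stream)
    if code == 4:    # 64-bit long integer
        return struct.unpack(">q", stream.read(8))[0]
    if code == 5:    # single-precision float, widened to a double on read
        return struct.unpack(">f", stream.read(4))[0]
    if code == 6:    # double
        return struct.unpack(">d", stream.read(8))[0]
    if code == 7:    # string: length header, then UTF-8 bytes
        return stream.read(_read_int32(stream)).decode("utf-8")
    if code == 8:    # vector: length header, then that many typed values
        return [read_value(stream) for _ in range(_read_int32(stream))]
    if code == 9:    # list: typed values until the 255 marker, no length header
        items = []
        while True:
            nxt = stream.read(1)[0]
            if nxt == MARKER:
                return items
            items.append(_read_body(nxt, stream))
    if code == 10:   # map: length header, then that many (key, value) pairs
        return [(read_value(stream), read_value(stream))
                for _ in range(_read_int32(stream))]
    raise ValueError(f"type code {code} not handled in this sketch")


def read_value(stream):
    """Read one typed value: a one-byte type code followed by its payload."""
    return _read_body(stream.read(1)[0], stream)


# A type-9 list holding the int 1 and the string "a", closed by the marker.
sample = bytes.fromhex("090300000001070000000161ff")
print(read_value(io.BytesIO(sample)))  # [1, 'a']
```

The map branch returns a list of pairs, which is the "list of lists" shape proposed above; nothing in the format forces that choice.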

piccolbo commented 12 years ago

To be able to read the Google Ngram dataset as available on Amazon's S3 I had to implement 4 (long ints). They are converted to numeric, following Rcpp behavior. Hopefully this will change to int64 in future releases (this feature was pulled from a previous version because of backward compatibility problems). Please do not rely on this conversion being the final behavior.
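The caveat matters because a double has a 53-bit significand, so 64-bit longs above 2^53 cannot all be represented. A small sketch of the effect, in Python rather than R and using only the standard `struct` module (the variable names are illustrative):

```python
import struct

# An 8-byte big-endian payload, as it would follow type code 4.
raw = struct.pack(">q", 2**53 + 1)

(as_long,) = struct.unpack(">q", raw)
as_double = float(as_long)  # what a conversion to a double-precision numeric amounts to

print(as_long)         # 9007199254740993
print(int(as_double))  # 9007199254740992
```

Values up to 2^53 in magnitude survive the conversion exactly, so counts of ordinary size are unaffected.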

jamiefolson commented 12 years ago

It would be great if maps could be implemented as named lists when the keys are strings. On a related note, it'd be great if names(list) could be preserved without using the 'native' format.

piccolbo commented 12 years ago

What do we do when the keys are not strings? And isn't the second question the same as the first or am I missing something?

jamiefolson commented 12 years ago

1) Yeah, that's a problem. I don't know if you could do a named list when the keys are strings and a list of list(key, value) like you were saying otherwise. I'm not sure if that's too confusing, it just seems really natural to have lists be maps. I just like the thought of converting to the R structures closest to those you'd want to have. Think about retrieving a value for a key. In a list of lists, that's kind of awkward.

Actually, I think the best way to deal with maps with non-string keys would be a list of keys and a list of values. This kind of structure would come much closer to offering the kinds of get/set operations that you'd want from a real map data structure.
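To make the two proposals concrete, here is a rough sketch in Python. A `dict` stands in for an R named list and a pair of parallel lists stands in for "a list of keys and a list of values"; `map_to_structure` and `lookup` are hypothetical names, not anything in rmr.

```python
def map_to_structure(pairs):
    """Choose a representation for a decoded type-10 map.

    pairs is a list of (key, value) tuples. With all-string keys the result
    is a dict (stand-in for a named list); otherwise it is a (keys, values)
    tuple of parallel lists.
    """
    if all(isinstance(k, str) for k, _ in pairs):
        # Caveat: a dict collapses duplicate keys, a named list would not.
        return dict(pairs)
    keys = [k for k, _ in pairs]
    values = [v for _, v in pairs]
    return keys, values


def lookup(keys, values, key):
    """Get the value stored under key in the parallel-list representation."""
    return values[keys.index(key)]


print(map_to_structure([("a", 1), ("b", 2)]))  # {'a': 1, 'b': 2}
keys, values = map_to_structure([(1, "x"), (2.5, "y")])
print(lookup(keys, values, 2.5))               # y
```

The cost of the mixed approach is the one raised above: the caller has to check which of the two shapes came back.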

On a related note, it may be desirable to have a simplify option when making a typedbytes input that would unlist code 8/9 vector/list objects.

2) Maybe. Using lists as maps would be one way to accomplish preservation of names(list). Another could be another custom type, but that could get silly.

jamiefolson commented 12 years ago

One potential problem with the way rmr 145-code works is for NULL and NAN values. Hive's typedbytes implementation stores null values with typecode 12, which seems reasonable. NAN values could be serialized similarly. However, this representation is fundamentally inconsistent with 145, where the common typecode for all values is written first. In this format, an "array" of numeric values cannot contain NULL values, which have a different typecode.

EDIT: I just checked, and currently, NULL values simply vanish from serialized "array" objects.

piccolbo commented 12 years ago

We have to deal with the streaming implementation of typedbytes, where typecodes 11 and 12 are not supported. I don't know what "similar" means in a serde context. It's either one way or the other. 145 was invented for homogeneous collections; you can use typecode 8 for heterogeneous ones.
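The difference between the two containers can be sketched as follows. This is Python for illustration only. The vector (8) writer follows the typedbytes layout; the homogeneous writer is a HYPOTHETICAL layout (one shared element code, then bare payloads) chosen only to show the constraint, since the actual byte layout of rmr's 145 is not spelled out in this thread. Code 12 is Hive's null and is assumed here to carry no payload.

```python
import struct


def write_value(x):
    """Encode None, an int or a float as type code plus payload."""
    if x is None:
        return bytes([12])                       # Hive null, assumed payload-free
    if isinstance(x, int):
        return bytes([3]) + struct.pack(">i", x)
    return bytes([6]) + struct.pack(">d", x)


def write_vector(xs):
    """Type 8: length header, then each element with its own type code."""
    body = b"".join(write_value(x) for x in xs)
    return bytes([8]) + struct.pack(">i", len(xs)) + body


def write_homogeneous(xs):
    """HYPOTHETICAL homogeneous array: shared code 6, count, bare doubles."""
    if any(x is None for x in xs):
        raise ValueError("no per-element type code, so a null cannot be marked")
    body = b"".join(struct.pack(">d", x) for x in xs)
    return bytes([6]) + struct.pack(">i", len(xs)) + body


print(write_vector([1.0, None]).hex())
print(len(write_homogeneous([1.0, 2.0])))  # 21
```

In the vector every element pays one extra byte for its type code, and that byte is exactly what lets a null sit next to a double.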

jamiefolson commented 12 years ago

Sorry, by "similar" I just meant with a different "application-specific" typecode, although those are binary, so you'd have an extra '0000' after the typecode.

So, at least in Hive, NULL values are semantically closer to NA in R. A Hive ARRAY might contain nulls. It is not possible to use the 145 format combined with typedbytes null values because they're a different type. In an array (145), missing values must be of the typedbytes type that non-missing values are.

Currently, it looks like missing numeric values are written as binary: <7f f0 00 00 00 00 07 a2> whereas character NA is written as just "NA". I'm not sure if these actually get parsed back into R NA values.

Are these aspects of the R internals and how missing values are actually stored?

I understand that these are not really issues for you yet, since streaming does not yet support them, and I appreciate your willingness to help. I'm off on my own, trying to get data into a typedbytes format that will retain as much structure as possible. Right now, both R and Hive/Hadoop have concepts of null/missing values, but there are some conceptual and technical incompatibilities that I think are worth planning for even if you won't be supporting them until Hadoop streaming gets on board.

piccolbo commented 12 years ago

Everything is written as binary with the native or sequence.typedbytes format, no exceptions. I have to say that I am not sure what happens to NAs: it looks like they are converted to FALSE in logical vectors, but they break numeric vectors. Looks like a bug.

jamiefolson commented 12 years ago

Certainly everything goes to either native or typedbytes. I'm saying that there's a bit of a conceptual disconnect between the concept of null between Hive/Hadoop and R, with Null often used (at least in Hive, where it's implemented in typedbytes) to mean something closer to R's NA.

There are also some problems with 145 and Null/NA. One, as you pointed out, seems to be a bug caused by the fact that R actually has an internal NA_ SEXP storing the NA values in memory. Serializing them and reading them back in does not necessarily produce NA values (at least for strings and logicals). For example, NA_STRING has a "value" of "NA" (check out the R source: src/main/names.c), but simply serializing such a string and then deserializing will produce the string "NA" rather than NA_STRING.

The second, but related issue, is that there's currently no way to serialize arrays(145) with missing values from other applications. This may be okay, but seems less than ideal.

For me, this means for now I'm going to have to go back to serializing things as vector(8) and then do some conditional unlist-ing on the map/reduce-side (as well as implement typedbytes null(12) as NA).

jamiefolson commented 12 years ago

So it looks like the numeric NA values, i.e. NA_INTEGER and NA_REAL, are just plain int and double values, not unique cached object references, so they "should" be recognized as NA values when deserialized. I haven't checked this yet, though. This should also allow numeric NA values to be serialized by other tools, since R's values for these are standard bit patterns: Integer.MIN_VALUE for integer NA and an IEEE floating point sNaN with a payload of 1954 (see src/main/arithmetic.c).
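The bit patterns mentioned here can be checked from outside R. The sketch below uses Python's `struct` and the eight bytes quoted earlier in this thread for a missing numeric (7f f0 00 00 00 00 07 a2); treating the low 32-bit word as the payload is my reading of the description above, not something taken from R's headers.

```python
import math
import struct

NA_REAL_BYTES = bytes.fromhex("7ff00000000007a2")  # bytes quoted earlier in the thread

(bits,) = struct.unpack(">Q", NA_REAL_BYTES)   # the same bytes as a 64-bit integer
(value,) = struct.unpack(">d", NA_REAL_BYTES)  # the same bytes as a double

exponent_all_ones = ((bits >> 52) & 0x7FF) == 0x7FF  # marks NaN or infinity
low_word = bits & 0xFFFFFFFF                         # the NA-specific payload

print(math.isnan(value))  # True
print(exponent_all_ones)  # True
print(low_word)           # 1954

# Integer NA is the most negative 32-bit value (Java's Integer.MIN_VALUE).
NA_INTEGER = -2**31
print(struct.pack(">i", NA_INTEGER).hex())  # 80000000
```

Any tool that writes these exact bytes produces something R should read as NA, as long as nothing in between normalizes NaN payloads.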

piccolbo commented 12 years ago

A good serde should cover everything, even these "unique cached object references", which are pretty obscure to me. I think the rest of what you said about NA is correct and my experiments were too cursory: it appears that NAs are serialized correctly in typedbytes for numeric, integer and character, but not for logical, where they come back as FALSE. We seem to disagree on the character case. This is what I see:

    > rmr2:::typed.bytes.writer(objects=list(c("A","b",NA)), con=file("/tmp/NA", "wb"), native=T); gc()
             used (Mb) gc trigger (Mb) max used (Mb)
    Ncells 453609 24.3     818163 43.7   780453 41.7
    Vcells 628570  4.8    1162592  8.9  1077906  8.3
    Warning messages:
    1: closing unused connection 5 (/tmp/NA)
    2: closing unused connection 4 (/tmp/NA)
    3: closing unused connection 3 (/tmp/NA)
    > rmr2:::typed.bytes.reader(readBin(con=file("/tmp/NA", "rb"),what=raw(),1000),100)
    $objects
    $objects[[1]]
    [1] "A" "b" NA

    $length
    [1] 53

If I switch native to FALSE then the conversion to "NA" happens. But unless we can change streaming there isn't much we can do about it.

jamiefolson commented 12 years ago

My bad, I've extended the typedbytes implementation to use array (145) for character vectors, but I also think I was using typedbytes instead of native. I looked through the code and assumed it was happening in master's implementation, too.

I'll take a look. I expect the difference is because of using array (145) to serialize character vectors, which means serializing the actual character value rather than whatever R's native serialization does. So it's not going to be quite as simple to support character vectors in array (145). Not sure how to handle it, but UTF-8 has a private use block, so I could just pick an arbitrary coding for NA-string? Do you have any ideas for the comparable issue for logicals? It's worse there because you don't have any extra bits to use. Numerics work because R uses the floating point spec for NaN and just adds an NA-specific payload. This way they're simultaneously valid numbers and not numbers.

piccolbo commented 12 years ago

I am listening but I am also not understanding how R sees things. Try these

    > as.character(NA)
    [1] NA

not "NA", isn't this what we need?

    > charToRaw(as.character(NA))
    [1] 4e 41
    > rawToChar(charToRaw(as.character(NA)))
    [1] "NA"   #not this!
    > is.na(NA)
    [1] TRUE
    > is.na("NA")
    [1] FALSE
    > is.na(as.character(NA))
    [1] TRUE

jamiefolson commented 12 years ago

So what R does is it creates an instance of a SEXP character string called (I believe) NA_STRING. That string is set to the value "NA". Then the object R_NaString is somehow cached with SET_CACHED(). Any and all comparisons, tests, or assignments to NA for a string will use that instance for comparison (==). Because NA_STRING is an SEXP (which is a pointer), I believe these comparisons are simple identity checks. They will only be true if the pointers refer to the same object.

    SEXP newstr = Rcpp::wrap("NA");
    newstr == NA_STRING;  // Is false

In contrast, NA_REAL and NA_INTEGER are just double/int, respectively, and comparisons with NA_REAL and NA_INTEGER compare the contents of the SEXP, and so will evaluate as expected.

    SEXP newint = Rcpp::wrap(INT_MIN);  // I don't remember the hex value
    INTEGER(newint)[0] == NA_INTEGER;   // Is true (INTEGER(), not REAL(), since this is an integer vector)

One way to think about it is that NA_REAL and NA_INTEGER (also used for factors, I think) are actual values to compare with, whereas NA_STRING is a reference to a specific object that only exists within the R session. You can see this in the definitions. NA_STRING is an SEXP, whereas NA_INTEGER is an int and NA_REAL is a double.
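The same distinction between identity and value can be shown without R's internals. The following is an analogy in Python, not a description of how R is implemented: `NAString` is a made-up stand-in for the single cached NA_STRING object, and `is` plays the role of the pointer comparison.

```python
class NAString(str):
    """Stand-in for the one cached object that marks a missing string."""


NA_STRING = NAString("NA")  # created once, compared by identity afterwards


def is_na(s):
    # Identity check, like comparing two SEXP pointers.
    return s is NA_STRING


# Writing the characters out and reading them back yields a new, ordinary string.
roundtrip = NA_STRING.encode("utf-8").decode("utf-8")

print(roundtrip == NA_STRING)  # True: same characters
print(is_na(roundtrip))        # False: different object, so the NA-ness is lost
print(is_na(NA_STRING))        # True
```

This is why a serializer that writes only the characters cannot bring a missing string back: the information lives in which object it is, not in what it contains.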

My thought is that you could serialize NA strings as a Unicode Private Use Area code point, like U+F8FF, which is encoded in UTF-8 as 0xEFA3BF. Then you have the somewhat silly task of comparing that to every string you read in to see if it should be set to the current NA_STRING.
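A minimal sketch of that idea, in Python and with `None` standing in for a missing string. The choice of U+F8FF comes from the comment above; the function names are hypothetical, and any data that legitimately contains U+F8FF on its own would be misread as missing.

```python
NA_SENTINEL = "\uf8ff"  # Private Use Area code point proposed above


def encode_string(s):
    """Return the UTF-8 bytes for s, writing the sentinel for a missing value."""
    return (NA_SENTINEL if s is None else s).encode("utf-8")


def decode_string(raw):
    """Inverse of encode_string: every decoded string is checked against the sentinel."""
    s = raw.decode("utf-8")
    return None if s == NA_SENTINEL else s


print(encode_string(None).hex())  # efa3bf
print([decode_string(encode_string(s)) for s in ["A", "b", None, "NA"]])
# ['A', 'b', None, 'NA']
```

Unlike writing the characters "NA", the literal string "NA" survives the round trip unchanged.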

piccolbo commented 12 years ago

So NA_STRING per Rinternals.h is just a define for R_NaString. I can't find the definition of the latter. From email conversations about the role of NA (https://stat.ethz.ch/pipermail/r-devel/2002-March/024101.html) it seems this is just a special value, but until I see the source code I won't know for sure.

jamiefolson commented 12 years ago

Yeah, it gets confusing. R_NaString does not get defined, but NA_STRING is assigned to a newly allocated char sexp in src/main/names.c

EDIT: Ha. Just read that email. That would've been a bit surprising. But yeah, you pretty much need to look at the source code.

piccolbo commented 12 years ago

Well, NA_STRING is a define so that is where R_NaString is defined and you are right that that must be a pointer.

Antonio

piccolbo commented 11 years ago

All but typecode 5 are now implemented, at least in the hbase-io branch, but testing could be more thorough (there isn't a one-to-one match with R types, so roundtripping doesn't help). Maps are deserialized to a list of lists, the first holding the keys and the second the values, just as for keyval objects.

jamiefolson commented 11 years ago

Are you able to handle NA values? I've been serializing them as tb null (12) and deserializing as NA_INTEGER, which seems to work well enough. NA_INTEGER will safely convert into any other type without changing the type. In contrast, putting NA_REAL into a list/vector (e.g. of integers) will coerce the vector or unlist(list) into reals. Coercion works fine even with a string list/vector. The only problem with this approach is that it doesn't work for array (145) since tb null (12) is a different tb type.

In serializing, I'm using 'ISNA(x)' to test for NA numbers and 'x == NA_STRING' to test for NA strings.

The only problem with this representation that I've found is that it conflates R's NULL and NA. For my purposes, this is acceptable because I really need to support missing values in typedbytes and tb null (12) seems the only reasonable way to do it right now.
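The conflation can be seen in a few lines. This is a sketch in Python, with `NA` and `NULL` as stand-ins for the two R concepts and code 12 assumed to be a bare type code with no payload; none of the names come from rmr or Hive.

```python
import struct

NA = object()            # stand-in for an R missing value
NULL = None              # stand-in for R's NULL
NULL_CODE = bytes([12])  # Hive's null, assumed to carry no payload


def encode(x):
    """Write NA and NULL as the same null code, ints as type 3."""
    if x is NA or x is NULL:
        return NULL_CODE
    return bytes([3]) + struct.pack(">i", x)


def decode(raw):
    """Read a null code back as NA, since the bytes no longer say which it was."""
    if raw == NULL_CODE:
        return NA
    return struct.unpack(">i", raw[1:])[0]


print(decode(encode(NA)) is NA)    # True
print(decode(encode(NULL)) is NA)  # True: the NULL came back as NA
print(decode(encode(7)))           # 7
```

Once both are written as the same byte, no reader can recover which one was meant, whatever it chooses to return.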

piccolbo commented 11 years ago

At the cost of repeating myself, NA and NULL are deeply different things in R. It may work for your application, but it won't work for a library that has the ambition to be useful for more than one use case. 12 is a Hive specific extension, it won't work with streaming. The typedbytes parser will choke on it. I would like to wrap up this conversation and move on to the next subject.

jamiefolson commented 11 years ago

So are you saying that you do not see rmr supporting missing values (i.e. NA) in typedbytes (other than implicitly, with R's internal representation of numeric NA as NaN)?

Sorry if it seems like I'm beating a dead horse. I just really don't want to keep having to maintain my own version of rmr's typedbytes implementation just to be able to have missing values.

Jamie Olson

piccolbo commented 11 years ago

I think I've beaten the Unicode tables enough by now, and I came back with nothing that would be a natural candidate for an NA character. We are not the only ones with this problem. SAS reportedly uses "" for a missing value, thereby removing empty strings from existence. R seems to use "NA", which is a commonly occurring string with plenty of other meanings. These choices are bugs, no gentler way of putting it. I even found a discussion about replacing "NA" with "" or "" based on some subjective idea of what is a better candidate.

I guess for you to maintain your own version of typedbytes is absolutely not an option, not only because we need your help on much more important things (unless you yourself have much more important things to take care of), but also because typedbytes is still changing and you would have to chase my implementation. We can't toss energies down the drain that way.

One solution is that at the application level you pick a special string, like a Unicode character in the private use area or digest("missing value"), and call that an NA. Nobody else will understand that (just as nobody else would use your custom typedbytes implementation), but the chances of that value occurring in non-missing data are puny (certainly better than "", "NA" or ""). Sorry I don't have a cleaner solution.

Antonio

piccolbo commented 11 years ago

There is an improved implementation of typedbytes in the hbase-io branch. We can cherry pick that for 2.1 even if the hbase formats weren't ready by then.

RevolutionAnalytics / RHadoop

complete typedbytes implementation #97
