Babzsak / duke

Automatically exported from code.google.com/p/duke

Support value types (other than String) #140

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
At the moment there is no way to store data in any format other than String, 
even if the data is marked as not analyzed.

The RecordImpl as well as the Cleaner interface are designed to use strings as 
values. It's weird because I'm able to provide my own implementation of an 
appropriate comparator, so even binary data might be used for 
linking/deduplication.

Original issue reported on code.google.com by srg...@gmail.com on 13 Nov 2013 at 1:09

GoogleCodeExporter commented 8 years ago
It's a deliberate design choice to make the code simpler. I figured if I had to 
make Record objects support datatypes, Comparators would also need to, and 
complexity would spread throughout the whole code base. I know it's not "right" 
from a computer science viewpoint, but it does work well, and it avoids all 
that complexity.

You can still use binary data, though; you just need to represent it as a 
string (for example as a sequence of hex codes).
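
As a rough illustration of the hex-code idea, here is a minimal sketch in plain 
Java. The class and method names (HexValues, toHex, fromHex) are hypothetical 
and not part of Duke; the assumption is only that the resulting string is what 
gets stored as the property value.

```java
// Hypothetical helper, not part of Duke: converts binary values to and from
// hex strings so they can be stored as ordinary String values.
final class HexValues {
    private static final char[] DIGITS = "0123456789abcdef".toCharArray();

    // Encode raw bytes as lowercase hex, two characters per byte.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(DIGITS[(b >> 4) & 0x0F]); // high nibble
            sb.append(DIGITS[b & 0x0F]);        // low nibble
        }
        return sb.toString();
    }

    // Decode a hex string produced by toHex back into bytes.
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex string: " + hex);
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }
}
```

A custom comparator could then either compare the hex strings directly or 
decode them with fromHex first.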

Do you have any specific use case for non-string, or are you just making a 
general complaint about the design?

Original comment by lar...@gmail.com on 13 Nov 2013 at 1:12

GoogleCodeExporter commented 8 years ago
Sure, it's clear that it was a deliberate design choice. 

But at the moment I'm trying to use duke in conjunction with Cassandra via the 
JDBC driver, and I've run into a problem with UUID values. Cassandra stores a 
UUID as a ByteBuffer and duke handles it as a string, so I get strings like the 
following one:
java.nio.HeapByteBuffer[pos=175 lim=191 cap=307].
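
A string of that shape is what ByteBuffer.toString() returns: it describes the 
buffer's position, limit, and capacity rather than its contents. The sketch 
below reproduces it with plain Java, assuming (this is a guess about what 
happens to the column value) that the value is simply converted with 
toString(). The class name is hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical demo, not Duke code: shows what stringifying a ByteBuffer gives.
final class ByteBufferToStringDemo {
    static String describe() {
        ByteBuffer buffer = ByteBuffer.allocate(307); // heap buffer, capacity 307
        buffer.limit(191);
        buffer.position(175);                         // 16 bytes remain: one UUID
        // toString() reports pos/lim/cap only, never the bytes themselves.
        return String.valueOf(buffer);
    }
}
```

Note that lim minus pos is 16 here, which is exactly the size of a UUID.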

And there is no way to get UUIDs out of the box.

I'm not sure, but would it be a big deal to use Object instead of String?
There is no need for Java generics, because of the XML configuration, and no 
need to create "typed" columns. For backward compatibility, all data would by 
default be treated (internally) as a String. In other words, Comparators, 
Records, and Cleaners would operate on Objects.

Original comment by srg...@gmail.com on 13 Nov 2013 at 2:31

GoogleCodeExporter commented 8 years ago
Hmmm. Are you using the JDBC data source? If so, I could try to handle it 
there. 

Note that you're now getting a ByteBuffer, so simply having Object would not be 
much help. You don't want to pass around ByteBuffer objects inside the Duke 
core, without knowing what the buffers are connected to and when they get 
closed/invalidated.

The obvious solution would be to turn UUIDs into URNs following RFC 4122 
(http://www.ietf.org/rfc/rfc4122.txt). That would give you nice, readable 
strings in a standard format. It would also not require any changes to the 
Duke core.
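
A minimal sketch of that conversion in plain Java follows. It assumes the 
buffer holds the 16 UUID bytes in big-endian order starting at its current 
position, which matches the pos/lim values quoted earlier but is not confirmed 
for the Cassandra driver. The class and method names are hypothetical, and 
where this would hook into the JDBC data source is left open.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper, not part of Duke: turns a 16-byte UUID buffer into
// an RFC 4122 URN string such as urn:uuid:123e4567-e89b-12d3-a456-426614174000.
final class UuidStrings {
    static String toUrn(ByteBuffer buffer) {
        // Work on a duplicate so the caller's position and limit are untouched.
        // A duplicate of a ByteBuffer always reads in big-endian order.
        ByteBuffer view = buffer.duplicate();
        if (view.remaining() < 16) {
            throw new IllegalArgumentException(
                "expected at least 16 bytes, got " + view.remaining());
        }
        long mostSignificant = view.getLong();  // first 8 bytes
        long leastSignificant = view.getLong(); // last 8 bytes
        return "urn:uuid:" + new UUID(mostSignificant, leastSignificant);
    }
}
```

The result is an ordinary String, so Records, Cleaners, and Comparators can 
handle it unchanged.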

Java generics wouldn't help us at all, because the types are not known at 
compile time. So basically we'd need to have either just Object, which is quite 
painful, or a bazillion different getXxxValue methods for various types. The 
trouble is that once you do that, the cancer spreads into all the client code, 
and most of Duke uses Record objects. I haven't sat down to study precisely how 
big the impact would be, but my gut feeling is that this is a very expensive 
idea, especially in the long run.

It's not just Comparators, Records, and Cleaners, sadly. The Database would 
also have to handle arbitrary objects, which is not really possible, because 
these have to be serialized. So we'd need to restrict the types to some closed 
set. Then come the user interfaces, which need to be able to display values so 
that the user can see what's going on. And so on and so forth.

Sorry if this sounds like a strange way of thinking to you, but my main goal in 
writing software is avoiding unnecessary complexity. So far it sounds to me 
like you need to transform your ByteBuffers into something else no matter what, 
and that something else might as well be strings.

Original comment by lar...@gmail.com on 13 Nov 2013 at 3:11