THUKEG / saedb

the SAE platform
http://thukeg.github.com/saedb/

Heterogeneous #69

Closed kimiyoung closed 11 years ago

kimiyoung commented 11 years ago

You can run mgraph_test to see the heterogeneous features. However, the application programs still need to be adapted to the new API, which has not been done yet.

wweic commented 11 years ago

@kimiyoung, is this all of your most up-to-date code? If so, we can just review this pull request.

thinxer commented 11 years ago

Yes, but without serialization.

On Tue, Jul 2, 2013 at 1:29 PM, Wei Chen notifications@github.com wrote:

@kimiyoung, is this all your most up-to-date code? If so, then we can just review this pull request.


wweic commented 11 years ago

@kimiyoung's type builder API looks good. I'll read the code and run some tests.

@thinxer, @kimiyoung: as for serialization, how do we store a variable number of types, each of variable length, with mmap? I remember your idea was to create one mmap per type and record some index information. Which stage are we at now?

My previous serialization code serializes a single vertex data type and a single edge data type; it does not support heterogeneous data. So I think the missing part is an index of the whole graph's type information, plus the location (which mmap file) of each vertex's and edge's data.

Oh, I saw some new fields like global_id and local_id. Are these the index?

thinxer commented 11 years ago

In fact, there is no progress on storing the serialized data. I think we can just Load and Save the dynamic data to a normal file, without mmap. Please investigate how to do that.
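
A minimal sketch of what saving and loading variable-length data in a normal file might look like, assuming every record has already been serialized to a std::string. The names SaveRecords and LoadRecords and the length-prefixed layout are assumptions made for illustration; they are not part of saedb.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: writes records as [count][len][bytes][len][bytes]...
// Sizes are stored in host byte order, so the file is not portable
// across machines with a different endianness.
inline void SaveRecords(const std::string& path,
                        const std::vector<std::string>& records) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    assert(out && "cannot open file for writing");
    std::uint64_t count = records.size();
    out.write(reinterpret_cast<const char*>(&count), sizeof(count));
    for (const std::string& r : records) {
        std::uint64_t len = r.size();
        out.write(reinterpret_cast<const char*>(&len), sizeof(len));
        out.write(r.data(), static_cast<std::streamsize>(len));
    }
}

// Hypothetical helper: reads back what SaveRecords wrote.
inline std::vector<std::string> LoadRecords(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    assert(in && "cannot open file for reading");
    std::uint64_t count = 0;
    in.read(reinterpret_cast<char*>(&count), sizeof(count));
    std::vector<std::string> records;
    records.reserve(count);
    for (std::uint64_t i = 0; i < count; ++i) {
        std::uint64_t len = 0;
        in.read(reinterpret_cast<char*>(&len), sizeof(len));
        std::string r(len, '\0');
        in.read(&r[0], static_cast<std::streamsize>(len));
        records.push_back(std::move(r));
    }
    return records;
}
```

Unlike an mmap, the whole file is read into memory on load, which is the trade-off this suggestion accepts in exchange for supporting variable-length data.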

kimiyoung commented 11 years ago

For a vertex, global_id is its index in the whole vertex list, while local_id is its index among the vertices of its own type. The same holds for edges.
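
To illustrate the relation between the two ids, here is a small sketch. VertexRef and VertexIndex are hypothetical names and do not reflect the actual MappedGraph layout; the sketch only shows that global_id counts all vertices while local_id counts vertices of one type.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical index entry, one per vertex.
struct VertexRef {
    std::uint32_t type;      // type rank of the vertex
    std::uint32_t local_id;  // position among the vertices of that type
};

class VertexIndex {
public:
    explicit VertexIndex(std::uint32_t type_count)
        : per_type_count_(type_count, 0) {}

    // Appends a vertex of the given type and returns its global_id.
    std::uint32_t Add(std::uint32_t type) {
        assert(type < per_type_count_.size());
        refs_.push_back(VertexRef{type, per_type_count_[type]++});
        return static_cast<std::uint32_t>(refs_.size() - 1);
    }

    // Maps a global_id to (type, local_id).
    const VertexRef& Lookup(std::uint32_t global_id) const {
        assert(global_id < refs_.size());
        return refs_[global_id];
    }

private:
    std::vector<std::uint32_t> per_type_count_;  // next local_id per type
    std::vector<VertexRef> refs_;                // indexed by global_id
};
```

With per-type data files, the pair (type, local_id) would be enough to locate a vertex's data: the type selects the file and local_id selects the record in it.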

wweic commented 11 years ago

Next is to merge with serialization #59.

thinxer commented 11 years ago

I need "InEdgesBySourceType" and "OutEdgesByTargetType" for VertexIterator, as well as "InEdgesByType" and "OutEdgesByType". Do you guys have a simpler interface design for this kind of query?

thinxer commented 11 years ago

Or we can just modify InEdges, for example like this: InEdges(int edge_type_mask, int source_type_mask), where the masks are bitmaps indicating the wanted types.

This way we can support up to 32 types, which I think is enough.
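
A rough sketch of the mask idea, written as a free function over a plain edge list rather than as a VertexIterator method. The Edge struct and TypeBit helper are assumptions for illustration. The masks are unsigned here so that bit 31 can be used without shifting into the sign bit.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical edge record for the sketch.
struct Edge {
    int source;
    int target;
    int edge_type;    // type rank in [0, 32)
    int source_type;  // type rank of the source vertex, in [0, 32)
};

// Bit i of a mask is set when type i is wanted.
inline std::uint32_t TypeBit(int type) {
    assert(type >= 0 && type < 32);
    return std::uint32_t{1} << type;
}

// Scans all edges and keeps those that enter `vertex` and whose edge type
// and source type are both selected by the masks.
inline std::vector<Edge> InEdges(const std::vector<Edge>& edges, int vertex,
                                 std::uint32_t edge_type_mask,
                                 std::uint32_t source_type_mask) {
    std::vector<Edge> result;
    for (const Edge& e : edges) {
        if (e.target == vertex &&
            (TypeBit(e.edge_type) & edge_type_mask) != 0 &&
            (TypeBit(e.source_type) & source_type_mask) != 0) {
            result.push_back(e);
        }
    }
    return result;
}
```

A caller would combine wanted types with bitwise OR, for example TypeBit(0) | TypeBit(1) to accept sources of type 0 or 1.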

kimiyoung commented 11 years ago

OutEdgesByType is now available; please refer to the interface OutEdgesOfType. InEdgesByType can be implemented in a similar way, but it is not available at the moment.

kimiyoung commented 11 years ago

To support direct query for InEdgesBySourceType and OutEdgesByTargetType, maybe we need to sort edges by source type and target type respectively, map them to two independent files and build indices?
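
One way the sorted layout could look, as a sketch only: edges are sorted by source type and an offsets array marks where each type's range begins. TypedEdge, SourceTypeIndex and BuildSourceTypeIndex are hypothetical names, and the sketch works in memory instead of on mapped files.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical edge record for the sketch.
struct TypedEdge {
    int source;
    int target;
    int source_type;  // type rank of the source vertex
};

// Edges sorted by source type, plus offsets such that the edges of
// type t occupy the half-open range [offsets[t], offsets[t + 1]).
struct SourceTypeIndex {
    std::vector<TypedEdge> edges;
    std::vector<std::size_t> offsets;
};

inline SourceTypeIndex BuildSourceTypeIndex(std::vector<TypedEdge> edges,
                                            int type_count) {
    assert(type_count > 0);
    // stable_sort keeps the original order inside each type.
    std::stable_sort(edges.begin(), edges.end(),
                     [](const TypedEdge& a, const TypedEdge& b) {
                         return a.source_type < b.source_type;
                     });
    std::vector<std::size_t> offsets(static_cast<std::size_t>(type_count) + 1, 0);
    for (const TypedEdge& e : edges) {
        assert(e.source_type >= 0 && e.source_type < type_count);
        ++offsets[static_cast<std::size_t>(e.source_type) + 1];
    }
    // Prefix sum turns per-type counts into start offsets.
    for (std::size_t t = 1; t < offsets.size(); ++t) offsets[t] += offsets[t - 1];
    return SourceTypeIndex{std::move(edges), std::move(offsets)};
}
```

A query for one source type then reads a contiguous range instead of scanning every edge. The cost is a second copy of the edges for the target-type ordering.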

thinxer commented 11 years ago

Yes. But for now, we can just iterate through all edges, and filter out those edges we don't need.

wweic commented 11 years ago

@kimiyoung, about serialization: we can just change MappedGraphImpl's vdata_file and edata_file from unique_ptr<MMapFile> * to unique_ptr<File> *. Serialization is then easy: just write the binary data into the corresponding file.

The tricky part is deserialization. char ** vertex_data and char ** edge_data should stay the same in GraphData, but their meaning changes: vertex_data[i] is actually a vector<data_type_i>*. When we load the graph data, the user has to associate each data_type_rank with its real C++ class type through some API; then we have the type information needed to deserialize each vertex data file. To maintain consistency, we should assign data_type_rank manually when we build the graph.

as follows:

graph.associate<VData>(1);
graph.associate<VData2>(2);

and in associate<data_type>(data_rank):

for i in 1 to count : 
    data_type t; 
    cin >> t; 
    ((vector<data_type> *)vertex_data[data_rank])->push_back(t);

What do you think, given your familiarity with MappedGraph?

kimiyoung commented 11 years ago

@pondering To do serialization, it seems that we still need to know the exact type; otherwise we don't know how to serialize the data. Would it be feasible to store every type of data as std::string in the graph and provide a serialize/deserialize API for users?
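
A sketch of this string-based alternative, under the assumption that the graph stores opaque strings and the user converts on every access. VData, Serialize, Deserialize and OpaqueVertexStore are hypothetical names, not saedb API, and the text format is chosen only to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical user type with a user-written codec.
struct VData {
    int id;
    std::string name;
};

inline std::string Serialize(const VData& v) {
    std::ostringstream out;
    out << v.id << ' ' << v.name;  // assumes name contains no whitespace
    return out.str();
}

inline VData Deserialize(const std::string& bytes) {
    std::istringstream in(bytes);
    VData v{0, ""};
    in >> v.id >> v.name;
    return v;
}

// The graph side only sees opaque strings and never needs to know VData.
class OpaqueVertexStore {
public:
    // Stores the serialized bytes and returns the new vertex's global_id.
    std::size_t Add(std::string bytes) {
        data_.push_back(std::move(bytes));
        return data_.size() - 1;
    }
    const std::string& Get(std::size_t global_id) const {
        assert(global_id < data_.size());
        return data_[global_id];
    }

private:
    std::vector<std::string> data_;
};
```

The graph needs no type information at all with this design, but each read pays for a Deserialize call.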

wweic commented 11 years ago

@kimiyoung, yeah, the user provides the exact type through associate. Once the user has called associate for each data type, each data type has its own associate instantiation.

The serialization/deserialization logic (an overloaded >> operator) is provided by the user in the namespace custom_serialization_impl for their data types, and the overloaded >> is called in the corresponding associate.

for instance:

struct VData {...};
struct VData2 {...};
namespace custom_serialization_impl {  
    template <>
    struct deserialize_impl<ISerializeStream, VData> { };

    template <>
    struct serialize_impl<ISerializeStream, VData> { };

    template <>
    struct deserialize_impl<ISerializeStream, VData2> { };

    template <>
    struct serialize_impl<ISerializeStream, VData2> { };

}
associate<VData>(0);
associate<VData2>(1);

associate<VData>(0) and associate<VData2>(1) call different functions.

the cin >> t in associate<VData>(0) will call struct deserialize_impl<ISerializeStream, VData>, while the cin >> t in associate<VData2>(1) calls struct deserialize_impl<ISerializeStream, VData2>.

Do you think this logic is clear enough? Can it be simplified further?

kimiyoung commented 11 years ago

Users should associate twice, right? Once for Save and once for Load.

thinxer commented 11 years ago

I think we need to call "associate" only once. We can register both the serialize and deserialize functions at the same time.

wweic commented 11 years ago

@kimiyoung, @thinxer: for serialization we don't need to call associate; the previous code is enough, right? I guess..

thinxer commented 11 years ago

No, it is not enough. At run time we have no idea which serialization/deserialization code belongs to a type (I mean a type id or a type name string). We have to register them at run time for automatic serialization/deserialization.

kimiyoung commented 11 years ago

@thinxer I agree. But how do we "register" a bunch of template functions?

kimiyoung commented 11 years ago

I mean, is it possible to store information like "which type matches which rank"?

wweic commented 11 years ago

@thinxer, yeah, I was wrong. I have the same question as @kimiyoung. Inside template functions, we could construct the functions we may need later and save them as closures, which is weird.

kimiyoung commented 11 years ago

Maybe unified storage as stringstream or string is more elegant and easier to implement.

kimiyoung commented 11 years ago

Serialization/deserialization are only needed when modifying/accessing data. What do you think?

kimiyoung commented 11 years ago

Though it may lead to inefficiency.

thinxer commented 11 years ago

No, we don't store function templates (which is impossible). We store the instantiated functions.

I see the problem. It's not possible to store functions of different types in one map.

I suggest that we modify the serialization functions to accept "void *" and the deserialization functions to return "void *". This way we have a unified function signature.

Or we can just call the typed serialization/deserialization functions in a lambda, as @pondering suggests.

something like this:

associate<T>(type) {
    serialize_map[type] = [](void* t) { return serialize(*((T*)t)); };
    deserialize_map[type] = [](stringstream& s) { return (void*) deserialize<T>(s); };
}

Or we just let the users do the dirty work, as @kimiyoung suggests.

What do you guys say?
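
For reference, a self-contained sketch of the lambda approach, assuming typed serialize<T>/deserialize<T> functions that the user specializes. TypeRegistry, Save, Load and the example VData codec are hypothetical and only illustrate how one associate call could register both directions behind a unified void* signature.

```cpp
#include <cassert>
#include <functional>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical typed codec that users would specialize per data type.
template <typename T> std::string serialize(const T& value);
template <typename T> T* deserialize(std::istream& in);

struct VData { int weight; };

template <> inline std::string serialize<VData>(const VData& v) {
    return std::to_string(v.weight);
}
template <> inline VData* deserialize<VData>(std::istream& in) {
    VData* v = new VData{0};  // caller takes ownership
    in >> v->weight;
    return v;
}

class TypeRegistry {
public:
    // One call registers both directions for the type rank `type`.
    template <typename T>
    void associate(int type) {
        assert(type >= 0);
        serialize_map_[type] = [](const void* p) {
            return serialize<T>(*static_cast<const T*>(p));
        };
        deserialize_map_[type] = [](std::istream& in) -> void* {
            return deserialize<T>(in);
        };
    }

    // Both throw std::out_of_range if `type` was never associated.
    std::string Save(int type, const void* p) const {
        return serialize_map_.at(type)(p);
    }
    void* Load(int type, std::istream& in) const {
        return deserialize_map_.at(type)(in);
    }

private:
    std::map<int, std::function<std::string(const void*)>> serialize_map_;
    std::map<int, std::function<void*(std::istream&)>> deserialize_map_;
};
```

The lambdas capture nothing; the type T is baked in when associate<T> is instantiated, so every stored function has the same signature even though each one calls a different typed codec.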

thinxer commented 11 years ago

Merged.