Closed kimiyoung closed 11 years ago
@kimiyoung, is this all of your most up-to-date code? If so, we can just review this pull request.
Yes, but without serialization.
@kimiyoung's type builder API looks good. I'll read the code and run some tests.
@thinxer, @kimiyoung: as for serialization, how do we store a variable number of types, each of variable length, with mmap? I remember your idea was to create one mmap per type and record some index information. Which stage are we at now?
My previous serialization code serializes one vertex data type and one edge data type; it does not support heterogeneous data. So I think the missing part is an index of the whole graph's type information, recording the location (which mmap file) of each vertex's and edge's data.
Oh, I saw some new fields like `global_id` and `local_id`. Are these the index?
In fact, there is no progress on storing the serialized data. I think we can just load and save the dynamic data in a normal file, without mmap. Please investigate how to do that.
For a vertex, `global_id` is the index in the whole vertex list, while `local_id` is the index among the vertices of its own type. The same holds for edges.
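To make the two ids concrete, here is a minimal sketch that assumes vertices are appended in order and each carries a small non-negative type rank. `VertexIds` and `AssignIds` are hypothetical names for illustration, not part of saedb.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record, not the actual saedb structure.
struct VertexIds {
    std::size_t global_id;  // position in the whole vertex list
    std::size_t local_id;   // position among vertices of the same type
};

// Assigns ids to vertices given in insertion order; types[i] is the type
// rank of the i-th vertex.
inline std::vector<VertexIds> AssignIds(const std::vector<int>& types) {
    std::vector<VertexIds> ids;
    std::vector<std::size_t> seen_per_type;  // how many of each type so far
    for (std::size_t g = 0; g < types.size(); ++g) {
        assert(types[g] >= 0);
        std::size_t t = static_cast<std::size_t>(types[g]);
        if (t >= seen_per_type.size()) seen_per_type.resize(t + 1, 0);
        ids.push_back({g, seen_per_type[t]++});
    }
    return ids;
}
```

With types `{0, 1, 0}`, the third vertex would get `global_id` 2 and `local_id` 1. The same scheme would apply to edges.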
The next step is to merge with serialization #59.
I need "InEdgesBySourceType" and "OutEdgesByTargetType" for VertexIterator, as well as "InEdgesByType" and "OutEdgesByType". Do you guys have a simpler interface design for this kind of query?
Or we can just modify InEdges, for example: InEdges(int edge_type_mask, int source_type_mask), where the masks are bitmaps indicating the wanted types.
This way, we can support up to 32 types, which I think is enough.
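A rough sketch of how such masks could work, assuming type ids are integers in the range 0..31. It uses `std::uint32_t` rather than `int` so that bit 31 is usable without signed-shift problems. `TypeBit`, `MaskHas`, `EdgeInfo`, and `Matches` are hypothetical helpers, not existing saedb functions.

```cpp
#include <cassert>
#include <cstdint>

// One bit per type id, so a 32-bit mask covers type ids 0..31.
inline std::uint32_t TypeBit(int type_id) {
    assert(type_id >= 0 && type_id < 32);
    return std::uint32_t{1} << type_id;
}

// True when type_id is one of the types selected by mask.
inline bool MaskHas(std::uint32_t mask, int type_id) {
    return (mask & TypeBit(type_id)) != 0;
}

// Hypothetical edge record carrying its own type and its source vertex type.
struct EdgeInfo {
    int edge_type;
    int source_type;
};

// The per-edge check a mask-based InEdges(edge_type_mask, source_type_mask)
// could apply while iterating.
inline bool Matches(const EdgeInfo& e, std::uint32_t edge_type_mask,
                    std::uint32_t source_type_mask) {
    return MaskHas(edge_type_mask, e.edge_type) &&
           MaskHas(source_type_mask, e.source_type);
}
```

A caller would combine wanted types with bitwise OR, for example `TypeBit(0) | TypeBit(3)` to select types 0 and 3.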
`OutEdgesByType` is now available; please refer to the interface `OutEdgesOfType`. `InEdgesByType` can be implemented in a similar way, but it is not available at the moment.
To support direct queries for `InEdgesBySourceType` and `OutEdgesByTargetType`, maybe we need to sort edges by source type and target type respectively, map them to two independent files and build indices?
Yes. But for now, we can just iterate through all edges, and filter out those edges we don't need.
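The iterate-and-filter fallback could look roughly like this. The flat `Edge` record and the `vertex_type` lookup table are assumptions made for the sketch; the real graph stores edges differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat edge record.
struct Edge {
    int source;
    int target;
};

// Collects the in-edges of `target` whose source vertex has type
// `source_type` by scanning every edge: O(|E|) per query, no index needed.
inline std::vector<Edge> InEdgesBySourceType(const std::vector<Edge>& edges,
                                             const std::vector<int>& vertex_type,
                                             int target, int source_type) {
    std::vector<Edge> result;
    for (const Edge& e : edges) {
        assert(e.source >= 0 &&
               static_cast<std::size_t>(e.source) < vertex_type.size());
        if (e.target == target && vertex_type[e.source] == source_type) {
            result.push_back(e);
        }
    }
    return result;
}
```

This trades query speed for simplicity; sorted per-type files with indices would avoid the full scan at the cost of extra storage and build time.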
@kimiyoung, about serialization: we can just change `MappedGraphImpl`'s `vdata_file` and `edata_file` from `unique_ptr<MMapFile> *` to `unique_ptr<File> *`. Serialization is then easy: just write the binary data into the corresponding file.
The tricky part is deserialization. `char ** vertex_data` and `char ** edge_data` should stay the same in `GraphData`, but their meaning changes: `vertex_data[i]` is actually a `vector<data_type_i>*`. When we load the graph data, the user has to associate each data_type_rank with its real C++ class type through some API; then we have the type information needed to deserialize each vertex data file. To keep this consistent, we should assign data_type_rank manually when we build the graph, as follows:
graph.associate<VData>(1);
graph.associate<VData2>(2);
and in `associate<data_type>(data_rank)`:
for i in 1 to count:
    data_type t;
    cin >> t;
    ((vector<data_type> *)vertex_data[data_rank])->push_back(t);
What do you think, given your familiarity with `MappedGraph`?
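The loading idea above might be sketched as follows. This is only an illustration under assumptions: `GraphLoader`, `Associate`, and `Get` are hypothetical names, the slots are held as `std::shared_ptr<void>` instead of `char **`, and each type is read with its `operator>>` from a plain `std::istream`.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <memory>
#include <sstream>
#include <vector>

// Hypothetical loader: each rank owns a type-erased std::vector<T>.
class GraphLoader {
public:
    // Reads `count` values of type T from `in` into the slot for `rank`.
    // T must provide operator>>.
    template <typename T>
    void Associate(int rank, std::istream& in, int count) {
        auto values = std::make_shared<std::vector<T>>();
        for (int i = 0; i < count; ++i) {
            T t;
            in >> t;
            values->push_back(t);
        }
        // Converting to shared_ptr<void> keeps the correct deleter.
        vertex_data_[rank] = values;
    }

    // The caller must name the same T that was associated with `rank`.
    template <typename T>
    const std::vector<T>& Get(int rank) const {
        auto it = vertex_data_.find(rank);
        assert(it != vertex_data_.end());
        return *static_cast<const std::vector<T>*>(it->second.get());
    }

private:
    std::map<int, std::shared_ptr<void>> vertex_data_;
};
```

For example, with a `std::istringstream` holding `"1 2 3"`, `Associate<int>(1, in, 3)` would fill rank 1 with three integers. Nothing here checks that `Get<T>` names the type that was stored, so the rank-to-type mapping has to be kept consistent by the caller.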
@pondering To do serialization, it seems that we still need to know the exact type, otherwise we don't know how to serialize the data.
Would it be feasible to store every type of data as `std::string` in the graph, and provide a serialize/deserialize API for users?
@kimiyoung, yes, the user provides the exact type through `associate`. Once the user has called `associate` for each data type, each data type has its own `associate` instantiation.
The serialization/deserialization logic (an overloaded `>>` operator for each type) is provided by the user in the namespace `custom_serialization_impl`. The overloaded `>>` is then called in the corresponding `associate`.
for instance:
struct VData {...};
struct VData2 {...};
namespace custom_serialization_impl {
template <>
struct deserialize_impl<ISerializeStream, VData> { };
template <>
struct serialize_impl<ISerializeStream, VData> { };
template <>
struct deserialize_impl<ISerializeStream, VData2> { };
template <>
struct serialize_impl<ISerializeStream, VData2> { };
}
associate<VData>(0);
associate<VData2>(1);
`associate<VData>(0)` and `associate<VData2>(1)` call different functions.
The `cin >> t` in `associate<VData>(0)` will call `struct deserialize_impl<ISerializeStream, VData>`, while the `cin >> t` in `associate<VData2>(1)` calls `struct deserialize_impl<ISerializeStream, VData2>`.
do you think this logic is clear enough? any further simplification?
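For illustration, the per-type dispatch could be sketched like this. It is an assumption-laden sketch: standard streams stand in for `ISerializeStream`, the member fields of `VData` and `VData2` are invented, and the static `run` member and the `Save`/`Load` helpers are hypothetical names.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

struct VData { int pagerank; };      // fields invented for the example
struct VData2 { std::string name; };

namespace custom_serialization_impl {
// Primary templates are only declared; each user type must specialize them.
template <typename Stream, typename T> struct serialize_impl;
template <typename Stream, typename T> struct deserialize_impl;

template <> struct serialize_impl<std::ostream, VData> {
    static void run(std::ostream& out, const VData& v) { out << v.pagerank << ' '; }
};
template <> struct deserialize_impl<std::istream, VData> {
    static void run(std::istream& in, VData& v) { in >> v.pagerank; }
};
template <> struct serialize_impl<std::ostream, VData2> {
    static void run(std::ostream& out, const VData2& v) { out << v.name << ' '; }
};
template <> struct deserialize_impl<std::istream, VData2> {
    static void run(std::istream& in, VData2& v) { in >> v.name; }
};
}  // namespace custom_serialization_impl

// Generic entry points: the compiler picks the specialization matching T.
template <typename T>
void Save(std::ostream& out, const T& v) {
    custom_serialization_impl::serialize_impl<std::ostream, T>::run(out, v);
}

template <typename T>
T Load(std::istream& in) {
    T v{};
    custom_serialization_impl::deserialize_impl<std::istream, T>::run(in, v);
    assert(!in.fail());  // the read must have succeeded
    return v;
}
```

Calling `Load<VData>` and `Load<VData2>` resolves to different specializations at compile time, which is the behaviour described for `associate<VData>(0)` and `associate<VData2>(1)`. A type without a specialization fails to compile rather than failing at runtime.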
Users should call `associate` twice, right? Once for Save and once for Load.
I think we only need to call `associate` once. We can register both the serialize and deserialize functions at the same time.
@kimiyoung, @thinxer, for serialization we don't need to call `associate`; the previous code is enough, right? I guess so.
No, it is not enough. At runtime we have no idea which serialization/deserialization code belongs to a type (I mean a type id or a type name string). We have to register them at runtime for automatic serialization/deserialization.
@thinxer I agree. But how to "register" a bunch of template functions?
I mean is it possible to store the info like "which type matches which rank".
@thinxer, yeah, I was wrong. I have the same question as @kimiyoung. In template functions, we can construct the functions we may use later and save them into closures, which is weird.
Maybe unified storage as `stringstream` or `string` is more elegant and easier to implement.
Serialization/deserialization is only needed when modifying or accessing data. What do you think?
Though it may lead to inefficiency.
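A small sketch of the string-storage idea, assuming each payload can be written and read back with `operator<<` and `operator>>`. `StringVertexStore` is a hypothetical name, and real data would likely need a binary format rather than text.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical store: every vertex payload is kept as an opaque string.
class StringVertexStore {
public:
    // Serializes on write and returns the index of the new payload.
    template <typename T>
    std::size_t Add(const T& value) {
        std::ostringstream out;
        out << value;
        data_.push_back(out.str());
        return data_.size() - 1;
    }

    // Deserializes on every access; the caller supplies the type.
    template <typename T>
    T Get(std::size_t id) const {
        assert(id < data_.size());
        std::istringstream in(data_[id]);
        T value{};
        in >> value;
        return value;
    }

private:
    std::vector<std::string> data_;
};
```

The graph itself never needs to know the payload types, but each `Get` pays for a parse, which is the inefficiency mentioned above.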
No, don't store function templates (which is impossible). We store the instantiated functions.
I see the problem. It's not possible to store different types of functions into a map.
I suggest that we modify the serialization functions to accept "void *" and the deserialization functions to return "void *". This way we can have a unified function signature.
Or we can just call the typed serialization/deserialization functions in a lambda, as @pondering suggests.
something like this:
associate<T>(type) {
    serialize_map[type] = [](void* t) { return serialize(*((T*)t)); };
    deserialize_map[type] = [](stringstream& s) { return (void*) deserialize<T>(s); };
}
Or we just let the users do the dirty work, as @kimiyoung suggests.
What do you guys say?
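The lambda-based registration could be fleshed out roughly as follows. This is a sketch under assumptions: `TypeRegistry`, `Serialize`, and `Deserialize` are hypothetical, values are serialized to text through `operator<<` and `operator>>`, and deserialized objects are returned as `std::shared_ptr<void>` rather than a raw `void*` so that ownership is clear.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <sstream>
#include <string>

// Hypothetical typed helpers; T must support operator<< and operator>>.
template <typename T>
std::string Serialize(const T& value) {
    std::ostringstream out;
    out << value;
    return out.str();
}

template <typename T>
T Deserialize(const std::string& bytes) {
    std::istringstream in(bytes);
    T value{};
    in >> value;
    return value;
}

class TypeRegistry {
public:
    using SerializeFn = std::function<std::string(const void*)>;
    using DeserializeFn = std::function<std::shared_ptr<void>(const std::string&)>;

    // One call registers both directions for the given type id.
    template <typename T>
    void Associate(int type) {
        serialize_map_[type] = [](const void* p) {
            return Serialize(*static_cast<const T*>(p));
        };
        deserialize_map_[type] = [](const std::string& s) -> std::shared_ptr<void> {
            return std::make_shared<T>(Deserialize<T>(s));
        };
    }

    std::string Save(int type, const void* value) const {
        auto it = serialize_map_.find(type);
        assert(it != serialize_map_.end());
        return it->second(value);
    }

    std::shared_ptr<void> Load(int type, const std::string& bytes) const {
        auto it = deserialize_map_.find(type);
        assert(it != deserialize_map_.end());
        return it->second(bytes);
    }

private:
    std::map<int, SerializeFn> serialize_map_;
    std::map<int, DeserializeFn> deserialize_map_;
};
```

Because every lambda has the same erased signature, differently typed functions fit into one map, and a single `Associate<T>(type)` call covers both Save and Load. The lambdas capture nothing, so there is no dangling-reference risk from a `[&]` capture.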
Merged.
You may run `mgraph_test` to see the heterogeneous features. However, the app programs still need to be adapted to the new API, which is not yet done.