apache / incubator-graphar

An open source, standard data file format for graph data storage and retrieval.
https://graphar.apache.org/
Apache License 2.0

feat(c++): support for multi-property data #660

Open Elssky opened 3 days ago

Elssky commented 3 days ago

Describe the enhancement requested

TinkerPop would like GraphAr to be compatible with multi-property data. Their description of the expected behavior is as follows:

A multi-property is a bit like storing the values in a list for a
single key, but a multi-property has some particularities to it:
gremlin> g.addV().property('name',['alice','bob'])
==>v[0]
gremlin> g.addV().property(list,'name','craig').property(list,'name','dave')
==>v[2]
gremlin> g.V(0l).properties()
==>vp[name->[alice, bob]]
gremlin> g.V(2l).properties()
==>vp[name->craig]
==>vp[name->dave]

In the above example, vertex 0 stores alice and bob as a list, while
vertex 2 stores craig and dave as a multi-property. You can see that
TinkerPop treats the latter as two separate properties, which leads to other
differences:

gremlin> g.V().has('name','alice')
gremlin> g.V().has('name','dave')
==>v[2]

In the previous example, alice can't be found because "name" has a List
object to match on for vertex 0. But we can find dave because vertex 2 used
a multi-property, which stores it as a string on an individual property.

This may all be Gremlin semantics that I'm describing and might not have
any impact on how you choose to implement the ability to model
"multi-properties" for GraphAr, but I thought I'd clarify their behavior a
bit in case that helped.
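
To make the difference concrete, here is a minimal C++17 sketch of the matching behavior described above. It is not GraphAr or TinkerPop code; `Value`, `Vertex`, `Has`, `MakeListVertex`, and `MakeMultiPropertyVertex` are hypothetical names used only for illustration. The sketch assumes that a lookup compares the query against each stored property value as a whole, so a list value never equals a single string.

```cpp
#include <map>
#include <string>
#include <variant>
#include <vector>

// A property value is either a single string or a list of strings.
using Value = std::variant<std::string, std::vector<std::string>>;

// A multimap lets one key carry several separate properties,
// which is how the multi-property case is modeled here.
struct Vertex {
  long id;
  std::multimap<std::string, Value> properties;
};

// Mimics has(key, value): true when some property stored under `key`
// is a plain string equal to `value`. A list value never matches.
bool Has(const Vertex& v, const std::string& key, const std::string& value) {
  auto [first, last] = v.properties.equal_range(key);
  for (auto it = first; it != last; ++it) {
    const auto* s = std::get_if<std::string>(&it->second);
    if (s != nullptr && *s == value) return true;
  }
  return false;
}

// Vertex 0: one "name" property whose value is the list [alice, bob].
Vertex MakeListVertex() {
  return Vertex{0, {{"name", std::vector<std::string>{"alice", "bob"}}}};
}

// Vertex 2: two separate "name" properties, craig and dave.
Vertex MakeMultiPropertyVertex() {
  return Vertex{2,
                {{"name", std::string("craig")},
                 {"name", std::string("dave")}}};
}
```

With these definitions, `Has(MakeListVertex(), "name", "alice")` is false, while `Has(MakeMultiPropertyVertex(), "name", "dave")` is true, mirroring the two Gremlin queries.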

Component(s)

C++

Elssky commented 3 days ago

Here is a YAML example for multi-property. We introduce an is_multiple field to indicate whether a property is a multi-property.

# person.vertex.yaml
type: person
chunk_size: 1024
prefix: vertex/person/
property_groups:
  - properties:
      - name: id
        data_type: int64
        is_primary: true
        # primary property can not be multiple
        is_multiple: false
    prefix: id/
    file_type: csv
  - properties:
      - name: name
        data_type: string
        is_primary: false
        is_multiple: true
      - name: skill
        data_type: list<string>
        is_primary: false
        is_multiple: false
    prefix: name_age/
    file_type: csv
version: gar/v1
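
On the C++ side, the constraint noted in the YAML comment (a primary property cannot be multiple) could be checked when the vertex info is loaded. The following is only a sketch under that assumption: `PropertySpec` and `ValidateProperties` are hypothetical names, not part of the existing GraphAr API, and the struct mirrors just the four fields shown in the example.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical mirror of one entry under `properties:` in the YAML.
struct PropertySpec {
  std::string name;
  std::string data_type;
  bool is_primary = false;
  bool is_multiple = false;  // the proposed new field
};

// Returns an error message for the first invalid property, or
// std::nullopt when all properties pass. The only rule checked is
// that a primary property is not also marked as multiple.
std::optional<std::string> ValidateProperties(
    const std::vector<PropertySpec>& props) {
  for (const auto& p : props) {
    if (p.is_primary && p.is_multiple) {
      return "primary property '" + p.name + "' cannot be multiple";
    }
  }
  return std::nullopt;
}
```

The three properties in person.vertex.yaml would pass this check; setting is_multiple to true on id would be rejected.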

Given a vertex as follows:

id|name|skill
2|'craig','dave'|'guitar','boxing'
# or like this 
# 2|'craig','dave'|['guitar','boxing']

In this example, name has type string and is_multiple is true, so searching for a person with name 'craig' or 'dave' returns this vertex. skill has type list<string> and is_multiple is false, so searching for a person with skill ['guitar', 'boxing'] returns this vertex. However, searching for a person with skill 'guitar' or 'boxing' alone does not work.
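
The lookup rule described here can be sketched in C++ as follows. This is an illustration only: `SplitQuoted` and `Matches` are hypothetical helpers, and the cell encoding (single-quoted values separated by commas, with no escaped quotes) is an assumption taken from the sample row, since the on-disk format is not settled in this issue.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Splits a cell such as "'craig','dave'" into {"craig", "dave"}.
// Assumes single-quoted values with no escaped quotes; characters
// outside quotes (commas, brackets) are ignored.
std::vector<std::string> SplitQuoted(const std::string& cell) {
  std::vector<std::string> values;
  std::string current;
  bool in_quotes = false;
  for (char c : cell) {
    if (c == '\'') {
      if (in_quotes) values.push_back(current);  // closing quote ends a value
      current.clear();
      in_quotes = !in_quotes;
    } else if (in_quotes) {
      current.push_back(c);
    }
  }
  return values;
}

// is_multiple == true: a single-value query matches if it equals any
// one of the stored values.
// is_multiple == false: the stored list must equal the query as a whole.
bool Matches(const std::string& cell, bool is_multiple,
             const std::vector<std::string>& query) {
  const std::vector<std::string> values = SplitQuoted(cell);
  if (!is_multiple) return values == query;
  return query.size() == 1 &&
         std::find(values.begin(), values.end(), query[0]) != values.end();
}
```

For the sample row, `Matches("'craig','dave'", true, {"dave"})` is true, `Matches("'guitar','boxing'", false, {"guitar", "boxing"})` is true, and `Matches("'guitar','boxing'", false, {"guitar"})` is false.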