bio4j / dynamograph

GSoC 2014 project - a DynamoDB based graph DB
GNU Affero General Public License v3.0

Initial design for DynamoDB table layout for GO #10

Closed alberskib closed 10 years ago

alberskib commented 10 years ago

@bio4j/dynamograph I propose the following initial table layout for GO:

GO_TERM: Hash Primary Key
{
- id - String - Hash Primary Key,
- name - String
- namespace - String
- definition - String
- synonyms - Set of Strings
- comment - String
- cross_ref - Set of String
- subset - Set of String
}
GO_RELATIONSHIP: Hash Range Primary Key
{
- type - String
- source - String (Hash Primary Key)
- target - String (Range Primary Key)
}
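
For illustration only, the two tables above could be described with dicts shaped like DynamoDB `CreateTable` parameters. This is a sketch, not the project's code: the helper name `table_spec` and the throughput values are assumptions, and nothing here calls AWS. Only key attributes are declared, since non-key attributes such as `name` or `definition` need no schema.

```python
# Sketch only: plain dicts shaped like DynamoDB CreateTable parameters.
# `table_spec` is a hypothetical helper; throughput values are placeholders.

def table_spec(name, hash_key, range_key=None, read=1, write=1):
    """Build a CreateTable-style parameter dict with String key attributes."""
    key_schema = [{"AttributeName": hash_key, "KeyType": "HASH"}]
    attribute_definitions = [{"AttributeName": hash_key, "AttributeType": "S"}]
    if range_key is not None:
        key_schema.append({"AttributeName": range_key, "KeyType": "RANGE"})
        attribute_definitions.append(
            {"AttributeName": range_key, "AttributeType": "S"}
        )
    return {
        "TableName": name,
        "KeySchema": key_schema,
        "AttributeDefinitions": attribute_definitions,
        "ProvisionedThroughput": {
            "ReadCapacityUnits": read,
            "WriteCapacityUnits": write,
        },
    }


# GO_TERM: hash key only. GO_RELATIONSHIP: hash (source) + range (target).
GO_TERM = table_spec("GO_TERM", hash_key="id")
GO_RELATIONSHIP = table_spec(
    "GO_RELATIONSHIP", hash_key="source", range_key="target"
)
```

One consequence of this key choice: because (source, target) is the whole primary key of GO_RELATIONSHIP, two relationships of different `type` between the same pair of terms would share one item.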

I removed the following values: 'secondary_ids', and 'synonyms' which are not of 'exact' type or are obsolete (according to). I decided to store exact synonyms as a set of Strings, as there are not many synonyms for a single term (if you know of a special case, or you think it would be better to create a separate table, please let me know). I do not know whether the solution should support looking up a Term by name/synonym. If so, synonyms should be extracted to a separate table:

GO_NAME : Hash Range Primary Key
{
- Id - Hash Primary Key
- Name - Range Key
}

with an index on name (in order to accelerate queries by name).

cross_ref - also stored as Set of String, as I do not know a lot about it and it seems that there is no gigantic number of those references (please correct me if I am wrong).

subset - again Set of String (if there is any additional requirement for subset, like querying by it or something similar, please let me know).

What is more, a single term does not have an enormous number of relationships (I think), so theoretically it would be possible to place them in the same table as 5 different fields (outgoing relationships), each of type Set of String, i.e.:

- is_a : Set of String
- part_of: Set of String
- has_part_of : Set of String
- negatively_regulates : Set of String  
- positively_regulates :  Set of String

but then it will be difficult to find incoming relationships. If we place 'relationship' in a separate table, we can use a global secondary index for retrieving incoming relationships.
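
To make the trade-off concrete, here is a small in-memory sketch (plain Python, no DynamoDB calls). The term ids and relationships are made-up placeholders, and the `by_target` dict only plays the role that a global secondary index on `target` would play.

```python
from collections import defaultdict

# Option A: outgoing relationships embedded in each term item as sets.
# All ids below are made-up placeholders.
terms = {
    "GO:0000002": {"is_a": {"GO:0000001"}},
    "GO:0000003": {"is_a": {"GO:0000001"}, "part_of": {"GO:0000002"}},
}


def incoming_embedded(target, rel_type):
    """Incoming lookup has to scan every term item."""
    return sorted(
        term_id
        for term_id, item in terms.items()
        if target in item.get(rel_type, set())
    )


# Option B: a separate relationship table, plus a lookup keyed by target
# that stands in for a global secondary index.
relationships = [
    {"source": "GO:0000002", "target": "GO:0000001", "type": "is_a"},
    {"source": "GO:0000003", "target": "GO:0000001", "type": "is_a"},
    {"source": "GO:0000003", "target": "GO:0000002", "type": "part_of"},
]

by_target = defaultdict(list)
for rel in relationships:
    by_target[rel["target"]].append(rel)


def incoming_indexed(target, rel_type):
    """Incoming lookup touches only the entries stored under `target`."""
    return sorted(
        rel["source"] for rel in by_target.get(target, []) if rel["type"] == rel_type
    )
```

Both functions return the same sources; the difference is that option A reads every term, while option B reads only the relationships pointing at the requested target.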

All comments, discussion, opinions are more than welcome. Maybe you know something about special requirement, special usage, queries that should influence layout.

alberskib commented 10 years ago

I was thinking about preparing a transformation function for the id. According to [link](http://beta.geneontology.org/page/ontology-structure), the id is a zero-padded 7-digit string with a GO: prefix. Thanks to this, the id could be stored as a number by removing the GO: prefix, but such a modification does not change anything (DynamoDB serializes Number to String).
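
A minimal sketch of such a transformation, assuming the id format described above (GO: prefix followed by exactly 7 digits). The function names are hypothetical.

```python
import re

# Assumed format: "GO:" followed by exactly seven ASCII digits.
GO_ID_PATTERN = re.compile(r"GO:([0-9]{7})")


def go_id_to_number(go_id):
    """Strip the GO: prefix and the zero padding, e.g. 'GO:0008150' -> 8150."""
    match = GO_ID_PATTERN.fullmatch(go_id)
    if match is None:
        raise ValueError(f"not a GO id: {go_id!r}")
    return int(match.group(1))


def number_to_go_id(number):
    """Restore the prefix and the zero padding, e.g. 8150 -> 'GO:0008150'."""
    if not 0 <= number < 10**7:
        raise ValueError(f"out of range for a 7-digit GO id: {number}")
    return f"GO:{number:07d}"
```

The two functions are inverses of each other on valid ids, so the string form can always be recovered from the stored number.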

eparejatobes commented 10 years ago

@alberskib cool I'll take a look at it tomorrow

eparejatobes commented 10 years ago

One general comment about indexes: it's better in general to avoid them. They have attribute limits, each table can have a limited number of them (5 I think) and they impose a restriction on the table size. In general it is always better to use other tables as indexes.

And about Set attributes: in 99% of cases they should not be properties but relationships.

Concerning relationships, the idea so to say is to have local indexes by default; for each rel R going X -[R]-> Y where X,Y are node types, we need direct access to

  1. rels r: out(x): R out of x: X
  2. rels r: in(y): R going to y: Y

For that a simple design that works nicely enough is

  1. a table for X out R with hash the X.id and range the R.id
  2. a table for Rs with hash R.id (range could be then a sort key for it, or anything else we see fit). Every rel of course has src and tgt ids X.id, Y.id
  3. a table for Y in R with hash the Y.id and range the R.id

Later this can be refined into more than one hash per node etc, with the map node -> hash stored in a different table etc. The key about this is that relationships are stored independently of nodes and in general accessing rels of a given type from a node would only depend on the number of rels with that type and orientation at that particular node.
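
The three-table design can be sketched in memory as follows. This is only an illustration of the access pattern under simplifying assumptions: the class and method names are hypothetical, dicts stand in for tables, and the ids are placeholders.

```python
from collections import defaultdict


class RelationTables:
    """In-memory stand-in for the three tables of one relationship type R."""

    def __init__(self):
        self.out_index = defaultdict(set)  # hash: X.id -> range keys: R.id
        self.rels = {}                     # hash: R.id -> {"source", "target"}
        self.in_index = defaultdict(set)   # hash: Y.id -> range keys: R.id

    def add(self, rel_id, source, target):
        """Store the relationship once and register it in both index tables."""
        self.rels[rel_id] = {"source": source, "target": target}
        self.out_index[source].add(rel_id)
        self.in_index[target].add(rel_id)

    def out(self, x_id):
        """All rels of this type going out of node x."""
        return [self.rels[r] for r in sorted(self.out_index.get(x_id, ()))]

    def into(self, y_id):
        """All rels of this type going into node y."""
        return [self.rels[r] for r in sorted(self.in_index.get(y_id, ()))]


# Placeholder data: three rels of one type between made-up nodes.
is_a = RelationTables()
is_a.add(1, "x1", "y1")
is_a.add(2, "x1", "y2")
is_a.add(3, "x2", "y1")
```

Both `out` and `into` only touch the entries stored under the given node id, so their cost depends on the number of rels of that type and orientation at that node, not on the total number of nodes or rels.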

About the data:

Anyway we can talk about this, maybe tomorrow? I'm in US EST

alberskib commented 10 years ago

Sure, we should discuss it. I am available tomorrow from 11AM to 5PM US EST. Taking your comments into account, I will prepare improvements.

alberskib commented 10 years ago

Generally, I am thinking about how we will tackle multiple different types of nodes from the DynamoDB perspective. The database allows a different count and type of attributes for each entry, so my question is whether we would like to have a single table named "Node" (or something like this), or separate tables for every data type, i.e. GO, uniprot etc.? (The same goes for relationships.)

Taking all your comments into account, the initial layout should look more or less like this:

//temporarily removed set attributes -> they will be stored in a separate table
GO_TERM_NODE: Hash Primary Key
{
- id - String - Hash Primary Key,
- name - String
- namespace - String
- definition - String
- comment - String
}

IN_RELATIONSHIP: Hash Range Primary Key
{
- id - String (Hash Primary Key)
- rel_id - Number (Range Key, GO_RELATIONSHIP id)
}

OUT_RELATIONSHIP: Hash Range Primary Key
{
- id - String (Hash Primary Key)
- rel_id - Number (Range Key, GO_RELATIONSHIP id)
}

GO_RELATIONSHIP: Hash Range Primary Key
{
- id - Number (Hash Primary Key)
- type - String 
- ... potential other attributes
- source - String (identifier of source node)
- target - String (identifier of target node)
}
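
As an illustration of this layout, the sketch below builds the three items that one relationship would produce, using DynamoDB's low-level attribute-value notation (`{"S": ...}` for String, `{"N": ...}` for Number, where the number is carried as a string). It assumes that IN_RELATIONSHIP.id holds the target node id and OUT_RELATIONSHIP.id holds the source node id. The function name and the sample values are hypothetical, and nothing here calls AWS.

```python
def relationship_items(rel_id, rel_type, source, target):
    """Return {table name: item} for a single relationship.

    Assumption: IN_RELATIONSHIP is keyed by the target node id and
    OUT_RELATIONSHIP by the source node id.
    """

    def number(value):
        # DynamoDB Number attributes are sent as strings.
        return {"N": str(value)}

    def string(value):
        return {"S": value}

    return {
        "GO_RELATIONSHIP": {
            "id": number(rel_id),
            "type": string(rel_type),
            "source": string(source),
            "target": string(target),
        },
        "OUT_RELATIONSHIP": {"id": string(source), "rel_id": number(rel_id)},
        "IN_RELATIONSHIP": {"id": string(target), "rel_id": number(rel_id)},
    }


# Placeholder values, not real ontology data.
items = relationship_items(42, "is_a", "GO:0000002", "GO:0000001")
```

Each relationship is written once to GO_RELATIONSHIP and referenced by `rel_id` from the two per-node tables, so node items themselves stay free of relationship data.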

alberskib commented 10 years ago

How about a meeting? Maybe tomorrow (20.05.2014)?

laughedelic commented 10 years ago

Meetings depend on @eparejatobes' availability. I guess, if we're trying not to use indices, then we should have a table per node type. But I don't actually know about this stuff.

eparejatobes commented 10 years ago

I have some free space tomorrow morning US EST time, I'll be online

alberskib commented 10 years ago

Unfortunately I am able to meet today (20.05) at 11AM - 5PM US EST. Alternatively, tomorrow (21.05) I will be available all day (with a break 5AM - 7AM US EST).

eparejatobes commented 10 years ago

@alberskib I think I can find some time in that interval you mention, I'll write here

eparejatobes commented 10 years ago

@alberskib in 1h?

eparejatobes commented 10 years ago

sorry maybe 30min

alberskib commented 10 years ago

@eparejatobes Great