TheEvergreenStateCollege / bioinformatics

Plant genome sequencing
2 stars 2 forks source link

DB queries to extend suffix trees #31

Closed learner-long-life closed 4 months ago

learner-long-life commented 4 months ago

Deprecated

After discussion with @AbyssalRemark it seems infeasible to update trees in the database. Rather, we should pivot to a design where we build the suffix tree in a sparse array, deallocate or mark as unused portions when we write them into db, defer operations on those deallocated subtrees, and then apply operations if we ever load those subtrees back into memory.

2024-07-16 Meeting Notes

Goal

Produce all necessary SQL queries for to update a suffix tree stored in a relational database that cover all three extension rules (UPDATE statements, not SELECT) in Ukkonen's algorithm, or show a counter-example suffix tree + extension that proves that this particular db approach may be impossible in general.

Background

Currently there are three extension rules to apply when ingesting a new (suffix) character in Ukkonen's algorithm.

The chief benefit of Ukkonen's is that this can be done in $O(n)$ time, for the $n$th character in an input string, as opposed to $O(n^2)$ or $O(n^3)$.

In our goal of persisting our suffix tree into a DB for partial construction and partial querying, this issue addresses the question:

If each edge in an existing suffix tree is represented as a row in the following tentative database schema:

model InputString {
  id          Int    @id @default(autoincrement())
  inputString String
}

model Node {
  id          Int    @id @default(autoincrement())
  parentId    Int
  childId     Int
  edgeString EdgeStringFragment[]
}

model EdgeStringFragment {
  id          Int    @id @default(autoincrement())
  stringFragment String
}