TheEvergreenStateCollege / smarty-plants

Plant genome sequencing
2 stars 2 forks source link

suffix trees for finding largest common substring #3

Open AbyssalRemark opened 4 months ago

AbyssalRemark commented 4 months ago

Suffix trees are a clever way of holding strings that allows us to do really fast string operations such as, longest common sub-string. As well as others. It might make sense to construct our data into suffix trees to allow us to construct the graph more efficiently.

Here is a picture from wikipedia

Suffix_tree_BANANA svg (lets see if that works)

I read a paper a while back talking about relaxing the definition of suffix trees to allow it to be even faster for the express purpose of mRNA seq (should be true for DNA too)

Will add that paper once I find it.

AbyssalRemark commented 4 months ago

This might be it. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6929406/ This talks about a de novo which is assembling without a scaffold to build from. Which is in part what were doing.

tw-edu commented 4 months ago

Are you proposing we use Suffix Trees for the assembly? It seems the main paper in the repo proposes using de Bruijn Graphs so I'm just curious what the pros/cons would be with each method.

https://www.pnas.org/doi/10.1073/pnas.171285098

AbyssalRemark commented 4 months ago

Are you proposing we use Suffix Trees for the assembly? It seems the main paper in the repo proposes using de Bruijn Graphs so I'm just curious what the pros/cons would be with each method.

https://www.pnas.org/doi/10.1073/pnas.171285098

So. Yea thats what this paper is talking about. Thats the relaxed part of things. Its an option sure. But im more interested in them for there general ease of string operations.

If we're checking forwards and backward and inversed (and inversed backwards) then the cost of getting our data into a suffix tree (I hope) makes the string operations less time consuming. Upfront data storage cost to save on the countless comparisons per data points.

But, we should totally look into things more.

AbyssalRemark commented 4 months ago

Update. According to Nancy in an email she sent out today: Our data should all be pointing the same direction according to our extraction methods. If we have DNA then we still could have the inverse (I think), but if we do have RNA (which is only one strand) then we need only check once in which case we get no benefit from suffix trees. (I still might make em for funzies because there kinda cool)

learner-long-life commented 4 months ago

Nice find for suffix trees, I'm learning about them now. Is longest-common-substring (LCS) finding using them useful for both de novo assembly and read alignment? It seems so to me.