JohnLonginotto / ACGTrie

A custom Numpy Trie for storing counts of sequences
8 stars 0 forks source link

Example warp table in OUTPUT #3

Open dbrowneup opened 7 years ago

dbrowneup commented 7 years ago

Hello, I think your algorithm is very interesting and I'm trying to understand how it works. In the OUTPUT file, you give an example table of warp pipes for the string 'ACGACCC':

                                      row  A   C   T   G COUNT SEQ                                           
                                     ----------------------------------                                         
                                       0   1   2   0   3   7                                                   
                                       1   0   5   0   4   2   C                                             
                                       2   0   7   0   6   4                                                   
                                       3   0   0   0   0   1   ACCC                                          
                                       4   0   0   0   0   1   ACCC                                          
                                       5   0   0   0   0   1   C                                             
                                       6   0   0   0   0   1   ACCC                                          
                                       7   0   8   0   0   2                                                 
                                       8   0   0   0   0   1                                                 

But unless I don't understand this correctly, I think there is an error. The above table seems to indicate the presence of a AG 2-mer (with SEQ ACCC) via the path of rows 0 -> 1 -> 4.

I think this was supposed to be a GA 2-mer, encoded in the table as follows:

                                      row  A   C   T   G COUNT SEQ                                           
                                     ----------------------------------                                         
                                       0   1   2   0   3   7                                                   
                                       1   0   5   0   0   2   C                                             
                                       2   0   7   0   6   4                                                   
                                       3   4   0   0   0   1   ACCC                                          
                                       4   0   0   0   0   1   CCC                                          
                                       5   0   0   0   0   1   C                                             
                                       6   0   0   0   0   1   ACCC                                          
                                       7   0   8   0   0   2                                                 
                                       8   0   0   0   0   1                                                 

With the path 0 -> 3 -> 4 and SEQ CCC.

On top of that, I am wondering if the table should also have a warp pipe in row 4, column C, so as to store counts for GAC?

Or since SEQ contains CCC, can we assume that GAC, GACC, GACCC will all return a count of 1? In which case row 4 would be unnecessary, as this information would be contained in row 3.

Is the k-mer ACG account for in this table?

Thanks!

JohnLonginotto commented 7 years ago

Hey man, thanks for the question :)

So i get this particular one a lot - the reason is that you must "consume" all SEQ before being able to use ANY pipe. So in this case it's not ()->(A)->(G) but ()->(AC)->(G) or ()->(A)->(CG) depending on how you want to think about it. The COUNT is true for all intermediaries too, so in the row with: 3 4 0 0 0 1 ACCC that's saying there's a count of 1 for G, GA, GAC, GACC, and GACCC. Only once all of those letters are consumed can you use a pipe. If a GAT comes along with a count of 1, the row SEQ here would be changed from "ACCC" to "A", with a T pipe and a C pipe. The T pipe will point to a row with no SEQ. The C pipe will point to a row with the SEQ "CC", as one "C" is encoded with the pipe itself.

That's the part i think is the least intuitive about the output files. Hopefully this very direct question/answer will help others understand it better too, so thank you :)

Any more questions feel free to ask!

dbrowneup commented 7 years ago

Thank you for the clarification, that definitely helps me understand the table better. I'd like to use this approach to count k-mers and so what I'm really wondering is how I could iterate over the table and say, construct a set of all 5-mers found in the sequences. It would be nice if it's something simple like:

kmer_set = warp_table.count_kmers(k=5, sample_size=100000)

Where the warp table is implemented as a Class and has some functions to perform operations on the table. An option to control sample size would be useful to help control memory usage, if there's lots of k-mers encoded in the table. Multiple random samples of the table could also give us insight into the variability of the k-mer content.

JohnLonginotto commented 7 years ago

Yeah, you're absolutely right - as it currently stands, ACGTrie doesn't offer much in the way of helper functions after you've made a table. A count function is surely a top candidate for such an addition, and i like your proposed syntax. In fact, the idea of taking all the various parts of an ACGTrie table and wrapping it into a single class with a bunch of class methods hanging off it is really great. I'll have to think of a way to make that modular though, as it would be nice if someone could create a new method as a module to this class, rather than submit an update to the class itself.

The only change i would make to your proposal to count_kmers is to have a "k-min=5, k-max=5" rather than a single k=5, as being able to cover a range of sizes of k is really ACGTrie's main selling point/niche. This is because when counting k=5, you visit all the nodes for k=4, k=3, k=2, etc, so you can get counts for those quite cheap. You'd probably want to be able to give an arbitary list of k-mer sizes too, like:

kmer_set = warp_table.count_kmers(k=[3,5,9,12,15], sample_size=100000)

So yeah, no k-min/max, just k=int/[list]/(tuple). Then k=range(1,6) would do it.

The sample_size is going to be a bit tricky to implement, as if it's just a limiter for nodes visited, then we need to think about the different ways in which trie's can be walked - depth-first, as in keep going down until we hit an end/leaf node, then back up a level, or, fully enumerate a level before going down to the next size of k. Perhaps there are even some other ways I haven't thought of, like A/C/G/T-balanced or something. This will be a universal issue with ACGTrie analyses, so it should be a part of the vocabulary, e.g.

kmer_set = warp_table.count_kmers(k=5, sample_size=100000, walk=acgtrie.depth_first)

otherwise, warp_table.count_kmers(sample_size=100) could give very different results depending on the shape of the trie itself, and so comparisons of limited counts between shallow/deep tries might be misleading. Or maybe enlightening. Depends on the user's question i guess.

There is also the issue of, do we want to sum up raw counts of things, or do we want to sum up the percentage frequencies (relative counts). so if "ACCGTTGGT" has a COUNT of 15, that could be represented as "15", or as a % of all counts ("0.000000134%"), a % of all counts where the k-mer size is 9 ("0.242%"), or as a % of the root node ACCGTTGG ("25%"), etc. So we'd either want a count function to handle all that too, or make different count functions for those types of count...? Not sure at this stage.

It's been a while since i worked on ACGTrie. Of all the programs i've written for my PhD, ACGTrie is my personal fav. and so I was saving it until the end to really develop fully. It could be another month or two until I get around to all this. I think i'll start with that wrapper class idea of yours though, as i can pile all these low-level functions into that, and then you could write a counter in just a few lines of code like:

warp_table.count(value='COUNT', k=range(10,20), walk=acgtrie.depth_first(limit=100000))

Looks kind of messy though. hm. What are your thoughts?

EDIT: in fact, maybe I just work on a walker generator, that you provide a walking method, limit and k-mer sizes to return:

walker = warp_table.seq_walk(method='depth_first', filter={k:range(5,10)}, limit=100000, return='COUNT')

for x in walker: print x
('ACCGT',1502)
('ACCGTT',802)
('ACCGTTG',302)
('ACCGTTGG',37)
('ACCGTTGGT',15)
...

From which a counter module would be as easy as:

def counter(walker):
    tally = collections.defaultdict(int)
    for seq,count in walker:
        tally[len(seq)] += count
    return tally

Yeah this really needs some thought I guess :)

dbrowneup commented 7 years ago

I like your idea of having a wrapper class for the warp table and a separate class to contain the methods for analysis/traversal. That would allow you to separate the core ACGTrie code from the various applications that people come up with. Plus, if the warp table is easily accessible from a class, like...

from ACGTrie import WarpTable

We could do things like count k-mers, sample the table, etc. Depth-first searches of the table could be one method of finding k-mers. I'm more interested in a breadth-first approach, where perhaps random nodes are selected for extension, if we are to sub-sample to complete set of nodes. It would be interesting to compare k-mer samples generated through either depth-first or breadth-first searches and see if/how the sampling method affects sample composition.

What I'd really like to do though is build some methods for comparing 2 or warp tables. Maybe some generators to iterate over the table and determine the Jaccard index across a range of k-values. We could also probably detect enrichment/depletion of specific k-mers in a set of sequences.

As for the expression of count values (i.e. absolute, relative, etc), I think it might be good to have the ability to specify which type of value to return. So maybe something like...

from ACGTrie import WarpTable, KmerCounter

warp_table = WarpTable("Sequences.fa", max_k=100000)
kmer_counter = KmerCounter(warp_table, k=range(10, 30), sample_size=100000, value='absolute')

result = list()
for kmer_set in kmer_counter:
    result.append(analyze(kmer_set))

If you're interested, I've been working on some ideas to which ACGTrie is quite relevant, and I'd be very glad to have your thoughts. My email is dbrowne@tamu.edu if you wanna drop me a line, then I can share some documents with you.

JohnLonginotto commented 7 years ago

What I'd really like to do though is build some methods for comparing 2 or warp tables.

Yeah for sure! That's where all the really interesting biological insight is going to come from. The comparison of sequencing data, not by genomic location, but by pure DNA content. To this day it blows my mind that this doesn't already exist, but i'm confident we can get something that works in the coming months.

Right now i'm writing up two other previously half-finished projects (https://github.com/JohnLonginotto/pybam & https://github.com/JohnLonginotto/uq), so it will be about 2 weeks until I can come back to ACGTrie and refactor it to be something usable like pybam and uq. But am definitely interested in hearing your thoughts before I do so, as we've already made some good progress i think, and certainly it's always good to optimise a project for a real-world use :) So i'll e-mail as soon as i've cleared my brain-cache of all non-ACGTrie stuff, which i think will be about two weeks, maybe less :)

dbrowneup commented 7 years ago

Sound great man, I look forward to talking more about this in a couple weeks!

On Sun, May 7, 2017 at 8:41 AM, JohnLonginotto notifications@github.com wrote:

What I'd really like to do though is build some methods for comparing 2 or warp tables.

Yeah for sure! That's where all the really interesting biological insight is going to come from. The comparison of sequencing data, not by genomic location, but by pure DNA content. To this day it blows my mind that this doesn't already exist, but i'm confident we can get something that works in the coming months.

Right now i'm writing up two other previously half-finished projects ( https://github.com/JohnLonginotto/pybam & https://github.com/ JohnLonginotto/uq), so it will be about 2 weeks until I can come back to ACGTrie and refactor it to be something usable like pybam and uq. But am definitely interested in hearing your thoughts before I do so, as we've already made some good progress i think, and certainly it's always good to optimise a project for a real-world use :) So i'll e-mail as soon as i've cleared my brain-cache of all non-ACGTrie stuff, which i think will be about two weeks, maybe less :)

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/JohnLonginotto/ACGTrie/issues/3#issuecomment-299707087, or mute the thread https://github.com/notifications/unsubscribe-auth/AE81-XAsQTPOe9vQjYcaOFEjZMSBkYEtks5r3cn0gaJpZM4M-vcT .