CalabreseLab / seekr

A library for counting small kmer frequencies in nucleotide sequences.
http://seekr.org
MIT License

option to exclude k-mers with zero counts? #20

Open jmcalabr opened 3 years ago

jmcalabr commented 3 years ago

This is not really an issue, just a suggestion -- hope all is well!

We are finding new ways to use SEEKR by specifying custom alphabets. In these experiments, we are encountering words or k-mers that never occur, even in the entire mouse or human transcriptome.

Here (and probably in other use cases as well), it would be useful if seekr_norm_vectors and seekr_count_kmers could take an input flag that would cause them to exclude k-mers with "0" counts across all counted RNAs, or even a flag that would allow exclusion of k-mers that do not exceed a variance threshold across the counted sets of RNAs.

Thank you and take care, Mauro

Jessime commented 3 years ago

Hey Mauro, I've got lots of thoughts about this one!

At the highest level, I think you're reaching the limit of what it makes sense to expose at the command line, as opposed to writing some simple python scripts and having the flexibility that comes along with that.


by specifying custom alphabets

This is neat! Did you know that I've already got support for custom alphabets? You can see on this line (https://github.com/CalabreseLab/seekr/blob/master/seekr/kmer_counts.py#L100) that "AGTC" is just the default. You could count protein kmers if you wanted, haha.

Inside of a Python script, this would just be, for example:

from seekr.kmer_counts import BasicCounter

counter = BasicCounter('gencode.fa', alphabet='AGTCWS')
counter.get_counts()

input flag that would cause them to exclude k-mers with "0" counts across all counted RNAs.

The generalization of what you're asking here is, "Can we filter kmers?"

As I'm sure you remember, a big surprise for us was that we couldn't find a subset of kmers that gave us more predictive power than the full set of 6mers. Part of the reason it was so surprising was because I spent a ton of time trying. You've listed a couple here (zero counts across all RNAs, and a variance threshold across the counted RNAs).

I can think of several dozen more that we tried over the years.

I say all of this to say, I explicitly chose not to build something like seekr_filter_counts because there are so many options and it wasn't clear which of them were worth the effort to build into the official package.

On the other hand, a single filter is pretty easy from within Python. If you've run seekr_count_kmers, you could do:

import pandas as pd

def has_vals(col):
    return max(col) > 0   # your custom filter

df = pd.read_csv('kmers.csv', index_col=0)   # read
filtered_df = df.loc[:, df.apply(has_vals)]  # keep only the k-mer columns that pass the filter
filtered_df.to_csv('filtered_kmers.csv')     # write

which would exclude k-mers with zero counts. And notice that only the line commented as # your custom filter is actually interesting.
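
The variance-threshold version of your request is just a different filter function in the same spot. A minimal sketch, assuming the counts CSV has k-mers as columns; exceeds_variance and the 0.1 cutoff are placeholders, not anything in seekr:

import pandas as pd

def exceeds_variance(col, threshold=0.1):
    # keep k-mers whose counts actually vary across the counted RNAs
    return col.var() > threshold

df = pd.read_csv('kmers.csv', index_col=0)
filtered_df = df.loc[:, df.apply(exceeds_variance)]
filtered_df.to_csv('filtered_kmers.csv')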

What might make sense is to create a seekr_filter_counts that allows users to specify an arbitrary filter function. That'd allow you to incorporate filtering into your bash scripts, without me having to curate everyone's favorite filters, and without you having to go through me each time you want to try a new filter. This would still require someone to write the Python filters, of course, but it sounds like the ones you're thinking about right now are easy enough that someone in lab could write them up without too much trouble. A rough sketch of what such a script could look like follows.
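
Nothing below exists in seekr today; the script, the --filter argument, and the module:function convention are invented purely for illustration:

import argparse
import importlib
import pandas as pd

def main():
    # hypothetical CLI: filter the k-mer columns of a counts CSV with a user-supplied function
    parser = argparse.ArgumentParser(description='Filter k-mer columns of a seekr counts CSV.')
    parser.add_argument('infile', help='counts CSV produced by seekr')
    parser.add_argument('outfile', help='where to write the filtered CSV')
    parser.add_argument('--filter', required=True,
                        help='user filter as module:function, e.g. my_filters:has_vals')
    args = parser.parse_args()

    # look up the user's filter function, e.g. has_vals defined in my_filters.py
    module_name, func_name = args.filter.split(':')
    filter_func = getattr(importlib.import_module(module_name), func_name)

    df = pd.read_csv(args.infile, index_col=0)
    df.loc[:, df.apply(filter_func)].to_csv(args.outfile)

if __name__ == '__main__':
    main()

You'd invoke it with something like python filter_counts.py kmers.csv filtered_kmers.csv --filter my_filters:has_vals, where my_filters.py is a one-off file holding whatever filter you're testing that week.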


To summarize: custom alphabets already work; I've deliberately avoided baking specific filters into the package because there are too many candidates; a zero-count (or variance) filter is only a few lines of pandas; and a generic seekr_filter_counts that takes a user-supplied filter function might be the right thing to add.

Hope this is helpful! 😄

jmcalabr commented 3 years ago

Thanks Jessime!!

That all makes a ton of sense. To fill in some more details, we started using the custom alphabet function to count “k-mers” in dot-bracket RNA structure predictions, such as those that come out of prediction algorithms like RNAfold – for example:

1    CAAUUGUAAGUGACCACAAGCAUGCAUCUUGGACAUUUAUGUGCGUAAUCGCACACUGCUCAUUCCAUGUGAAUAAGGUCCUACUCUCCGACCCCUUUUGCAAUACAGAAGGGUUGCUGAUAACGCAGUCCCCUUUUCUUGGCAUGUUGUGUGUGAUUAU
160  AAUCGUCUGGGAUCCUAUGCACUAGAAAAGGAGGGUCCUCUCCACAUACCUCAGUCUCACCUUUCCCUUCCAGCAGGGAGUGCCCACUCCAUAAGACUCUCACAUUUGGACAGUCAAGGUGCGUAAUUGUUAAGUGAACACAACCAUGCACCUUAGACAU
320  GGAUUUGCAUAACUACACACAGCUCAACCUAUCUGAAUAAAAUCCUACUCUCAGACCCCUUUUGCAGUACAGCAGGGGUGCUGAUCACCAAGGCCCUUUUUCCUGGCCUGGUAUGCGUGUGAUUAUGUUU

1    .....(((((((.(((.............))).)))))))..((((((((((((.....((((.....))))....((((.........)))).....(((.....(((((((..((((......)))).)))))))....))).((((((((((.((((
160  ....((((((((((((...............)))))))).((((........((((((((...(((((......))))))))...........)))))........))))..((((((((((((..((((.........))))..))))))))).)))..
320  ))))....)))).))))))))))(((.......)))..............((((((((((.(((.....))).)))))).))))..((((.((((.........))))))))....))))))))))))..

In these predictions, some structures are disallowed – for example, a 4-mer ‘()()’ – and for this reason such k-mers will always register zero counts.

This is definitely something we can code in-line in Python, but I was thinking it might be useful to have an “exclude-zero-counts” option, mostly for the RNA structure problem, but also for when folks happen to be analyzing small sets of sequences.

In any case, I appreciate your design perspective -- if you think this is too granular, I am happy to go with that. If you get bored and want to include it in a SEEKR update, I am happy to go with that too. If this structural k-mer idea holds up over time, we may add some additional functions to SEEKR anyway, with probably a better perspective than I have now.

I really appreciate you continuing to take care of SEEKR.

Hope you and the family are well and everything is going well with work and life in SF too.

Take care and happy holidays, Mauro


Jessime commented 3 years ago

This is off topic from the original question, but I'm curious whether you've considered creating a third alphabet by crossing base with structure? You could have an alphabet that looks like:

{
    'A.': 'a',
    'A(': 'b',
    'A)': 'c',
    'G.': 'd',
    'G(': 'e',
    'G)': 'f',
    'U.': 'g',
    'U(': 'h',
    'U)': 'i',
    'C.': 'j',
    'C(': 'k',
    'C)': 'l'
}

so the first 30 letters of the sequence you gave as an example in your message would be new_seq = 'aagjdhkheeebgkkhagdjajgadaaaaf'. The parameter space is pretty big, but you could reasonably do 3mers or 4mers (12^4 ~= 21,000 possible words) like this.
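
A minimal sketch of that re-encoding, purely illustrative (pair_alphabet and encode_pairs below are my names, not anything in seekr):

# hypothetical helper: collapse each (base, structure) pair into a single character
pair_alphabet = {
    'A.': 'a', 'A(': 'b', 'A)': 'c',
    'G.': 'd', 'G(': 'e', 'G)': 'f',
    'U.': 'g', 'U(': 'h', 'U)': 'i',
    'C.': 'j', 'C(': 'k', 'C)': 'l',
}

def encode_pairs(seq, structure):
    # seq and its dot-bracket structure (e.g. from RNAfold) must be the same length
    return ''.join(pair_alphabet[base + dot] for base, dot in zip(seq, structure))

new_seq = encode_pairs('GAUC', '(..)')   # == 'eagl'

The resulting strings could then be written to a FASTA file and counted with alphabet='abcdefghijkl'.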

Anyway, I'll keep the --exclude-zero-counts suggestion in the back of my mind. Work is pretty busy at the moment, but I'll continue to mull it over. And if I find a slow weekend, I'll see if I can give something a shot.

Cheers!

jmcalabr commented 3 years ago

Thanks Jessime – I think that might be too big of an alphabet – the two situations we are trying to model are places where the same set of k-mers adopts different predicted structures and, conversely, places where different sets of k-mers adopt similar predicted structures. We will be performing limited tests at first, to see whether the structural predictions get us any information above random. I'm thinking that we will get a positive result (eventually) and that this type of analysis might be used to augment nucleotide-based k-mer similarity searches.

But right now, I’m just trying to recruit computational people to the lab!! Rotation students are working on these problems.

Glad to hear work is busy – I could imagine! Exciting times to be an RNA biologist – the CSO at Moderna came and gave a talk at UNC in September – very uplifting.

Hope all is well, Mauro


dan-sprague commented 2 years ago

Mauro,

I'm actually working on very similar problems in my research at the company I work for. Not sure if you remember, but I work in the same startup incubator that Moderna started in! Believe it or not, I'm still partially working on ncRNAs, kmers, and HMMs.

I don't have much to say about a Python implementation of what you're asking, other than to ask whether Jessime has had time to look at this yet.

As for the science, I have a couple of thoughts that I am able to share:

1) Check out this paper, which developed an encoding for secondary structure that is more information-rich than simple dot-bracket notation. Rather than bound/not bound, it encodes information about stems, loops, bulges, their lengths, etc. The encoding returned is the same length as the dot-bracket string and therefore the same length as the original sequence. With this you should be able to identify specific structures and then ask whether you see the same k-mers enriched in them or not:

https://pubmed.ncbi.nlm.nih.gov/24753415/

2) Using k-mer frequencies and the alphabet you described, I think you could reduce the alphabet size for "free" by using a binary bound/not-bound alphabet instead of dot-bracket. Since k-mers lose positional information, the difference between "(" and ")" is negligible (see the sketch after this list).

3) RNA structures are hard to work with computationally. I'd recommend looking into using the MFE from RNAfold as a covariate, i.e. if you tile across a long RNA and calculate the structure and its MFE at each position, you can average the MFE over all tiles to get a number for each k-mer. The data behind the mountain plot may be even more useful, since it encodes some information about stem length and overall structure.

4) If you want to attract people to the lab by flashing "DEEP LEARNING", I think the field has come far enough that I am quite certain it would not be a waste of time. Specifically transfer learning, which is the ability to train a large model on a closely related dataset and then "transfer" that model for use on a much, much smaller dataset. A deep language model can be trained to embed RNA structures into a feature vector of arbitrary dimension, much as you are embedding lncRNA sequences into a k-mer feature vector. The reason to consider this is that the embedding will correctly model the secondary structure, i.e. the model will learn that .((..(...)..)). is not the same as (..)..((...)).. (or whatever). To put it in SEEKR terms, they would be uncorrelated or negatively correlated, whereas with a k-mer alphabet you'd see very high correlation.

RFAM has ~4M sequences, although random RNA sequences (arbitrarily high number) would work just fine to train such a model. This is what I am referring to in terms of a closely related dataset.

As a final note for (4), the other recent development in this field is Bayesian inference on deep networks, which helps estimate uncertainty. There are now straightforward ways to know if a NN is "confidently wrong" about something; perhaps the most important paper is:

https://arxiv.org/abs/1506.02142

Because it is so damn easy to do.
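
Here is a quick sketch of the alphabet reduction in point 2 (the mapping and function names are mine, just to illustrate the idea). Collapsing '(' and ')' into one 'paired' symbol turns the 12-letter base-by-structure alphabet into 8 letters, so the 4-mer space drops from 12^4 = 20,736 to 8^4 = 4,096 possible words:

# hypothetical: binarize the dot-bracket string, then pair it with the bases (8-letter alphabet)
def binarize_structure(structure):
    # '(' and ')' both become 'p' (paired); '.' becomes 'u' (unpaired)
    return structure.replace('(', 'p').replace(')', 'p').replace('.', 'u')

binary_pair_alphabet = {
    'Au': 'a', 'Ap': 'b',
    'Gu': 'c', 'Gp': 'd',
    'Uu': 'e', 'Up': 'f',
    'Cu': 'g', 'Cp': 'h',
}

def encode_binary_pairs(seq, structure):
    binary = binarize_structure(structure)
    return ''.join(binary_pair_alphabet[base + state] for base, state in zip(seq, binary))

# e.g. encode_binary_pairs('GAUC', '(..)') == 'daeh'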

jmcalabr commented 2 years ago

Thanks Dan! I appreciate this feedback – great minds think alike! This is exactly the direction I want us to head in…just need to convince someone to take a “risk” and join my group to do it. Recruiting a computational student to the group after you and Jessime left has been a lot harder than I envisioned it would be – to that end, your point about advertising DEEP LEARNING is well-taken (along with the other points – they are well-taken too). Very happy to hear that your UNC training has proven useful. Always happy to hear updates on your success too, so don’t hesitate to keep me in the loop.

Also if you are in the mood to make tweaks to SEEKR, know that I appreciate it.

Good luck with everything and take care- Mauro
