GMOD / jbrowse-components

Source code for JBrowse 2, a modern React-based genome browser
https://jbrowse.org/jb2
Apache License 2.0

Autogenerate GC content track for refseq #1631

Closed cmdcolin closed 1 year ago

cmdcolin commented 3 years ago

It seems like it would be useful to automatically create a GC content track for a reference sequence

I thought perhaps we could synthesize one automatically, but this might be difficult: showTrack doesn't like auto-generated track configs that aren't actually part of the tree, because it uses resolveIdentifier

Alternatively, instead of auto-generating it, we could add a convenience function to the CLI, in either add-assembly or add-track, that automatically makes a GC content track

cmdcolin commented 3 years ago

Ref #281

jjrozewicki commented 3 years ago

I've been pretty deep on this issue for a couple weeks now, and my advice is that you should be careful with this.

The previous GC content plugin for JBrowse1 counted ambiguous nucleotides (N) in the denominator and also may have had an off-by-one error. Obviously I don't think any code that ships with JBrowse2 is going to have an off-by-one problem, but I do want to emphasize that there is value in that calculation being either dynamic or very configurable. Different people are going to want different counting behavior for uppercase vs. lowercase or whether or not N's are considered.

The plugin I've been working on is a general dynamic Nucleotide Content feature that is highly configurable for ATGC vs atgc vs N, and then also for average vs. skew.

So, my recommendation here is that even if the plugin cannot be dynamic, we should try to build some indexing solution so that the calculation itself can be dynamic based on the needs of different people in different labs.
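
To make the counting choices above concrete, here is a minimal sketch of a configurable calculation. It is not the JBrowse 1 plugin, the GCContentAdapter, or the plugin described in this thread; the names `GcOptions`, `gcFraction`, `gcSkew`, and `windowedGc` are hypothetical, and the skew formula (G - C) / (G + C) is the commonly used definition, assumed here rather than taken from any of that code.

```typescript
// Hypothetical options; these names are illustrative and not part of JBrowse.
interface GcOptions {
  countLowercase: boolean // if false, soft-masked (lowercase) bases are skipped
  includeN: boolean // if true, ambiguous N bases stay in the denominator
}

// Fraction of counted bases that are G or C.
function gcFraction(seq: string, opts: GcOptions): number {
  let gc = 0
  let total = 0
  for (let i = 0; i < seq.length; i++) {
    const ch = seq[i]
    const upper = ch.toUpperCase()
    // A base is lowercase when it differs from its uppercase form.
    if (!opts.countLowercase && ch !== upper) {
      continue
    }
    if (upper === 'G' || upper === 'C') {
      gc++
      total++
    } else if (upper === 'A' || upper === 'T') {
      total++
    } else if (upper === 'N' && opts.includeN) {
      total++
    }
  }
  return total === 0 ? 0 : gc / total
}

// GC skew, assumed here to be (G - C) / (G + C), case-insensitive.
function gcSkew(seq: string): number {
  let g = 0
  let c = 0
  for (let i = 0; i < seq.length; i++) {
    const upper = seq[i].toUpperCase()
    if (upper === 'G') g++
    else if (upper === 'C') c++
  }
  return g + c === 0 ? 0 : (g - c) / (g + c)
}

// Non-overlapping windows over half-open ranges [start, start + windowSize).
// The last window may be shorter than windowSize.
function windowedGc(seq: string, windowSize: number, opts: GcOptions): number[] {
  if (windowSize <= 0) {
    throw new Error('windowSize must be positive')
  }
  const out: number[] = []
  for (let start = 0; start < seq.length; start += windowSize) {
    out.push(gcFraction(seq.slice(start, start + windowSize), opts))
  }
  return out
}
```

With these definitions, `gcFraction('GCNN', { countLowercase: true, includeN: true })` is 0.5, while the same call with `includeN: false` is 1, which is the kind of difference between labs' expectations described above.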

cmdcolin commented 3 years ago

Definitely good to note. I'd like to have a "blessed" gc plotting feature, so definitely want to make it as accurate and flexible as possible. We have the early version of the GCContentAdapter on our master branch now, if you want to see about that...would be great to get feedback. Demo link http://s3.amazonaws.com/jbrowse.org/code/jb2/master/index.html?config=test_data%2Fconfig_demo.json&session=share-hjXAmPV8iX&password=D4dl3

Some of those features, like window size and counting Ns or lowercase, are not on master, but I would be happy to incorporate any changes

jjrozewicki commented 3 years ago

I took a stab at extending the GC Content plugin, and it has been useful so far for genome visualization in my lab.

The key things I added:

Customizable counting by regex is quite useful because it lets users completely control the behavior. It also allows enhanced plotting of soft-masked assemblies (e.g. plotting repeat density by customizing the regex to count lowercase).
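
As a rough illustration of regex-based counting, the sketch below computes the fraction of bases in a sequence that match a user-supplied pattern. It is not the code in the attached src.zip; `fractionMatching` is a hypothetical name, and the sketch assumes the pattern matches one character at a time (for example a character class).

```typescript
// Fraction of characters in seq matched by a single-character pattern.
// Hypothetical helper for illustration; not taken from the attached plugin.
function fractionMatching(seq: string, pattern: RegExp): number {
  if (seq.length === 0) {
    return 0
  }
  // Ensure the global flag is set so that match() returns every match.
  const flags =
    pattern.flags.indexOf('g') === -1 ? pattern.flags + 'g' : pattern.flags
  const re = new RegExp(pattern.source, flags)
  const matches = seq.match(re)
  return (matches ? matches.length : 0) / seq.length
}

// GC content, counting uppercase bases only.
const gcUpper = fractionMatching('GGCCatNN', /[GC]/)

// Repeat density on a soft-masked sequence: count lowercase bases.
const repeatDensity = fractionMatching('GGCCatNN', /[a-z]/)
```

Here the denominator is the full sequence length, so N bases are included; a different denominator would need a second pattern or an extra option.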

I have attached the source for my version to this post. It works well enough for my lab's purposes right now, but would need to be cleaned up, have tests added, and be made more user friendly before it could be properly released.

There's a demo here: https://degeneratestrategy.com/nuccontent/web/

src.zip

cmdcolin commented 3 years ago

@jjrozewicki that is not only awesome but also groundbreaking :) I believe it's the first third party jbrowse 2 plugin I've seen! great work

also really good reference implementation for both gc content and repeat density