Make histogram density tracks for feature tracks

cmdcolin commented 4 years ago

Different track types generally need histograms, because otherwise displaying them incurs too much data downloading:

NCList
BAM
Tabix

If we don't have this or can't get valid estimated feature densities, we can display "Zoom in to see data" similar to the sequence track

This falls under a larger category of "semantic zooming" but it's primarily needed to avoid large amounts of data downloads.

rbuels commented 4 years ago

We'll need to add things to the data adapter API to support this.

cmdcolin commented 4 years ago

Xref #771 for initial work on feature density estimation

laceysanderson commented 4 years ago

+1 for this feature :-) My JBrowse is loading really slow when zoomed out due to too many features

Tom-Shorter commented 3 years ago

Is there an ETA for this feature? I would love to use jbrowse 2 but I need this feature to do it and I don't want to have to code it myself as I don't have the time or skill.

My initial thought for a solution is that, when setting up a track, you could add another source file in which the features from the main track files are binned. Some info about these binned regions could also be provided within the file, such as a breakdown of the types of variants in the region. The track type to be displayed would then be set dynamically, either features or a histogram, based on the feature density, as it was in jbrowse 1. This binned file could also be created by JBrowse, with something as simple as adding a -b flag when running jbrowse add-track.
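A minimal sketch of the binning step described above, in Python. The function name, the tuple layout, and the 0-based half-open coordinates are assumptions made for illustration; this is not an existing JBrowse command or file format.

```python
from collections import Counter, defaultdict


def bin_features(features, bin_size):
    """Count features per fixed-width bin, with a per-type breakdown.

    `features` is an iterable of (start, end, feature_type) tuples in
    0-based half-open coordinates (an assumption for this sketch).
    Returns {bin_index: Counter({feature_type: count})}.
    """
    bins = defaultdict(Counter)
    for start, end, ftype in features:
        if end <= start:
            continue  # skip empty or malformed intervals
        first = start // bin_size
        last = (end - 1) // bin_size
        # A feature is counted once in every bin it overlaps.
        for idx in range(first, last + 1):
            bins[idx][ftype] += 1
    return dict(bins)


# Hypothetical example data, not taken from a real track.
features = [
    (100, 200, "SNV"),
    (150, 160, "indel"),
    (950, 1050, "SNV"),  # overlaps bins 0 and 1 when bin_size is 1000
    (2500, 2501, "SNV"),
]
binned = bin_features(features, bin_size=1000)
for idx in sorted(binned):
    total = sum(binned[idx].values())
    print(idx * 1000, (idx + 1) * 1000, total, dict(binned[idx]))
```

Counting a feature in every bin it overlaps is one possible choice; counting only the bin containing its start would keep the bin totals equal to the feature count.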

cmdcolin commented 3 years ago

is there a particular reason for needing feature density? we currently display a marker "Zoom in to see features" with an additional "force load" button

Tom-Shorter commented 3 years ago

The feature density allows users to see, at a glance, the areas of a chromosome which contain the most data. I wouldn't want to have to zoom into a chromosome and view no more than 100k bases (or fewer) at a time to figure out which areas are of most interest.

cmdcolin commented 3 years ago

This is a good perspective, thanks.

Given that there are some additional desires such as breakdown of feature type, it seems like there could be some custom needs related to this more than maybe a simple feature histogram, so could be something to keep in mind.

It might be that we allow changing between different "display types" at different scales. We can currently toggle this by hand using the track menu.

[Screenshot of the volvox test_data session]

Further, I do agree that some pre-processing would be useful. We currently rely on having, for example, an indexed GFF3 tabix file. In jbrowse 1, we did histogram computation during flatfile-to-json. We might need to add some histogram pre-processing to make this useful in jbrowse 2.

Tom-Shorter commented 3 years ago

Thanks Colin,

I've had a chance to look at the example now and if I'm correct you achieved this by defining three different tracks in the config.json file:

volvox-sorted.cram (ctgA, default display)
volvox-sorted.cram (ctgA, LinearPileupDisplay)
volvox-sorted.cram (ctgA, LinearSNPCoverageDisplay)

All 3 of the tracks use the same source file so does the display type option realise this so it allows the track type to be changed?

I doubt this method would work if I needed to define different source files, and I doubt that using a VCF file to generate a histogram of chr1 from a track with ~500k variants in chr1 alone would be efficient, so I would want to define another source file with the data already binned. I'm not sure what format the binned data file should take, but wig or bigwig would likely be the easiest for everyone. You have already implemented these file formats, so hopefully it'd be more a case of joining together existing jbrowse features and functionality than developing entirely new stuff.

If I could define different track configs for different scales that take effect based on the current display scale and re-draw the track then this would definitely be a solution worth looking into. I could then also define multiple files from which to draw the histogram with bins ranging from 100-1,000,000 or more bases and the file most suitable for the current display scale would be chosen. I did like the dynamic histogram style display for jbrowse1 but the above method would definitely give more control over the display.
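A rough sketch of how the "most suitable file for the current display scale" could be chosen, in Python. The function, the `min_px_per_bar` parameter, and the file names are hypothetical; no such selection logic is confirmed to exist in JBrowse.

```python
def pick_bin_size(bp_per_px, available_bin_sizes, min_px_per_bar=2):
    """Pick the finest pre-binned resolution whose bars would still be
    at least `min_px_per_bar` pixels wide at the current zoom level.

    Falls back to the coarsest resolution when zoomed out further than
    any available file supports.
    """
    candidates = sorted(available_bin_sizes)
    if not candidates:
        raise ValueError("no binned files configured")
    wanted = bp_per_px * min_px_per_bar
    for size in candidates:
        if size >= wanted:
            return size
    return candidates[-1]


# Hypothetical mapping of bin size (bp) to a pre-binned file.
binned_files = {
    100: "variants.100bp.bw",
    10_000: "variants.10kb.bw",
    1_000_000: "variants.1mb.bw",
}

for scale in (30, 2_000, 5_000_000):
    size = pick_bin_size(scale, binned_files)
    print(scale, "bp/px ->", binned_files[size])
```

Choosing the finest resolution that is still drawable keeps as much detail as the screen can show without fetching bins narrower than a pixel.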

scottcain commented 10 months ago

In a conversation with devs and curators at RGD (including @jdepons), the idea of having coverage histograms came up. The curators indicated they liked having histograms for reasons similar to those stated above (locating regions with high numbers of variants and low numbers of genes, or vice versa, for example).

laceysanderson commented 10 months ago

Yes, our users have been asking about this feature repeatedly since we moved to JBrowse2. It is definitely missed!

It also provides answers to some higher level questions that researchers have.

Some practical applications/workflows our users have used these histograms for in the past:

Open track with centromere satellite annotation. Zoom all the way out on chromosomes so they can see regions where these satellites are most concentrated. This has led to further research into the weird centromeres in Lentil as the results were not what they expected.
Open tracks with annotated repeats where each track is a specific class of repeats. Zoom all the way out so they can see the whole chromosome to determine if the distribution is even or if we have large repeat islands. Also to see if some chromosomes are showing more gypsy elements than others, for example.
Further to the repeats example above, I've been helping researchers with scripts to put their QTL analysis results into GFF3 so they can add these as a track to their own JBrowse session and see if their QTL are in gene rich regions or if they may be linked more to promoter regions, etc.

Overall having the higher level distribution seems to be very important to my researchers for orienting themselves and determining where they may want to focus their efforts just as @scottcain and @Tom-Shorter mentioned. I just thought I'd add my voice to theirs and provide some specific research question focused workflows in case that is helpful :-)

I've had multiple excited exclamations of how nice it will be to see these histograms now they can easily see the whole genome instead of a single chromosome at a time :-) Especially combined with the synteny views to explore how our wild species differ!

cmdcolin commented 9 months ago

sorry for the delay but thanks for the perspective @laceysanderson . I think the biological applications are really interesting. when I think about this issue, I can see sort of two angles

a) users can get a quick overview when there are too many features to display, and then it automatically switches to a normal feature display when zoomed in enough, but the user is not using the feature density to really gain any biological insight. feature density is maybe not often the "phenomenon of interest" so I do kind of wonder about even displaying feature density in this case, but it is perhaps a 'nice default behavior' in some cases too

b) using the feature density to gain a meaningful biological insight

for (b) I think that making sure that the feature density is well integrated is important, but it is sort of specialized, and an admin can manually create extra tracks for their instance, perhaps by using e.g. bedtools genomecov. but I guess it remains an issue that jbrowse could also try to offer this by default somehow. is just auto-toggling based on zoom level the best behavior in the case of (b)? and how can we best calculate the coverages? I know BAM and CRAM coverage can be quite useful also for e.g. CNVs, so maybe integrating with external tools (bedtools genomecov, mosdepth, etc) could be useful
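To make the "how can we best calculate the coverages" question concrete, here is a small Python sketch of a sweep over interval endpoints that produces runs of constant depth, printed as bedGraph-style lines. It is only similar in spirit to what coverage tools report and is not how bedtools or mosdepth are implemented; the function name, the half-open coordinates, and the example intervals are assumptions.

```python
from collections import defaultdict


def coverage_runs(intervals):
    """Return (start, end, depth) runs of constant, non-zero depth.

    `intervals` are (start, end) pairs in 0-based half-open coordinates
    on a single reference sequence (an assumption for this sketch).
    """
    deltas = defaultdict(int)
    for start, end in intervals:
        if end > start:
            deltas[start] += 1  # depth rises where an interval begins
            deltas[end] -= 1    # and falls where it ends
    runs = []
    depth = 0
    prev = None
    for pos in sorted(deltas):
        if prev is not None and depth > 0:
            if runs and runs[-1][1] == prev and runs[-1][2] == depth:
                # Extend the previous run when the depth did not change.
                runs[-1] = (runs[-1][0], pos, depth)
            else:
                runs.append((prev, pos, depth))
        depth += deltas[pos]
        prev = pos
    return runs


# Hypothetical intervals on ctgA.
runs = coverage_runs([(0, 10), (5, 15), (20, 25)])
for start, end, depth in runs:
    print(f"ctgA\t{start}\t{end}\t{depth}")
```

The sweep costs one sort over the interval endpoints, so it needs every feature in the region up front. That is the cost a pre-computed coverage file would avoid.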

fubar2 commented 6 months ago

@cmdcolin tl;dr: After playing with some very dense tracks, I think lower feature densities work pretty well in JB2, but an automated switch to binned count bar display at some point in zooming out could be helpful for many of the big, complex feature tracks we are working with.

> I think the biological applications are really interesting

Not possible to predict in general whether biology will be revealed - it may or may not be depending on the actual context - but automated adaptive views seem more satisfying and consistent to me. Personal taste, but I'd prefer a more informative chromosome view of binned repeat region counts or summed lengths, over a blank track with an invitation to intervene?

My experience with the same data in JB1 suggests that feature density drives adaption of each feature track to suit. For example a bed with more than 1000 or 2000 features in the viewport leaves less than a pixel each for a linear display! At that point it stops making sense to plot normally, even if the user wants to - and a pileup view may serve better, until again, with zooming further out, it gets too tall and too dense to be helpful, so a bar chart (100-200 bars) of windowed scaled counts might make sense if the costs of calculation are reasonable...
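The adaptive behaviour described above could be sketched like this in Python. The thresholds, the mode names, and both functions are hypothetical illustrations of the idea, not JB1 or JB2 behaviour.

```python
def choose_display(feature_count, viewport_px,
                   max_features_per_px=1.0, pileup_limit=5000):
    """Pick a display mode from the feature density in the viewport.

    All thresholds are illustrative assumptions:
    - up to about one feature per pixel: draw features normally
    - up to `pileup_limit` features: use a denser pileup-style layout
    - beyond that: fall back to a bar chart of windowed counts
    """
    if feature_count <= viewport_px * max_features_per_px:
        return "features"
    if feature_count <= pileup_limit:
        return "pileup"
    return "histogram"


def histogram_bin_size(region_bp, target_bars=150):
    """Window size in bp that yields roughly `target_bars` bars."""
    return max(1, -(-region_bp // target_bars))  # ceiling division


for count in (500, 3000, 50_000):
    print(count, "features ->", choose_display(count, viewport_px=1200))
print("bin size for 1 Mbp:", histogram_bin_size(1_000_000), "bp")
```

The feature count for the visible region has to come from somewhere cheap, such as an index or a density estimate, or the decision itself would need the full download it is meant to avoid.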

This blast from the past example JB1 browser in Galaxy should be viewable - try zooming in. The dense beds scale automagically to barcharts with sensible scales for absurdly dense features, down to segments and sequence at lower density- all at sensible resolutions and without any extra user effort.

> an admin can manually create extra tracks for their instance perhaps by using e.g. bedtools genomecov, but I guess it remains an issue

True, but automated switching to bins whenever there are too many (e.g.? > 1-5k) features to display individually in a linear track could help users see very large and complex data without any need to convert to bigwig or other tracks from beds for example and if it's configurable or optional, unlikely to harm?

fubar2 commented 6 months ago

@cmdcolin For my own use case will convert the underlying workflow to make a bigwig but have given the Galaxy JBrowse2 user the choice of a LinearPileupDisplay for enormous bed files. Much better when zoomed out and still useful when zoomed in so probably a good compromise - we'll soon see as it should be available on the European public Galaxy for testing soon...

cmdcolin commented 6 months ago

@fubar2 that is an interesting and good workaround. indeed, our normal FeatureTrack types use SVG based rendering by default, while the LinearPileupDisplay is canvas based. I'd like to make normal FeatureTracks faster by being canvas based at some point, but we are playing 4d chess so have a lot of todos. but that is great to hear about the live deployment on galaxy! if you have any interest in getting involved in the codebase, i'd be happy to help onboard

edit: typos

fubar2 commented 6 months ago

> if you have any interest in getting involved in the codebase, i'd be happy to help onboard

Thanks @cmdcolin - best I can promise is to arrange easy access to your code for hundreds of thousands of users....
Galaxy's interests are in making the best third party tools accessible, interoperable and convenient for users. Unfortunately, we're not so good at helping build them - unless they are in Python :)

cmdcolin commented 6 months ago

do you have any links about how jbrowse 2 is being integrated? i'd be interested. could also announce it at our PAG talk (attending currently)

fubar2 commented 6 months ago

@cmdcolin: tl;dr complications abound and it's not likely to be out in time for this PAG.

Fitting an application into Galaxy can have more or less one-off requirements because it is a highly restricted execution environment. Command line tools are relatively easy. The issues with other applications arise because of our peculiar needs - so please don't misunderstand the gory details below as a whinge! I'm thinking out loud about why the technical requirements for integrating a third-party application into Galaxy can get in the way sometimes.

To provide a web representation, the tool requires as much of the application logic as is required to write correct json for a web server. Can generate working track/display json easily enough because the documentation shows how and the CL generated json makes it easy to understand. It's the default session being an internal representation that presents a stumbling block for us at present. APIs make it much more predictable.

Only way I've found to figure anything out, such as default session colours in a wiggle track, is to generate "correct" examples with the web admin client and emulate them. Can do once they're available, but does not scale well because future changes to the internal representation could break it.

Removing all track configuration options from the tool is very tempting because it works well for many use-cases and users can adjust their own tracks. The challenge is that Galaxy's secret sauce is to hide all that dreary complexity from users, so they don't need to know any of it just to see their data. So this may take a while to figure out.

cmdcolin commented 6 months ago

I can definitely sympathize, it is hard to know all the config slots and just copying the config like you mention is something i do also

one idea i have had is publishing something like a json schema for our configuration. I have created "auto-generated docs" on our website by parsing our source code, and possibly that could be used to auto-generate a json schema as well. still complicated i am sure to convert that to galaxy but could help. the other option is of course making fewer knobs available.

we may also be able to create systems where our config and internal state have simpler "knobs" e.g. for config instead of setting the color on the display, just set it on the track. and for defaultSession, many things are just very odd, like offsetPx and bpPerPx on linear genome view for location, but we could make it so that it interprets a locString like ctgA:1-100 for example
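a rough sketch of what interpreting a locString could look like, written in Python so it is easy to try from a Galaxy tool. the helper names, the 1-based inclusive reading of the locString, and the relation offsetPx = start / bpPerPx (a view that displays a single reference sequence from its first base) are all assumptions for illustration, not the actual linear genome view logic.

```python
import re

# refName:start-end, with optional thousands separators in the numbers.
# Reference names that contain ":" are not handled by this sketch.
LOC_RE = re.compile(r"^(?P<ref>[^:]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")


def parse_loc_string(loc):
    """Parse 'ctgA:1-100' into (ref_name, start, end), 0-based half-open."""
    match = LOC_RE.match(loc.strip())
    if not match:
        raise ValueError(f"unrecognised locString: {loc!r}")
    start = int(match["start"].replace(",", ""))
    end = int(match["end"].replace(",", ""))
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates in {loc!r}")
    return match["ref"], start - 1, end


def view_state(loc, view_width_px):
    """Derive scale and scroll offset so that `loc` fills the view."""
    ref, start, end = parse_loc_string(loc)
    bp_per_px = (end - start) / view_width_px
    offset_px = start / bp_per_px  # assumed relation, see note above
    return {"refName": ref, "bpPerPx": bp_per_px, "offsetPx": offset_px}


print(view_state("ctgA:1-100", view_width_px=800))
print(view_state("ctgA:1,001-2,000", view_width_px=500))
```

the point is that a tool author would only write the locString, and the pixel-level values would be derived rather than copied from an exported session.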

just brainstorming, but hope that helps. feel free to make another thread as i know this one is a bit off topic now

fubar2 commented 6 months ago

@cmdcolin: thanks for the opportunity to provide some input - even if I am threadjacking. Not sure I'm the right person to start a new thread as I can only urge from the sidelines really. Those generated docs have been very helpful - thanks for those.

> just set it on the track.

Setting on the track sounds like a good place to me if that can be made to work easily. If a default session can be generated automagically from those more complete track settings (with wiggle colour and whatever), using the existing command line tool, to add a default session section to an existing config.json, that would save us a lot of work. The internal state will surely mutate with code updates and new features, so there's a risk that any code that assumes things about it might break with a new release, so probably best to keep out of it if we can?

GMOD / jbrowse-components

Make histogram density tracks for feature tracks #463