Continuous Data - Githubissues

ejacox commented 7 years ago

This issue was created in order to define a continuous data message type.

Continuous, or signal data, associates a value with sequence base pairs. This can be raw experimental data, such as ChIP-Seq data, or calculated values, such as conservation scores. There can be gaps (unsampled bases) in the values, so more than a simple array of values with a start point is needed. Such data is commonly stored in wiggle, bigwig, and bedgraph formats. In addition to just listing values, the formats can do some simple compression: either 'step', which can more compactly represent regularly sampled data, or 'span', which can more compactly represent windowed data.

There are at least a couple of points to discuss:

Where does it go?

From PR #591 "The name was chosen because SequenceAnnotations will include more than Features; The second data type is continuous (wiggle) was not include in this PR to speed implementation."

From this discussion, it seems that it should be a part of the sequence_annotations. However, most reference to "sequence annotations" refer to genomic annotations, i.e. associated with genes, which does not apply here. The continous data could be its own service in that case.

What should it be called?

Wiggle doesn't seem appropriate. 'Continuous' has been used as a place holder.

A draft schema follows:

`// This protocol defines a format for exchanging continous valued signal data, // such as those produced experimentally (e.g. ChIP-Seq data) or through // calculations (e.g. conservation scores). It can be used, for example, // to share data stored in Wiggle, BigWig, and BedGraph formats. // // Only bases with a signal are represented. Gaps in the values indicate bases // with no signal. // // Step and span from the wiggle format are used for a simple compression.

// A chunk of continuous data. Due to gaps in the signal, the values // cannot be represented by a single array, but require a set of arrays. message Continuous { // The start position at which this signal occurs (0-based). This // corresponds to the first base of the string of reference bases. Genomic // positions are non-negative integers less than reference length. int64 start = 1;

// If not one, values are defined for every 'step' base, leaving // gaps of undefined regions. int32 step = 2;

// The number of bases each value spans. For example if span is 5, // then the first value is defined over start plus the next 4 bases, the // second value over the following 5 bases, ... int32 span = 3;

// The data values. repeated double values = 4; }

// A set of continuous data. message ContinuousSet { // The ID of this continuous data set. string id = 1;

// The ID of the dataset this set belongs to. string dataset_id = 2;

// The ID of the reference set which defines the coordinate-space for this // set of data. string reference_set_id = 3;

// The display name for this dataset. string name = 4;

// The source URI describing the file from which this set was // generated, if any. string source_uri = 5;

// Remaining structured metadata key-value pairs. map<string, google.protobuf.ListValue> info = 6; }`

kozbo commented 7 years ago

@ejacox @david4096 please add the key points from your slack discussion to this issue.

david4096 commented 7 years ago

Thanks for rebooting this effort @ejacox ! Being able to efficiently interchange wiggle data has been a desired feature for a while!

Adding new messages to the data model

While trying to sidestep the issue of how to name it, I believe that we can interchange "continuous-type" data using the current Feature message without much change. One could place the values into the attributes field of a feature message.

Although this speaks to the flexibility of the data model, there are two problems with this approach. The first is that there is in some cases an order of magnitude (or more) increase in the transfer size for high resolution data, since we are repeating the start, end, reference_name, and feature_id on each value. The data we are trying to interchange has, at most, a resolution of a single base, so in some cases this may make interchange of a Wiggle file take an order magnitude or more time to transfer than the file.

The second problem is that these data often include missing values, for example, coverage counts of the exome will not present values for intergenic regions. Empty values and missing values may have different meaning in the data model, and returning a feature that says no value has been set is confusing.

From a non-technical perspective, the Feature message derives from the sequence ontology for representing the type of a feature. Wiggles are usually results of analyses performed on genomic data, so calling a wiggle value a Feature is perhaps an abuse of the data model. All this to convince myself that we shouldn't put the effort in to demonstrate interchanging these data using the current Feature model.

Now to your questions. First, I think that merging this with sequence annotations seems appropriate.

Naming and continuity

Second, I understand the distaste for wiggle, I don't rush to the term "continuous". Continuous, to me, should be reserved for truly continuous values. Say, for example, I wanted to model the horizontal charge distribution along the genome to predict coiling patterns. In that case, I would specify a continuous function across the length of the genome that has values at every point. Although I don't expect these analytical functions to be present in the API, I would need to be able to construct interbase requests to properly model them. In a mathematical sense Wiggle files are always discrete and usually discontinuous.

@ejacox offered the heuristic of thinking of Wiggles as streams of values in genome space. This is a useful way of thinking and leads me to the idea of a trace. Genomic data can be analyzed into a stream of scalars that define their own semantic meaning. A primary use case of wiggles is to describe the coverage of some genomic data--a stream of counts made in genome space. In this way, I can think of the wiggle as a trace of that stream of counts. Requesting a range replays the stream of counts as data, and so these wiggles are traces of streams.

Trace Set

Although it doesn't roll off the tongue, I'll propose Trace and TraceSet. The term has a secondary cognitive trick: creating a wiggle becomes anecdotally "tracing" some data.

With that in mind I'd like to consider how to associate Trace Sets with other message types. Standing alone it loses it's semantic meaning and it's unclear from the outside how it relates to the remainder of the API. For example, if I upload a trace set that describes the coverage of my read group set, this relationship could be made more clear to the client. Although trace sets are not necessarily derived data, providing some optional fields for linking it to other types seems important.

Open questions

If step is 2 and span is 1, what happens? Are there valid and invalid cases? Can span alone cover the use case of step?
How to define searches? Range searches in the style of the rest of the API should work as long as we respect the step and span. e.g. return overlapping values defined on the range.
Are there any other metadata fields that are important to include to interchange chip-seq data? If these are experimentally generated should we attach an experiment message?

ejacox commented 7 years ago

Thank you @david4096 for adding the comments. Some of your other comments: "here's some Avro for a first pass at wiggles https://github.com/ga4gh/schemas/pull/246/files".

" It seems to me that to get the right interchange fidelity you need the span and step fields using this model."

ejacox commented 7 years ago

If step is 2 and span is 1, then every other base is sampled.

Step and span aren't part of the search range.

Continuous isn't just chip-seq. Isn't the info field for experimental details?

david4096 commented 7 years ago

We can definitely just put anything else into the info field that we don't know what to do with for now, but we have an Experiment message in assay_metadata.proto that seems like it might be a common use case for interchange these data. If we add Experiment to the possible types in the attributes message (https://github.com/ga4gh/schemas/pull/700) then we would be able to provide useful fields for implementors to fill out to give the metadata for a Trace. However, if it's common enough that one would use this to interchange data generated via an Analysis or Experiment we might consider at least including some optional metadata fields.

If there is a semantic relationship between Trace's and some other data type present in the API, we should make that as clear and useful as possible.

ejacox commented 7 years ago

message Continuous {
  // The start position at which this signal occurs (0-based). This
  // corresponds to the first base of the string of reference bases. Genomic
  // positions are non-negative integers less than reference length.
  int64 start = 1;

  // The contiguous data values. Unsampled bases are given as NaN.
  repeated double values = 2;

  // Identifier for the containing continous set.
  string continuous_set_id = 3;

  // The reference on which this signal is defined (e.g. `chr20` or `X`).
  string reference_name = 4;
}

This is the latest. I removed step and span in favor of just values with NaN for unsampled values. I also added reference_name and continuous_set_id, which just echo back what was in the query.

ContinuousSet did not change. It is still just a copy of FeatureSet.

I created the service with: GetContinuousSet SearchContinuousSet SearchContinuous

I added them to sequence_annotation protos, though they are completely separate. https://github.com/ejacox/schemas/blob/bigwig/src/main/proto/ga4gh/sequence_annotations.proto https://github.com/ejacox/schemas/blob/bigwig/src/main/proto/ga4gh/sequence_annotation_service.proto

I also implemented it in the server. I did not copy all of the FeatureSet tests yet, as these are copies of existing code. I will and them once the schema is established. https://github.com/ejacox/server/tree/bigwig

Various points:

The name is still 'Continuous'. I favor using a name that has been associated with this type of data before. The only things I could find were 'continuous' and, less frequently, 'signal'. If we are going to create a novel name, I think that a strong consensus on the choice would be best
The ContinuousSet is an exact copy of FeatureSet. Does ontolgy still apply to continuous data? I thought about just using FeatureSet. The name would be confusing though. Also, we would need to add a flag to say whether it is continuous or feature data.
I am only doing BigWig at the moment. BigWig could be the best internal format. It is indexed and compact. There are tools to convert Wiggle and BedGraph to BigWig. Can we require that the tools be downloaded before using other types?
A continuous query returns a single Continuous object with an array which covers a subset of the query region. NaN are used for unsampled values. NaN values are trimmed from the start and end of the array. The format is extensible. It can return just non-NaN values as a list of protocol objects. Also, span could be easily included in the message (indicating that the values repeat times from the start position). I think step would be more complicated.
Exceptions are thrown if the query extends beyond the end of the reference (as it is recorded in the BigWig file). Is this the best behavior? Otherwise a query partially overlapping the reference can return just the valid region. An exception could be raised if there is no overlap.
Can we put a limit on the size of the values array returned? How would this best be done? As part of the ContinuousSet message?

david4096 commented 7 years ago

One last stand for using Signal (or Trace?) instead, then I'll concede. We'll see if anyone fights for Wiggle.
I don't think ontology applies in the same way to signal data. It's usually a count of some sequence feature, or a scalar about a sequence feature. Adding it as another message type instead of feature set makes sense to me because of how different the RPC method is.
Yes, BigWig is great, especially if you provide documentation of using those tools to convert to it. We just want to be very confident that we aren't losing fidelity on some other common interchange format in the process.
Nice to trim the NaN values, does that change the start of the returned Continuous as well? I'm trying to wrap my head around writing the visualizer for this and what numbers you need to keep track of to project it back into genome space (if any).Could you type out some pseudocode example?
No, we should accept requests outside of the range of the reference and just impute the max if the tools throw an exception there. That's how it works elsewhere.
Do you have practical expectations here? Protobuf sets a soft limit around 2MB for practical reasons which we have kept to. If there are signals that can be larger than that, then we probably have to allow clients to page through a signal somehow.

ejacox commented 7 years ago

Thanks for the comments @david4096.

The only fidelity issue I saw was if span and step can be changed. This doesn't apply in this case.
Continuous.start is the base position of Continuous.values[0]. Continuous.values[i] is the value of base position Continuous.start+i. Is that clearer?
I agree. That seems better. Not everywhere: the ListReferencesBases throws an error if value is out of range.
I can set a limit on the values size and break it up into multiple protocol objects. Any idea how many doubles fit it in 2MB?

david4096 commented 7 years ago

For mango related interest @akmorrow13

kozbo commented 7 years ago

@maximilianh would you mind taking a few minutes to comment on our approach to adding BigWig data support to the GA4GH schemas?

maximilianh commented 7 years ago

Looks good, though I'm not always sure what the final result after the discussion above is.

Naming: gbrowse calls wiggle "xyplot". Ensembl calls it continous data. http://uswest.ensembl.org/info/website/upload/wig.html

spec: Can your format transmit the wiggle "variableStep" case?

variableStep chrom=chr21 span=5 9411191 50 9411196 40 9411201 60 9411206 20 9411211 20 9411216 20 9411221 40 9411226 60 9411231 40 9411236 40 9411241 40 9411246 40 9411251 40 9411256 60

ejacox commented 7 years ago

Thank you for your comments @maximilianh. The final format is just a starting position and an array of values (plus the reference name and the dataset id). The format can transmit any format. It converts step or span to array values. So, at positions 9411191 to 9411195 there would be 50. Any gaps are filled with NaN. We might add span or step into the format later.

maximilianh commented 7 years ago

there can be many multiple positions per reference, right? Some wiggle files are very sparse and cover only the exons, so having one starting position per exon would save many NaNs.

ejacox commented 7 years ago

Indeed. We are working with the assumption that most queries will be for shorter regions. We can look at performance and add compression (e.g. span and step) later as needed. Another possibility is to send a series of messages with arrays covering only NaN values. This can be handled by the proposed message format already.

It was kept simple so that we can move forward with actually using it, with the idea of expanding the format for better performance later after we have thought through any issues.

kozbo commented 7 years ago

@dzerbino please review

dzerbino commented 7 years ago

Hello all,

This looks mainly reasonable, and I agree with the way most threads of conversation converged.

I think the Continuous message could do with an 'end' variable. The values array could then be mapped uniformly over a region, even if length(array) != (end - start). This scaled projection would be very useful if the client (typically a web browser) does not want full resolution data. It would therefore want to request the signal over, say chrom1:0-1,000,000 but in 1000 bins only because there are only so many pixels on the screen. This ability to request desired resolution is a key element of the BigWig success, and should be embedded in the SearchContinuousRequest function.

HTH,

Daniel

ejacox commented 7 years ago

Thanks for the comment @dzerbino. Would another variable called window_size or step be more explicit? How would you do the binning? Average value in the bin?

dzerbino commented 7 years ago

window_size would be a good explicit parameter, as step had a slightly different meaning in BigWig-speak (in BigWig, step and span are not necessarily equal, so you could define non-contiguous bins).

I would want one of (or all of) (mean, standard deviation, max, min) per bin. These can all be extracted one the command line using the Kent code in /kent/src/utils/bigWigSummary/.

On Mon, Feb 20, 2017 at 4:18 PM, ejacox notifications@github.com wrote:

Thanks for the comment @dzerbino https://github.com/dzerbino. Would another variable called window_size or step be more explicit? How would you do the binning? Average value in the bin?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ga4gh/schemas/issues/769#issuecomment-281120917, or mute the thread https://github.com/notifications/unsubscribe-auth/AACB10sYM1t8Foym9iZZ9XZFbAPdCXQxks5rebzDgaJpZM4LbfSb .

maximilianh commented 7 years ago

This is a great idea. I would have said exactly the same thing, only didn't want to add too much to your plate. But Daniel is entirely right, without summary levels the API cannot be used for browsing.

On Feb 20, 2017 10:30 AM, "Daniel Zerbino" notifications@github.com wrote:

window_size would be a good explicit parameter, as step had a slightly different meaning in BigWig-speak (in BigWig, step and span are not necessarily equal, so you could define non-contiguous bins).

I would want one of (or all of) (mean, standard deviation, max, min) per bin. These can all be extracted one the command line using the Kent code in /kent/src/utils/bigWigSummary/.

On Mon, Feb 20, 2017 at 4:18 PM, ejacox notifications@github.com wrote:

Thanks for the comment @dzerbino https://github.com/dzerbino. Would

another variable called window_size or step be more explicit? How would you do the binning? Average value in the bin?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ga4gh/schemas/issues/769#issuecomment-281120917, or mute the thread https://github.com/notifications/unsubscribe-auth/ AACB10sYM1t8Foym9iZZ9XZFbAPdCXQxks5rebzDgaJpZM4LbfSb .

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ga4gh/schemas/issues/769#issuecomment-281151217, or mute the thread https://github.com/notifications/unsubscribe-auth/AAS-TX2QTd9AiCjq9OyA9AIY3T5-p3WAks5redvUgaJpZM4LbfSb .

david4096 commented 7 years ago

Thanks @maximilianh and @dzerbino your points are well taken! Practically speaking, having control over zoom level is important and we would like to make useful the BigWig tools via the API. We would like to work from the basis @ejacox has provided and begin to evaluate "zoom" features.

As noted by @dzerbino this schema may work nicely for some projections by simply adding an end field.

The general problem, how to deal with sparse data in a region, is one that has in part been solved in mango with the addition of a binned coverage count. Since BigWig are real-valued, the analogy might not hold, but I believe that more modeling will result in a understandable and useful access pattern that allows us to bin data across the API. https://github.com/ga4gh/schemas/pull/739

Please add your further suggestions about zoom levels and summary statistics for Continuous Data here so that we might close this issue by merging https://github.com/ga4gh/schemas/pull/802 .

ga4gh / ga4gh-schemas

Continuous Data #769

Adding new messages to the data model

Naming and continuity

Trace Set

Open questions