Closed ejacox closed 7 years ago
@ejacox @david4096 please add the key points from your slack discussion to this issue.
Thanks for rebooting this effort @ejacox ! Being able to efficiently interchange wiggle data has been a desired feature for a while!
While trying to sidestep the issue of how to name it, I believe that we can interchange "continuous-type" data using the current Feature
message without much change. One could place the values into the attributes field of a feature message.
Although this speaks to the flexibility of the data model, there are two problems with this approach. The first is that there is in some cases an order of magnitude (or more) increase in the transfer size for high resolution data, since we are repeating the start
, end
, reference_name
, and feature_id
on each value. The data we are trying to interchange has, at most, a resolution of a single base, so in some cases this may make interchange of a Wiggle file take an order magnitude or more time to transfer than the file.
The second problem is that these data often include missing values, for example, coverage counts of the exome will not present values for intergenic regions. Empty values and missing values may have different meaning in the data model, and returning a feature that says no value has been set is confusing.
From a non-technical perspective, the Feature message derives from the sequence ontology for representing the type of a feature. Wiggles are usually results of analyses performed on genomic data, so calling a wiggle value
a Feature
is perhaps an abuse of the data model. All this to convince myself that we shouldn't put the effort in to demonstrate interchanging these data using the current Feature model.
Now to your questions. First, I think that merging this with sequence annotations seems appropriate.
Second, I understand the distaste for wiggle, I don't rush to the term "continuous". Continuous, to me, should be reserved for truly continuous values. Say, for example, I wanted to model the horizontal charge distribution along the genome to predict coiling patterns. In that case, I would specify a continuous function across the length of the genome that has values at every point. Although I don't expect these analytical functions to be present in the API, I would need to be able to construct interbase requests to properly model them. In a mathematical sense Wiggle files are always discrete and usually discontinuous.
@ejacox offered the heuristic of thinking of Wiggles as streams of values in genome space. This is a useful way of thinking and leads me to the idea of a trace. Genomic data can be analyzed into a stream of scalars that define their own semantic meaning. A primary use case of wiggles is to describe the coverage of some genomic data--a stream of counts made in genome space. In this way, I can think of the wiggle as a trace of that stream of counts. Requesting a range replays the stream of counts as data, and so these wiggles are traces of streams.
Although it doesn't roll off the tongue, I'll propose Trace
and TraceSet
. The term has a secondary cognitive trick: creating a wiggle becomes anecdotally "tracing" some data.
With that in mind I'd like to consider how to associate Trace Sets with other message types. Standing alone it loses it's semantic meaning and it's unclear from the outside how it relates to the remainder of the API. For example, if I upload a trace set that describes the coverage of my read group set, this relationship could be made more clear to the client. Although trace sets are not necessarily derived data, providing some optional fields for linking it to other types seems important.
2
and span is 1
, what happens? Are there valid and invalid cases? Can span alone cover the use case of step?Thank you @david4096 for adding the comments. Some of your other comments: "here's some Avro for a first pass at wiggles https://github.com/ga4gh/schemas/pull/246/files".
" It seems to me that to get the right interchange fidelity you need the span
and step
fields using this model."
If step is 2 and span is 1, then every other base is sampled.
Step and span aren't part of the search range.
Continuous isn't just chip-seq. Isn't the info field for experimental details?
We can definitely just put anything else into the info
field that we don't know what to do with for now, but we have an Experiment
message in assay_metadata.proto
that seems like it might be a common use case for interchange these data. If we add Experiment to the possible types in the attributes
message (https://github.com/ga4gh/schemas/pull/700) then we would be able to provide useful fields for implementors to fill out to give the metadata for a Trace. However, if it's common enough that one would use this to interchange data generated via an Analysis
or Experiment
we might consider at least including some optional metadata fields.
If there is a semantic relationship between Trace's and some other data type present in the API, we should make that as clear and useful as possible.
message Continuous {
// The start position at which this signal occurs (0-based). This
// corresponds to the first base of the string of reference bases. Genomic
// positions are non-negative integers less than reference length.
int64 start = 1;
// The contiguous data values. Unsampled bases are given as NaN.
repeated double values = 2;
// Identifier for the containing continous set.
string continuous_set_id = 3;
// The reference on which this signal is defined (e.g. `chr20` or `X`).
string reference_name = 4;
}
This is the latest. I removed step and span in favor of just values with NaN for unsampled values. I also added reference_name and continuous_set_id, which just echo back what was in the query.
ContinuousSet did not change. It is still just a copy of FeatureSet.
I created the service with: GetContinuousSet SearchContinuousSet SearchContinuous
I added them to sequence_annotation protos, though they are completely separate. https://github.com/ejacox/schemas/blob/bigwig/src/main/proto/ga4gh/sequence_annotations.proto https://github.com/ejacox/schemas/blob/bigwig/src/main/proto/ga4gh/sequence_annotation_service.proto
I also implemented it in the server. I did not copy all of the FeatureSet tests yet, as these are copies of existing code. I will and them once the schema is established. https://github.com/ejacox/server/tree/bigwig
Various points:
Thanks for the comments @david4096.
The only fidelity issue I saw was if span and step can be changed. This doesn't apply in this case.
Continuous.start is the base position of Continuous.values[0]. Continuous.values[i] is the value of base position Continuous.start+i. Is that clearer?
I agree. That seems better. Not everywhere: the ListReferencesBases throws an error if value is out of range.
I can set a limit on the values size and break it up into multiple protocol objects. Any idea how many doubles fit it in 2MB?
For mango related interest @akmorrow13
@maximilianh would you mind taking a few minutes to comment on our approach to adding BigWig data support to the GA4GH schemas?
Looks good, though I'm not always sure what the final result after the discussion above is.
Naming: gbrowse calls wiggle "xyplot". Ensembl calls it continous data. http://uswest.ensembl.org/info/website/upload/wig.html
spec: Can your format transmit the wiggle "variableStep" case?
variableStep chrom=chr21 span=5 9411191 50 9411196 40 9411201 60 9411206 20 9411211 20 9411216 20 9411221 40 9411226 60 9411231 40 9411236 40 9411241 40 9411246 40 9411251 40 9411256 60
Thank you for your comments @maximilianh. The final format is just a starting position and an array of values (plus the reference name and the dataset id). The format can transmit any format. It converts step or span to array values. So, at positions 9411191 to 9411195 there would be 50. Any gaps are filled with NaN. We might add span or step into the format later.
there can be many multiple positions per reference, right? Some wiggle files are very sparse and cover only the exons, so having one starting position per exon would save many NaNs.
Indeed. We are working with the assumption that most queries will be for shorter regions. We can look at performance and add compression (e.g. span and step) later as needed. Another possibility is to send a series of messages with arrays covering only NaN values. This can be handled by the proposed message format already.
It was kept simple so that we can move forward with actually using it, with the idea of expanding the format for better performance later after we have thought through any issues.
@dzerbino please review
Hello all,
This looks mainly reasonable, and I agree with the way most threads of conversation converged.
I think the Continuous message could do with an 'end' variable. The values array could then be mapped uniformly over a region, even if length(array) != (end - start). This scaled projection would be very useful if the client (typically a web browser) does not want full resolution data. It would therefore want to request the signal over, say chrom1:0-1,000,000 but in 1000 bins only because there are only so many pixels on the screen. This ability to request desired resolution is a key element of the BigWig success, and should be embedded in the SearchContinuousRequest function.
HTH,
Daniel
Thanks for the comment @dzerbino. Would another variable called window_size or step be more explicit? How would you do the binning? Average value in the bin?
window_size would be a good explicit parameter, as step had a slightly different meaning in BigWig-speak (in BigWig, step and span are not necessarily equal, so you could define non-contiguous bins).
I would want one of (or all of) (mean, standard deviation, max, min) per bin. These can all be extracted one the command line using the Kent code in /kent/src/utils/bigWigSummary/.
On Mon, Feb 20, 2017 at 4:18 PM, ejacox notifications@github.com wrote:
Thanks for the comment @dzerbino https://github.com/dzerbino. Would another variable called window_size or step be more explicit? How would you do the binning? Average value in the bin?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ga4gh/schemas/issues/769#issuecomment-281120917, or mute the thread https://github.com/notifications/unsubscribe-auth/AACB10sYM1t8Foym9iZZ9XZFbAPdCXQxks5rebzDgaJpZM4LbfSb .
This is a great idea. I would have said exactly the same thing, only didn't want to add too much to your plate. But Daniel is entirely right, without summary levels the API cannot be used for browsing.
On Feb 20, 2017 10:30 AM, "Daniel Zerbino" notifications@github.com wrote:
window_size would be a good explicit parameter, as step had a slightly different meaning in BigWig-speak (in BigWig, step and span are not necessarily equal, so you could define non-contiguous bins).
I would want one of (or all of) (mean, standard deviation, max, min) per bin. These can all be extracted one the command line using the Kent code in /kent/src/utils/bigWigSummary/.
On Mon, Feb 20, 2017 at 4:18 PM, ejacox notifications@github.com wrote:
Thanks for the comment @dzerbino https://github.com/dzerbino. Would
another variable called window_size or step be more explicit? How would you do the binning? Average value in the bin?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ga4gh/schemas/issues/769#issuecomment-281120917, or mute the thread https://github.com/notifications/unsubscribe-auth/ AACB10sYM1t8Foym9iZZ9XZFbAPdCXQxks5rebzDgaJpZM4LbfSb .
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ga4gh/schemas/issues/769#issuecomment-281151217, or mute the thread https://github.com/notifications/unsubscribe-auth/AAS-TX2QTd9AiCjq9OyA9AIY3T5-p3WAks5redvUgaJpZM4LbfSb .
Thanks @maximilianh and @dzerbino your points are well taken! Practically speaking, having control over zoom level is important and we would like to make useful the BigWig tools via the API. We would like to work from the basis @ejacox has provided and begin to evaluate "zoom" features.
As noted by @dzerbino this schema may work nicely for some projections by simply adding an end
field.
The general problem, how to deal with sparse data in a region, is one that has in part been solved in mango with the addition of a binned coverage count. Since BigWig are real-valued, the analogy might not hold, but I believe that more modeling will result in a understandable and useful access pattern that allows us to bin data across the API. https://github.com/ga4gh/schemas/pull/739
Please add your further suggestions about zoom levels and summary statistics for Continuous Data here so that we might close this issue by merging https://github.com/ga4gh/schemas/pull/802 .
This issue was created in order to define a continuous data message type.
Continuous, or signal data, associates a value with sequence base pairs. This can be raw experimental data, such as ChIP-Seq data, or calculated values, such as conservation scores. There can be gaps (unsampled bases) in the values, so more than a simple array of values with a start point is needed. Such data is commonly stored in wiggle, bigwig, and bedgraph formats. In addition to just listing values, the formats can do some simple compression: either 'step', which can more compactly represent regularly sampled data, or 'span', which can more compactly represent windowed data.
There are at least a couple of points to discuss:
From PR #591 "The name was chosen because SequenceAnnotations will include more than Features; The second data type is continuous (wiggle) was not include in this PR to speed implementation."
From this discussion, it seems that it should be a part of the sequence_annotations. However, most reference to "sequence annotations" refer to genomic annotations, i.e. associated with genes, which does not apply here. The continous data could be its own service in that case.
Wiggle doesn't seem appropriate. 'Continuous' has been used as a place holder.
A draft schema follows:
`// This protocol defines a format for exchanging continous valued signal data, // such as those produced experimentally (e.g. ChIP-Seq data) or through // calculations (e.g. conservation scores). It can be used, for example, // to share data stored in Wiggle, BigWig, and BedGraph formats. // // Only bases with a signal are represented. Gaps in the values indicate bases // with no signal. // // Step and span from the wiggle format are used for a simple compression.
// A chunk of continuous data. Due to gaps in the signal, the values // cannot be represented by a single array, but require a set of arrays. message Continuous { // The start position at which this signal occurs (0-based). This // corresponds to the first base of the string of reference bases. Genomic // positions are non-negative integers less than reference length. int64 start = 1;
// If not one, values are defined for every 'step' base, leaving // gaps of undefined regions. int32 step = 2;
// The number of bases each value spans. For example if span is 5, // then the first value is defined over start plus the next 4 bases, the // second value over the following 5 bases, ... int32 span = 3;
// The data values. repeated double values = 4; }
// A set of continuous data. message ContinuousSet { // The ID of this continuous data set. string id = 1;
// The ID of the dataset this set belongs to. string dataset_id = 2;
// The ID of the reference set which defines the coordinate-space for this // set of data. string reference_set_id = 3;
// The display name for this dataset. string name = 4;
// The source URI describing the file from which this set was // generated, if any. string source_uri = 5;
// Remaining structured metadata key-value pairs. map<string, google.protobuf.ListValue> info = 6; }`