dpryan79 / libBigWig

A C library for handling bigWig files
MIT License

Clarification sought on coordinate schemas and positions #17

Closed andrewyatz closed 7 years ago

andrewyatz commented 7 years ago

Hi

I've been looking at building a library for accessing Big files from Perl using your library. It's gone pretty well, to be honest, but I have some questions about your interpretation of coordinates, which isn't clear from the documentation. I've pasted in the docs for one of your functions below, with some bits removed:

/*!
 * @brief Return bigWig entries overlapping an interval.
 * @param start The start position of the interval. This is 0-based half open, so 0 is the first base.
 * @param end The end position of the interval. Again, this is 0-based half open, so 100 will include the 100th base...which is at position 99.
 */
bwOverlappingIntervals_t *bwGetOverlappingIntervals(bigWigFile_t *fp, char *chrom, uint32_t start, uint32_t end);

I think it's your use of "...which is at position 99" that is confusing me. 0-based, half open to me would suggest that if you use 100 as your end value you will get the 100th base and its value should always be 100. Unless you are referring to the location of base 100's values in the arrays passed back by the routine in the bwOverlappingIntervals_t struct.

Also, I'm aware that when parsing BigWigs, their use of coordinates can differ based on their source data. Those derived from bedGraphs retain their 0-based, half-open system, whereas fixed and variable step use 1-start, fully-closed. I had a poke in the code and can see some mention of this, but I'm unsure whether you handle this internally so that we only have to work in 0-based, half-open coordinates.

Thanks and sorry for the badgering.

dpryan79 commented 7 years ago

The first base is at position 0 in a "zero-based half open" coordinate system. This is the same as for BED files and BAM files, for what it's worth. For example, the following specifies the first base in BED format:

chr1 0 1

I didn't come up with the format, this is just for consistency with it.
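
To make the "100 includes the 100th base, which is at position 99" wording concrete, here is a minimal sketch of half-open interval arithmetic. The helpers `interval_length` and `interval_contains` are hypothetical names for illustration only; they are not part of libBigWig.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not part of libBigWig): number of bases covered by
 * the 0-based half-open interval [start, end). */
uint32_t interval_length(uint32_t start, uint32_t end) {
    assert(start <= end);   /* an interval may be empty but not reversed */
    return end - start;
}

/* Hypothetical helper: true if the 0-based position pos lies inside
 * [start, end). The end coordinate itself is excluded. */
bool interval_contains(uint32_t start, uint32_t end, uint32_t pos) {
    return pos >= start && pos < end;
}
```

Under these assumptions, start=0 and end=100 cover exactly 100 bases: positions 0 through 99. Position 99 is the 100th base, and position 100 is not included. The BED line above (chr1 0 1) is the one-base case, covering only position 0.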

All bigWigs are always 0-based, the coordinates are never different. Fixed and variable-step files are also 0-based half open. This is different from the wiggle format, which always uses 1-based coordinates.
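
For anyone converting coordinates by hand, the relationship between the two conventions can be sketched as below. This assumes the text wiggle input is a fixedStep or variableStep record with a 1-based start and a span in bases. The type `half_open_t` and both function names are hypothetical and not part of libBigWig.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: a 0-based half-open interval [start, end). */
typedef struct {
    uint32_t start;
    uint32_t end;
} half_open_t;

/* Convert a 1-based fully closed range [first, last] to 0-based half-open.
 * Only the start shifts down by one; the end value stays the same number. */
half_open_t closed1_to_half_open0(uint32_t first, uint32_t last) {
    assert(first >= 1 && first <= last);
    half_open_t iv = { first - 1, last };
    return iv;
}

/* Convert a wiggle step record (1-based start, span in bases) to
 * 0-based half-open coordinates. */
half_open_t wiggle_to_half_open0(uint32_t wig_start, uint32_t span) {
    assert(wig_start >= 1 && span >= 1);
    half_open_t iv = { wig_start - 1, wig_start - 1 + span };
    return iv;
}
```

With these helpers, a wiggle record starting at 1 with a span of 1 becomes [0, 1), the same interval as the BED line chr1 0 1. The 1-based closed range 1 to 100 becomes [0, 100).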

andrewyatz commented 7 years ago

Don't worry, I'm not placing the sole responsibility for the coordinate system at your door. Far from it. I'm part of the Ensembl project and am fully aware of the UCSC coordinate scheme; however, there are times it catches me out.

As to your last point, I believe I got my info from this blog post by the UCSC genome browser team: http://genome.ucsc.edu/blog/the-ucsc-genome-browser-coordinate-counting-systems/ (specifically the last bit, concerning wiggle files). The way they had structured this, it looked to me as though fixed and variable step are stored in a BigWig as 1-start, fully-closed.

However, if they're not, then I can stop worrying and assume everything is 0-based, half-open, which makes everything a lot easier to understand.

Thanks

dpryan79 commented 7 years ago

I think they're just inconsistent in the wiggle format...which is crazy. To my knowledge bigWigs are always consistent here, or at least any produced by this library are.

andrewyatz commented 7 years ago

That's fantastic to hear!