deeptools / pyBigWig

A python extension for quick access to bigWig and bigBed files
MIT License

1-based BigWig file as input #119

Closed: dshan25 closed this issue 2 years ago

dshan25 commented 3 years ago

Thanks very much for making this useful tool!

I have one question related to the 0-based half-open coordinates for this extension.

When my input BigWig file is 1-based (e.g. a phastCons bw file from the UCSC Genome Browser), do I still use 0-based half-open coordinates to access values with pyBigWig? Thanks!

dpryan79 commented 2 years ago

There are no coordinate conversions done, since I'm of the opinion that 1-based bigWig files should never exist.
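Since no conversion happens inside the library, a caller holding a 1-based position has to shift it before querying. A minimal sketch of that shift, assuming `bw` is a handle returned by `pyBigWig.open` and that the file's intervals are stored in the 0-based half-open form the extension expects; the helper name `value_at_one_based` is hypothetical and not part of pyBigWig:

```python
def value_at_one_based(bw, chrom, pos):
    """Look up the value at a 1-based position via a 0-based half-open query.

    `bw` is assumed to behave like a pyBigWig file handle, i.e. to offer
    values(chrom, start, end) with 0-based half-open coordinates.
    """
    if pos < 1:
        raise ValueError("1-based positions start at 1")
    # The 1-based position p corresponds to the half-open interval [p - 1, p).
    return bw.values(chrom, pos - 1, pos)[0]
```

With a real file this might be called as `value_at_one_based(pyBigWig.open("example.bw"), "chr1", 100)`, where the path and chromosome name are placeholders. If the file itself was written with shifted coordinates, this helper would not correct for that.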

sjgosai commented 1 year ago

I wanted to point out that the standard spec for wig and bigWig is that coordinates are 1-based. This can be found in the original spec publication here and on the UCSC file spec sheet here (relevant text quoted below).

The bedGraph format is a BED variant in which the fourth column is a floating point value that is associated with all the bases between the chromStart and chromEnd positions. Unlike the zero-based BED and bedGraph, for compatibility reasons the chromosome start positions in variableStep and fixedStep are one-based.

The bedGraph format, like all BED-based formats and most file formats used by UCSC, use "0-start, half-open" coordinates, but the wiggle ASCII text format for variableStep and fixedStep data uses "1-start, fully-closed" coordinates. Wiggle (variableStep and fixedStep) is the only format defined by UCSC that uses a 1-based format, for historical reasons. For example, for a chromosome of length N, the first position is 1 and the last position is N.
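The two conventions quoted above differ only in the start coordinate, which can be sketched as a pair of conversion helpers. The function names and the chromosome length below are hypothetical, chosen only to illustrate the quoted "first position is 1 and the last position is N" example:

```python
def one_based_closed_to_zero_based_half_open(start, end):
    """Convert a "1-start, fully-closed" interval to "0-start, half-open"."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Only the start shifts; the closed end equals the half-open end.
    return start - 1, end


def zero_based_half_open_to_one_based_closed(start, end):
    """Convert a "0-start, half-open" interval to "1-start, fully-closed"."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


# Hypothetical chromosome length, for illustration only.
N = 1000
first_base = one_based_closed_to_zero_based_half_open(1, 1)  # (0, 1)
last_base = one_based_closed_to_zero_based_half_open(N, N)   # (N - 1, N)
```

Both representations describe the same bases, so converting in one direction and back returns the original pair.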

Given that most labs, large consortium projects, and internet chat board discussions respect this convention, it's legitimate to expect tools to adhere to the published spec. Absent that, the README should open with this design decision, as it can torpedo analyses. FWIW @dpryan79, I agree with your opinion here, but opinions are irrelevant when dealing with the reality of the data in the wild.