4dn-dcic / pairix

1D/2D indexing and querying on bgzipped text files with a pair of genomic coordinates
MIT License

Is there a binary spec for the pairix format? #60

Closed (cmdcolin closed this issue 4 years ago)

cmdcolin commented 5 years ago

Hi there, is there a binary spec for pairix, or if not, just some code I could reference to understand the format? Thanks :)

SooLee commented 5 years ago

@cmdcolin Apologies for not noticing the issue until now! We don't have a binary version of the pairs format. The source code for reading and writing the pairix index is in the src folder of the repo. It was written on top of tabix, and the basic concept is the same, with just a slight modification.
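
For readers unfamiliar with the tabix concept referred to here, this is a minimal Python sketch of the binning function (`reg2bin`) given in the tabix and SAM specifications. It illustrates regular tabix only. Whether pairix uses the same bin layout unchanged is an assumption that this thread does not confirm, so check src/index.c before relying on it.

```python
def reg2bin(beg: int, end: int) -> int:
    """Return the smallest tabix bin that fully contains [beg, end).

    Coordinates are zero-based and half-open, as in the tabix/SAM specs.
    ASSUMPTION: this is the standard tabix scheme; pairix may differ.
    """
    end -= 1  # switch to an inclusive end coordinate
    if beg >> 14 == end >> 14:  # fits in one 16 kb bin
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:  # fits in one 128 kb bin
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:  # fits in one 1 Mb bin
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:  # fits in one 8 Mb bin
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:  # fits in one 64 Mb bin
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0  # only the root bin (512 Mb) covers the interval
```

An interval that stays inside a single 16 kb window lands in a leaf bin, while one that crosses a window boundary is pushed up to the next larger bin level.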

cmdcolin commented 5 years ago

Sorry for not providing more info. I'm interested in possibly writing a separate implementation, and I would need to dig deep into the existing one to do so. I was curious whether the index file is a plain tabix index, and which source code handles the queries (I would probably need full 2D queries).

SooLee commented 5 years ago

@cmdcolin The index format is a modified version of the tabix index format, so a regular tabix index wouldn't work with it. Most of the C source code related to building and using the index is in src/index.c. I think modifying this file would be sufficient in most cases, though there are other source files in the src directory that handle more internal details, such as implementing the hash used for the index. Does this answer your question?
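
As a point of comparison, the sketch below parses the header of a regular tabix index (`.tbi`) following the layout in the tabix specification: a BGZF-compressed stream starting with the magic `TBI\1`, eight little-endian 32-bit integers, and a block of NUL-terminated sequence names. The function name `read_tabix_header` is hypothetical. Because the pairix index is a modified format, this code is not expected to parse a pairix index as written; it is only a starting point to compare against src/index.c.

```python
import gzip
import struct


def read_tabix_header(raw: bytes) -> dict:
    """Parse the header of a *regular* tabix index from its raw file bytes.

    ASSUMPTION: `raw` is a standard .tbi file. A pairix index uses a
    modified layout and should be rejected by the magic check below.
    """
    # BGZF is a series of gzip members, so plain gzip can decompress it.
    data = gzip.decompress(raw)

    if data[:4] != b"TBI\x01":
        raise ValueError("not a regular tabix index (bad magic)")

    # Eight little-endian int32 fields follow the 4-byte magic.
    (n_ref, fmt, col_seq, col_beg, col_end,
     meta, skip, l_nm) = struct.unpack_from("<8i", data, 4)

    # Sequence names: l_nm bytes of NUL-terminated strings.
    names_start = 4 + 8 * 4
    names_blob = data[names_start:names_start + l_nm]
    names = [n.decode("ascii") for n in names_blob.split(b"\0") if n]

    return {
        "n_ref": n_ref,      # number of reference sequences
        "format": fmt,       # 0 generic, 1 SAM, 2 VCF
        "col_seq": col_seq,  # column holding the sequence name
        "col_beg": col_beg,  # column holding the start coordinate
        "col_end": col_end,  # column holding the end coordinate
        "meta": chr(meta),   # leading character of comment lines
        "skip": skip,        # number of header lines to skip
        "names": names,
    }
```

In practice one would call it as `read_tabix_header(open(path, "rb").read())`, where `path` points at a `.tbi` file.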

cmdcolin commented 5 years ago

My goal would be to write a reader for the format in a different language, so a file format specification would be useful, similar to the hts-specs docs.

SooLee commented 5 years ago

We have pypairix for Python (src/pairixmodule.c) and Rpairix for R (https://github.com/4dn-dcic/Rpairix), but both of them use the same underlying C source (src/index.c), so reusing that source could be one possibility.

The structure of the index is defined in the same C file (https://github.com/4dn-dcic/pairix/blob/master/src/index.c#L54).

I will see if I can write documentation for the index structure (though it's unlikely that I will have time in the next few weeks).

cmdcolin commented 5 years ago

Thanks. The reimplementation would most likely be from scratch (in JavaScript) rather than built on the C source code, so a file spec would be handy.

SooLee commented 4 years ago

@cmdcolin My apologies for the delay. Here is the index spec. Please let me know if this is what you were looking for: https://github.com/4dn-dcic/pairix/blob/master/pairix_index_spec.pdf