Koeng101 / dnadesign

A Go package for designing DNA.

fastqindex #57

Closed Koeng101 closed 5 months ago

Koeng101 commented 6 months ago

I want a binary fastqindex, similar to the index format in the SLOW5 spec: https://hasindu2008.github.io/slow5specs/slow5-v1.0.0.pdf

This would mainly be used when writing a large fastq file to a data store, like S3, while still wanting to seek to specific reads in that fastq file. There would be two modifications, including standardization of size:

- (2 byte) uint16: length of read ID 
- (var byte) read ID (UUIDs can be used directly, or a hash of the identifier can be used). Often 16 bytes for a UUID
- (8 byte) uint64: start position
- (4 byte) uint32: length

That is 30 bytes per record for a typical run (2 + 16 + 8 + 4). If a PromethION flow cell returns 10,000,000 reads, the index file will be approximately 300 MB (about 286 MiB).
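
As a rough sketch of the layout above, a record could be encoded and decoded like this. `IndexRecord`, `Encode`, and `Decode` are hypothetical names, not part of dnadesign, and little-endian byte order is an assumption, since nothing above specifies one.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// IndexRecord is a hypothetical name for one entry in the proposed index.
type IndexRecord struct {
	ID     []byte // read ID bytes, e.g. a 16-byte UUID or a hash of the identifier
	Start  uint64 // byte offset of the read within the fastq file
	Length uint32 // byte length of the read's record within the fastq file
}

// Encode appends the record to dst in the order listed above:
// uint16 ID length, ID bytes, uint64 start, uint32 length.
// Little-endian is an assumption made for this sketch.
func (r IndexRecord) Encode(dst []byte) ([]byte, error) {
	if len(r.ID) > 0xFFFF {
		return dst, errors.New("read ID is too long for a uint16 length prefix")
	}
	dst = binary.LittleEndian.AppendUint16(dst, uint16(len(r.ID)))
	dst = append(dst, r.ID...)
	dst = binary.LittleEndian.AppendUint64(dst, r.Start)
	dst = binary.LittleEndian.AppendUint32(dst, r.Length)
	return dst, nil
}

// Decode reads one record from the front of src and reports how many
// bytes it consumed, so a caller can walk a whole index file.
func Decode(src []byte) (IndexRecord, int, error) {
	if len(src) < 2 {
		return IndexRecord{}, 0, errors.New("short buffer: missing ID length")
	}
	idLen := int(binary.LittleEndian.Uint16(src))
	total := 2 + idLen + 8 + 4
	if len(src) < total {
		return IndexRecord{}, 0, errors.New("short buffer: truncated record")
	}
	rec := IndexRecord{
		ID:     append([]byte(nil), src[2:2+idLen]...),
		Start:  binary.LittleEndian.Uint64(src[2+idLen:]),
		Length: binary.LittleEndian.Uint32(src[2+idLen+8:]),
	}
	return rec, total, nil
}

func main() {
	rec := IndexRecord{ID: make([]byte, 16), Start: 1024, Length: 512}
	buf, err := rec.Encode(nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(buf))              // 30 bytes for a 16-byte ID
	fmt.Println(len(buf) * 10_000_000) // 300000000 bytes for ten million reads
}
```

Because the ID length varies, finding the Nth record means walking every record before it, which is the cost the length prefix adds.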

Koeng101 commented 6 months ago

Hmm, I think static allocation of bytes might be interesting here.

- (16 byte) read ID (UUIDs can be used directly or a hash of the identifier can be used)
- (8 byte) uint64: start position
- (4 byte) uint32: length

This would allow you to allocate the whole index in memory up front: every record is exactly 28 bytes, so the exact number of reads is the byte length of the file divided by 28, and anything else sized per read can be allocated ahead of time as well.