I was trying to parse a BUS file generated by kallisto bus, following the BUS file format spec. I encountered the following issues:
There is an extra 4-byte padding at the end of each record that is not documented explicitly. Any reader must skip those four bytes, otherwise the records go out of sync. (EDIT: I now see that this is indirectly documented, since the spec says each record is rounded up to 32 bytes.)
The endianness of record fields is not specified (on my Mac, the BUS files were little-endian). It would make sense to specify a single, machine-independent endianness, preferably little-endian (as it is now).
The encoding of the text header is not specified. UTF-8 would be a great choice.
(EDIT) The 2-bit encoding of nucleotides is not specified, nor is the endianness of the encoding.
(EDIT) The formats of the auxiliary transcripts.txt and matrix.ec files are not specified. They're pretty obvious, but still.
Also, flags are not specified, but the file had all-zero flags so I guess this is for future use.
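To make the padding, endianness, and 2-bit encoding points concrete, here is a rough sketch of how I read the records. Everything in it is my own assumption rather than something the spec states: the field order (barcode, UMI, equivalence class, count, flags, then 4 pad bytes), the forced little-endian byte order, and the A=0, C=1, G=2, T=3 mapping with the first base in the highest bits. The names `RECORD`, `read_records`, and `decode_2bit` are mine, and the stream is assumed to be positioned just past the header.

```python
import io
import struct

# Assumed record layout (my reading, not confirmed by the spec):
# barcode u64, UMI u64, equivalence class i32, count u32, flags u32, 4 pad bytes.
# "<" forces little-endian and disables native alignment; "4x" skips the padding.
RECORD = struct.Struct("<QQiII4x")


def decode_2bit(value, length):
    """Decode `length` bases from an integer.

    Assumes A=0, C=1, G=2, T=3, with the first base in the most
    significant of the 2 * length bits used.
    """
    bases = "ACGT"
    return "".join(
        bases[(value >> (2 * (length - 1 - i))) & 3] for i in range(length)
    )


def read_records(stream):
    """Yield (barcode, umi, ec, count, flags) tuples until the stream ends."""
    while True:
        chunk = stream.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break  # end of file, or a truncated trailing record
        yield RECORD.unpack(chunk)


# Round trip on a synthetic record (not real kallisto output).
demo = io.BytesIO(RECORD.pack(0b00011011, 5, 7, 2, 0))
for barcode, umi, ec, count, flags in read_records(demo):
    print(decode_2bit(barcode, 4), umi, ec, count, flags)
```

Without the `4x` at the end of the format string, the struct is 28 bytes and every record after the first is read at the wrong offset, which is the out-of-sync problem described above.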