TileDB-Inc / TileDB-VCF

Efficient variant-call data storage and retrieval library using the TileDB storage library.
https://tiledb-inc.github.io/TileDB-VCF/
MIT License

Array size utility script #699

Closed: gspowley closed this 2 months ago

gspowley commented 2 months ago

Adds a utility script that reports the size of an array, broken down per column, and optionally displays each column's filters.

$ array-size.py array_uri
              name     type     data  offsets  validity    total  percent
0              fmt    uint8  8483150    66593         0  8549743   50.064
1             info    uint8  5989054    54952         0  6044006   35.391
2             qual  float32   529503        0         0   529503    3.101
3          end_pos   uint32   455988        0         0   455988    2.670
4   real_start_pos   uint32   455918        0         0   455918    2.670
5       start_pos*   uint32   441214        0         0   441214    2.584
6          alleles    bytes   329757   105369         0   435126    2.548
7           fmt_GT    uint8   119494     9184         0   128678    0.753
8       filter_ids    int32    10569     9184         0    19753    0.116
9               id    bytes     3887     9208         0    13095    0.077
10         sample*    bytes     2862      216         0     3078    0.018
11         contig*    bytes     1350      216         0     1566    0.009
Total size: 16.29 MiB

With the --filter option, the filters applied to each column are also shown:

$ array-size.py --filter array_uri
              name     type                                                                                 filter     data  offsets  validity    total  percent
0              fmt    uint8                                          (ZstdFilter(level=4), ChecksumSHA256Filter())  8483150    66593         0  8549743   50.064
1             info    uint8                                          (ZstdFilter(level=4), ChecksumSHA256Filter())  5989054    54952         0  6044006   35.391
2             qual  float32                                          (ZstdFilter(level=4), ChecksumSHA256Filter())   529503        0         0   529503    3.101
3          end_pos   uint32                     (ByteShuffleFilter(), ZstdFilter(level=4), ChecksumSHA256Filter())   455988        0         0   455988    2.670
4   real_start_pos   uint32                     (ByteShuffleFilter(), ZstdFilter(level=4), ChecksumSHA256Filter())   455918        0         0   455918    2.670
5       start_pos*   uint32  (DoubleDeltaFilter(reinterp_dtype=None), ZstdFilter(level=4), ChecksumSHA256Filter())   441214        0         0   441214    2.584
6          alleles    bytes                                          (ZstdFilter(level=4), ChecksumSHA256Filter())   329757   105369         0   435126    2.548
7           fmt_GT    uint8                                          (ZstdFilter(level=4), ChecksumSHA256Filter())   119494     9184         0   128678    0.753
8       filter_ids    int32                     (ByteShuffleFilter(), ZstdFilter(level=4), ChecksumSHA256Filter())    10569     9184         0    19753    0.116
9               id    bytes                                          (ZstdFilter(level=4), ChecksumSHA256Filter())     3887     9208         0    13095    0.077
10         sample*    bytes                                              (DictionaryFilter(), ZstdFilter(level=4))     2862      216         0     3078    0.018
11         contig*    bytes                                                                          (RleFilter())     1350      216         0     1566    0.009
Total size: 16.29 MiB
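
The script itself is not shown in this description. As a rough illustration of how the total, percent, and "Total size" figures in the output relate to each other, here is a minimal sketch. It assumes that total is data + offsets + validity, that percent is each column's share of the grand total, and that MiB means 2**20 bytes. The names ROWS and summarize are hypothetical and are not taken from array-size.py; the byte counts are copied from the output above.

```python
# Per-column byte counts copied from the output above:
# (name, data, offsets, validity). Names are kept as printed, asterisks included.
ROWS = [
    ("fmt", 8483150, 66593, 0),
    ("info", 5989054, 54952, 0),
    ("qual", 529503, 0, 0),
    ("end_pos", 455988, 0, 0),
    ("real_start_pos", 455918, 0, 0),
    ("start_pos*", 441214, 0, 0),
    ("alleles", 329757, 105369, 0),
    ("fmt_GT", 119494, 9184, 0),
    ("filter_ids", 10569, 9184, 0),
    ("id", 3887, 9208, 0),
    ("sample*", 2862, 216, 0),
    ("contig*", 1350, 216, 0),
]


def summarize(rows):
    """Return (table, grand_total): per-column totals and percent shares, largest first.

    Hypothetical helper: assumes total = data + offsets + validity.
    """
    totals = [(name, data + offsets + validity) for name, data, offsets, validity in rows]
    grand_total = sum(total for _, total in totals)
    table = [
        {"name": name, "total": total, "percent": round(100 * total / grand_total, 3)}
        for name, total in totals
    ]
    table.sort(key=lambda row: row["total"], reverse=True)
    return table, grand_total


table, grand_total = summarize(ROWS)
for row in table:
    print(f"{row['name']:>16} {row['total']:>9} {row['percent']:>8.3f}")
print(f"Total size: {grand_total / 2**20:.2f} MiB")
```

Under these assumptions the sketch reproduces the ordering and the summary line of the output above, which suggests the report is sorted by total size in descending order.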