KirillKryukov / naf

Nucleotide Archival Format - Compressed file format for DNA/RNA/protein sequences
http://kirill-kryukov.com/study/naf/
zlib License

Function to compress / decompress sequence only #10

Open jguhlin opened 4 years ago

jguhlin commented 4 years ago

Is there, or could there be, a function to compress / decompress the sequence only? I'm using a format of my own as a replacement for FASTA, with zstd for sequence compression (sequence IDs are not compressed). If you could expose a function for compression/decompression, or point me to the right place in the code, I could include naf as an option there.

Thanks

KirillKryukov commented 4 years ago

You can compress nameless sequences with naf. Just leave the IDs empty and it will work fine. A naf archive can hold one or many such nameless sequences without problems. As for decompressing, "unnaf --seq" outputs just the sequence data (no IDs, with all sequences concatenated into one), and "unnaf --sequences" outputs the sequences one per line (also without IDs).
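
The workflow described above (empty IDs on input, "unnaf --sequences" on output) could be driven from another program roughly as follows. This is only a sketch under assumptions: the helper names are hypothetical, an "empty ID" is assumed to mean a bare ">" header line in FASTA, and unnaf is assumed to take the archive path as a positional argument. The compression side is not shown because its command-line options are not stated in this thread.

```python
import subprocess
from typing import Iterable, List


def nameless_fasta(sequences: Iterable[str]) -> str:
    """Render sequences as FASTA records with empty header lines.

    Assumption: an empty ID is written as a bare '>' line.
    """
    return "".join(f">\n{seq}\n" for seq in sequences)


def unnaf_sequences_cmd(naf_path: str) -> List[str]:
    """Build the argument list for 'unnaf --sequences'.

    Assumption: the archive path is passed as a positional argument.
    """
    return ["unnaf", "--sequences", naf_path]


def read_sequences(naf_path: str) -> List[str]:
    """Run unnaf and return one string per sequence.

    Requires the unnaf binary on PATH; relies on '--sequences'
    printing one sequence per line, without IDs.
    """
    result = subprocess.run(
        unnaf_sequences_cmd(naf_path),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()
```

The text returned by nameless_fasta would be the input to the compressor, and read_sequences returns the sequences in archive order, so a caller that stores IDs separately can pair them back up by position.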

There's no API currently, but the naf code is relatively small and straightforward. I guess the relevant code could be embedded into your project if you'd like to avoid calling an external binary and streaming the data back and forth. It would be nice if you supported naf as an option in your format, and I'll be glad to help if you need any adjustments from my side.

jguhlin commented 4 years ago

Thanks, I'll see if I can convert the code or embed it somehow. Since it's definitely an algorithm built for speed, I don't want to do any streaming. Will reach out if I need anything! Cheers