KirillKryukov / naf

Nucleotide Archival Format - Compressed file format for DNA/RNA/protein sequences
http://kirill-kryukov.com/study/naf/
zlib License

bioconda installation #12

Closed pratas closed 3 years ago

pratas commented 3 years ago

Hi Kirill,

would it be possible to add NAF to bioconda? (I guess it would be widely used by the community after a while, for example in pipelines.)

Best regards, Diogo

KirillKryukov commented 3 years ago

Hi Diogo,

This is a good idea! I will take a look.

(Also, in case you are familiar with bioconda's recipe format, any help is welcome.)

Thanks, Kirill

KirillKryukov commented 3 years ago

Hi Diogo,

I submitted a recipe pull request: https://github.com/bioconda/bioconda-recipes/pull/31566

Best, Kirill

EDIT: Merged now.

pratas commented 2 years ago

Hi Kirill,

sorry for the late answer (demanding week...). I just tested the bioconda installation and it seems to be working fine. Thank you for the work!

I just found a small issue:

$ ennaf -o VDB.naf VDB.fa
ennaf error: temporary directory is not specified. Please either set TMPDIR or TMP environment variable, or add '--temp-dir DIR' to command line.

$ ennaf --temp-dir xxx -o VDB.naf VDB.fa
ennaf error: can't create temporary file "xxx/VDB.fa.sequence"

But if I make the dir (mkdir xxx) it works fine.

I guess it is a Linux security constraint. Fixing this, or documenting it in the README.md, would be great for unfamiliar users.

I just compressed a database of viral genomes (nearly 1.5 GB) with ennaf down to 76 MB and decompressed it in a few seconds (this is great! Thank you for the effort!).

By the way, probably you are aware of this (but I didn't know)... Here it goes:

$ cmp OUT VDB.fa
OUT VDB.fa differ: byte 8963858, line 126912

$ head -n 126915 OUT | tail -n 10
GTCTTGGCGCCGGTCCTGTGTCTGTTTCTGCGGCCGGCGTTCTCGCCCCGCATTCTGCTTTAGCTATGCT
TGAAGATACTATTGATTACCCTGCTCGCGCCCATACTTTTGATGATTTCTGCCCTGAGTGCCGCAATCTT
GGTCTACAGGGCTGTGCTTTTCAATCTACTATCGCTGAGCTTCAGCGCCTTAAAATGAAGGTAGGTAAGA
CCCGGGAGTCCTAATTAATTTCCCTCTTGTGCCCCCTTCTGAGTTCTGCTTTATTTCTTTTTTCTGCGTT
TCGCGCTCCCTGGAAAAAAAAAAAAAAAA

>AB290917.1 |Torque teno midi virus 1 DNA| complete genome| isolate: MD1-032|Japan|Homo sapiens|Torque teno midi virus 1|complete
GGGTGGAGACTTTTAAACTATATAAGTAAGTAGGGTGGTGAATGGCTGAGTTTACCCCGCTAGACGGTGC
AGGGACCGGATCGAGCGCAGCGAGGAGGTCCCCGGCTGCCCATGGGCGGGAGCCCGAGGTGAGTGAAACC
ACCGAGGTCTAGGGGCAATTCGGGCTAGGGCAGTCTAGCGGAACGGGCAAGAAACTTAAATATATTTCTT
TTACAGATGAACAACCTATCAGCTCAAGACTTCTACAAAAAATGTACCTACAACTCAGAAACCAGAAACC

$ head -n 126915 VDB.fa | tail -n 10
GTCTTGGCGCCGGTCCTGTGTCTGTTTCTGCGGCCGGCGTTCTCGCCCCGCATTCTGCTTTAGCTATGCT
TGAAGATACTATTGATTACCCTGCTCGCGCCCATACTTTTGATGATTTCTGCCCTGAGTGCCGCAATCTT
GGTCTACAGGGCTGTGCTTTTCAATCTACTATCGCTGAGCTTCAGCGCCTTAAAATGAAGGTAGGTAAGA
CCCGGGAGTCCTAATTAATTTCCCTCTTGTGCCCCCTTCTGAGTTCTGCTTTATTTCTTTTTTCTGCGTT
TCGCGCTCCCTGGAAAAAAAAAAAAAAAA

>AB290917.1 |Torque teno midi virus 1 DNA| complete genome| isolate: MD1-032|Japan|Homo sapiens|Torque teno midi virus 1|complete
GGGTGGAGACTTTTAAACTATATAAGTAAGTAGGGTGGTGAATGGCTGAGTTTACCCCGC
TAGACGGTGCAGGGACCGGATCGAGCGCAGCGAGGAGGTCCCCGGCTGCCCATGGGCGGG
AGCCCGAGGTGAGTGAAACCACCGAGGTCTAGGGGCAATTCGGGCTAGGGCAGTCTAGCG
GAACGGGCAAGAAACTTAAATATATTTCTTTTACAGATGAACAACCTATCAGCTCAAGAC

I guess it happens because the line breaks of the sequences have different lengths. Perhaps adding this information to the README.md would be nice (if it is already there and I simply missed it, which I can't rule out after the many hours in front of this screen today, just ignore...).

Best regards, Diogo

KirillKryukov commented 2 years ago

Hi Diogo,

Thanks for the kind words, from you it means a lot!

For the temporary directory issue: it is mentioned in the compression manual ( https://github.com/KirillKryukov/naf/blob/master/Compress.md#temporary-storage ). Basically, the idea is that substantial temporary storage might be needed for compressing large data. It's possible to run out of disk space, and also to suffer from the slowness of a poorly chosen disk. Therefore ennaf does not try to be clever and guess where to put temporary files, but instead requires the user to choose the location. It can be specified in the environment, or with the '--temp-dir' option. If neither the environment nor the command line specifies the temporary directory, I don't consider it safe to proceed with compression. Maybe that's too strict; possibly the current directory could be used in that case?

Regarding the second point, about auto-creating the temporary directory: I never actually thought about it. I guess it could be useful, but then it would probably also need to auto-delete the directory when compression is done, and it does not seem worth the added complexity and risks. If the environment has neither TMP nor TMPDIR set, and you don't feel like setting up a dedicated directory, the easiest option is to just use '--temp-dir .'. That could possibly become the default behavior; I'll think about it.
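The lookup order described here can be sketched roughly like this (a hypothetical helper mirroring the documented behavior, not ennaf's actual C code):

```python
import os

def resolve_temp_dir(cli_temp_dir=None, env=os.environ):
    """Resolve the temporary directory the way described above:
    an explicit --temp-dir wins, then TMPDIR, then TMP; otherwise
    refuse to guess. Hypothetical helper, for illustration only."""
    if cli_temp_dir is not None:
        return cli_temp_dir
    for var in ("TMPDIR", "TMP"):
        if env.get(var):
            return env[var]
    raise RuntimeError(
        "temporary directory is not specified: set TMPDIR or TMP, "
        "or pass --temp-dir DIR")
```

Refusing to fall back to a default keeps the choice of disk (and its free space and speed) in the user's hands.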

By the way, probably you are aware of this (but I didn't know)... Here it goes:

$ cmp OUT VDB.fa
OUT VDB.fa differ: byte 8963858, line 126912

This is probably expected. unnaf always produces well-formed FASTA output, meaning that all lines wrap at the same length. If the input was not wrapped at the same length (i.e., not well-formed), as in this case, then compression/decompression will be lossy. It is lossless only for well-formed inputs. In your example, AB290917.1 is wrapped at length 60 in the original file, but after compression/decompression the entire file is line-wrapped at length 70 (the length of the longest line in the input). I think this part is not explained in the manual; I will clarify it.
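The rewrapping effect can be sketched in a few lines of Python (an illustration of the behavior described here, not unnaf's actual code):

```python
def rewrap(seq_lines, width):
    """Concatenate a record's sequence lines and re-wrap them at one
    uniform width, as a well-formed FASTA writer does (illustrative
    sketch, not unnaf's actual implementation)."""
    seq = "".join(seq_lines)
    return [seq[i:i + width] for i in range(0, len(seq), width)]

# A record wrapped at 60 in the original file...
original = ["GGGTGGAGACTTTTAAACTATATAAGTAAGTAGGGTGGTGAATGGCTGAGTTTACCCCGC",
            "TAGACGGTGC"]
# ...comes back wrapped at 70 (the longest line length in the file),
# so cmp reports a difference although the sequence itself is identical.
roundtrip = rewrap(original, 70)
assert "".join(roundtrip) == "".join(original)  # same sequence data
assert roundtrip != original                    # different line breaks
```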

Thanks!

Best, Kirill

pratas commented 2 years ago

Hi Kirill,

Thank you for your kind words as well! They also mean a lot coming from you. The --temp-dir behavior makes sense, and now I get the idea. Regarding the line breaks with different lengths: for biological purposes there is no problem; the only issue would be that someone benchmarking NAF might get the wrong idea that it is not lossless. By the way, have you ever thought about sorting the reads in a multi-FASTA, or combining NAF with an existing tool (for example: seqkit sort)? For large files it may bring substantial gains, for example when storing large reference databases. Thanks!

Best regards, Diogo

KirillKryukov commented 2 years ago

Hi Diogo,

Regarding the line breaks with different lengths: for biological purposes there is no problem; the only issue would be that someone benchmarking NAF might get the wrong idea that it is not lossless.

Yeah, I agree, it can be confusing or unexpected. NAF is lossless only on well-formed FASTA and FASTQ, I will clarify it in the manual soon.

EDIT: Added: https://github.com/KirillKryukov/naf/blob/master/Compress.md#is-naf-lossless

By the way, have you ever thought about sorting the reads in a multi-FASTA, or combining NAF with an existing tool (for example: seqkit sort)? For large files it may bring substantial gains, for example when storing large reference databases.

Haven't thought much about it. I mainly use NAF for individual genomes, for collections of genomes of the same or closely related species, and for single-gene datasets. In all these cases there should be no particular benefit from reordering the sequences. Reordering may be useful for storing an initially unsorted collection of viral or bacterial genomes, where sorting will group similar genomes together. I haven't experimented with different sorting strategies, so I'm not sure what would work best. It could be interesting to try compressing, e.g., the entire nt or nr database with different sorting schemes.
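As a toy illustration of the reordering idea (not part of NAF): ordering records by sequence content places similar sequences next to each other, which can help a compressor's matching window. A plain lexicographic sort of the sequences is one simple strategy; k-mer or minimizer-based orderings may group related genomes better.

```python
def sort_fasta_records(records):
    """Order (name, sequence) pairs by sequence so that similar
    sequences end up adjacent before compression. One simple
    strategy among many, for illustration only."""
    return sorted(records, key=lambda rec: rec[1])

records = [("seqA", "TTTTGGGG"), ("seqB", "AAAACCCC"), ("seqC", "AAAACCCT")]
ordered = sort_fasta_records(records)
# The two near-identical sequences are now neighbors.
assert [name for name, _ in ordered] == ["seqB", "seqC", "seqA"]
```

Note that reordering discards the original record order, so it is only appropriate when that order carries no meaning.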

Where reordering can be very useful is when compressing raw reads in FASTQ format. There are already FASTQ compression methods involving reordering (SCALCE, BEETL, MINCE, ORCOM, HARC, Assembltrie, FaStore, BdBG, FastqCLS, possibly more). I'd like to check if there is a standalone read sorter that can be used as a preprocessor before compression, as currently I'm not sure I want to add such a sorter into ennaf itself. However, the sequence content of a typical NGS read set already compresses well (with NAF or other methods); it's the quality scores that take up most of the space in compressed FASTQ files. Therefore my probable next step is to add some sort of quality quantizer, either into ennaf or as a standalone tool. Possibly such a quantizer already exists; I'm not sure.
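A quality quantizer of the kind mentioned could be sketched as follows; the bin edges and representative values here are made up for illustration (loosely inspired by Illumina-style quality binning), not a standard:

```python
# Map each Phred quality (0-41) to a representative bin value.
# Bin boundaries and representatives are illustrative only.
BINS = [(0, 1, 0), (2, 9, 6), (10, 19, 15), (20, 24, 22),
        (25, 29, 27), (30, 34, 33), (35, 41, 40)]

def quantize_quality(qual_str, offset=33):
    """Replace each quality character with its bin's representative,
    shrinking the quality alphabet (here to at most 7 symbols) so
    that the quality stream compresses much better. Lossy by design."""
    out = []
    for ch in qual_str:
        q = ord(ch) - offset
        for lo, hi, rep in BINS:
            if lo <= q <= hi:
                out.append(chr(rep + offset))
                break
    return "".join(out)
```

A smaller quality alphabet trades some fidelity for compression ratio, which is exactly the knob such a quantizer would expose.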

Best regards, Kirill