Genetalks / gtz

A high-performance, high-compression-ratio compressor for genomic data, powered by GTXLab of Genetalks.

GTX.Zip Professional Version (Latest Version GTZ 4.x)

Please see the GTX.Zip website, where you can download the latest version of GTX.Zip, read the user manual, ask questions, and receive technical support. This GitHub repository is no longer maintained.

Chinese README.

QQ group(s): 934492381

GTX.Zip QQ groups

WeChat group(s):

GTX.Zip WeChat groups

Powered by GTXLab of Genetalks.

Product Series

| Product | Version | Description | How to Get |
| --- | --- | --- | --- |
| GTX.Zip | V4.x | Companies, institutions, and individual users with large local sequencing data | Download |

The following are outdated


What is GTX.Zip?

GTX.Zip (GTZ for short) is a high-performance lossless compression tool for arbitrary files. It achieves a particularly high compression rate on genomic data: it can compress a FASTQ file to about 2% of its original size (roughly 1/6 to 1/8 the size of the equivalent fastq.gz) at a speed of 1100 MB/s. GTX.Zip can also recompress fastq.gz files directly.
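To make the quoted ratios concrete, here is a small arithmetic sketch. It only restates the figures claimed above (about 2% of the raw FASTQ, with fastq.gz about 6 to 8 times larger than the .gtz file); the function name `estimate_sizes` is hypothetical, and real results will depend on the data.

```shell
# Rough size estimate from the ratios quoted in this README.
# These are the README's claims, not measurements.
estimate_sizes() {
    orig_mb=$1
    gtz_mb=$(( orig_mb * 2 / 100 ))      # about 2% of the raw FASTQ
    gz_low_mb=$(( gtz_mb * 6 ))          # fastq.gz is about 6x the .gtz size ...
    gz_high_mb=$(( gtz_mb * 8 ))         # ... up to about 8x
    echo "raw=${orig_mb}MB gtz~${gtz_mb}MB fastq.gz~${gz_low_mb}-${gz_high_mb}MB"
}

estimate_sizes 100000
```

For a 100000 MB FASTQ this prints `raw=100000MB gtz~2000MB fastq.gz~12000-16000MB`.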



Product Series

| Product | Version | Description | How to Get |
| --- | --- | --- | --- |
| GTX.Zip Professional | V3.0.0 | Companies, institutions, and individual users with large local sequencing data | Install |
| GTX.Zip Enterprise | V1.0.1 | Large-scale enterprises and data centers with PB-level sequencing data that require distributed compression on their own computing clusters | Contact Us |
| GTX.Zip Cloud | V1.0.1 | Companies that distribute and store large amounts of sequencing data in the cloud | http://gtz.io |



Supported Bioinformatics Analysis Software



Features

GTX.Zip compressor system features:



System Environment Requirements

How to Install (Linux)

Download the installation package from:
www.gtxlab.com

Then run the following command. If it prints the software version information, the installation was successful:

gtz --version
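If the command is not found at all, the installation directory is probably not on PATH. A minimal sketch of that check follows; `have_cmd` is a hypothetical helper name, and only `gtz --version` comes from this README.

```shell
# Return success if the named command is found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd gtz; then
    gtz --version
else
    echo "gtz not found on PATH; see How to Install" >&2
fi
```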



Quick Start (Linux)

GTX.Zip Professional must be installed on the current machine. If it is not, see -How to Install-.

1. Download the sample file to be compressed
Sample download: -sample.fq-

* A 2 GB FASTQ file extracted from real WES data produced on a NovaSeq

Reference genome download: -GCF_000001405.37_GRCh38.p11_genomic.fna.gz-

2. Start compression

gtz sample.fq --ref GCF_000001405.37_GRCh38.p11_genomic.fna.gz

* gtz can also compress fastq.gz files directly

3. Decompress

gtz -d sample.fq.gtz

How to use

Command navigation:

High compression rate with a FASTA; decompression does not need the FASTA (recommended)

Higher compression rate; decompression requires exactly the same FASTA that was used for compression (Note: you and your client must store the FASTA file safely for future decompression)

Compress BAM; decompression does not need the FASTA (recommended)

Decompression requires exactly the same FASTA that was used for compression (Note: you and your client must store the FASTA file safely for future decompression)

Lower compression rate than the above, but can compress arbitrary files

Usage example:

1. Compressing fastq/fastq.gz (high compression)

1.1 Default compression mode for fastq

gtz /data/nova.fastq.gz --ref /fasta/genomic.fna(.gz)

The --ref parameter specifies the reference genome FASTA file for the species that nova.fastq.gz comes from; the FASTA file may also be in gz format. Note: after compression, the FASTA file is no longer needed for decompression.

1.2 Default compression mode for bam

gtz /data/nova.bam --ref /fasta/genomic.fna(.gz)

The --ref parameter specifies the reference genome FASTA file for the species that nova.bam comes from, and it is required here. After compression, the FASTA file is no longer needed for decompression.

1.3 Specify the output file name

gtz /data/nova.fastq.gz --ref /fasta/genomic.fna -o /out/nova.gtz

The -o parameter specifies the output file name. Note that this is a lowercase letter o.

1.4 Verify by decompressing after compression

gtz /data/nova.fastq.gz --ref /fasta/genomic.fna --verify

After compression completes, GTX.Zip decompresses the result and checks the data's MD5 to ensure that the data can be fully restored. The --verify parameter is recommended when the compressed file is used for archiving; it is not necessary in routine use.
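An independent round-trip check can also be done by hand. The sketch below is only an illustration under assumptions: `same_content` is a hypothetical helper, the paths are placeholders, and the name of the restored file is assumed rather than documented. Only the -o, -d, and -O options are taken from this README.

```shell
# Return success when two files are byte-for-byte identical.
same_content() {
    cmp -s "$1" "$2"
}

# Hypothetical usage once gtz is installed (paths and the restored
# file name are assumptions):
#   gtz /data/nova.fastq --ref /fasta/genomic.fna -o /out/nova.gtz
#   gtz -d /out/nova.gtz -O /tmp/restore/
#   same_content /data/nova.fastq /tmp/restore/nova.fastq && echo "round trip OK"
```

The comparison is against an uncompressed FASTQ, since a gzip file rebuilt from the same content is not guaranteed to be byte-identical to the original fastq.gz.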

1.5 Fast compression

gtz /data/nova.fastq.gz --ref /fasta/genomic.fna -l 1

The -l parameter specifies the compression level. Levels 1-5 select fast compression mode; all five currently use the same algorithm, and the separate values are reserved for future expansion. Levels 6-9 likewise share one algorithm, which gives the highest compression rate, and are also reserved for future expansion. The default level is 6.
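The level ranges described above can be summarized as a small lookup. This is a sketch of the documented mapping only; `gtz_level_mode` is a hypothetical helper, not part of gtz.

```shell
# Map a -l level to the mode described in this README:
# 1-5 -> fast, 6-9 -> best; anything else is reported as invalid.
gtz_level_mode() {
    case "$1" in
        [1-5]) echo fast ;;
        [6-9]) echo best ;;
        *)     echo invalid; return 1 ;;
    esac
}
```

For example, `gtz_level_mode 1` prints `fast` and `gtz_level_mode 6` (the default level) prints `best`.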

1.6 Limit compression threads

gtz /data/nova.fastq.gz --ref /fasta/genomic.fna -p 4

The -p parameter specifies the number of threads used for compression. Here, -p 4 means that only 4 threads are used for the entire compression process, which is useful when computing resources are limited.
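When several files share one reference, a fixed thread limit can be applied in a loop. The following is a minimal sketch: `gtz_batch` and the `GTZ_DRY_RUN` variable are hypothetical names introduced here, and by default the function only prints the commands it would run.

```shell
# Print (dry run, the default) or execute one gtz command per input file.
# Set GTZ_DRY_RUN=0 to actually run gtz.
gtz_batch() {
    ref=$1
    threads=$2
    shift 2
    for f in "$@"; do
        if [ "${GTZ_DRY_RUN:-1}" = 1 ]; then
            # Dry run: show the command for inspection only.
            echo "gtz $f --ref $ref -p $threads"
        else
            gtz "$f" --ref "$ref" -p "$threads" || return 1
        fi
    done
}

gtz_batch /fasta/genomic.fna 4 /data/a.fastq.gz /data/b.fastq.gz
```

Files are processed one after another, so the total thread use stays at the value passed to -p.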

1.7 Modify the default cache path

gtz /data/nova.fastq.gz --ref /fasta/genomic.fna --cache-path /path/cache/

When --ref is used to specify a FASTA, GTZ converts the FASTA to a corresponding binary file and caches it in the default path (/home/user/.config/gtz). The next time the same FASTA is specified for compression, GTZ reads the data directly from the cache, which is faster. Use --cache-path to change the cache location if needed (for example, when /home/user does not have enough space).
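Whether the home directory has enough room can be checked before deciding to pass --cache-path. The sketch below makes several assumptions: `free_kb` and `cache_path_args` are hypothetical helpers, and the space threshold is yours to choose, since this README does not state how large the cached binary file is.

```shell
# Print the free space, in KiB, of the filesystem that holds a directory.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Print "--cache-path DIR" when the home filesystem has less than min_kb
# free; print nothing otherwise. Both arguments are chosen by the caller.
cache_path_args() {
    alt_dir=$1
    min_kb=$2
    avail=$(free_kb "${HOME:-/}")
    if [ "${avail:-0}" -lt "$min_kb" ]; then
        echo "--cache-path $alt_dir"
    fi
}
```

The printed text can then be appended to a compression command, for example `gtz /data/nova.fastq.gz --ref /fasta/genomic.fna $(cache_path_args /path/cache/ 10485760)`, where the 10 GiB threshold is an arbitrary example value.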

1.8 Do not package fasta files

gtz /data/nova.fastq.gz --ref /fasta/genomic.fna --donot-pack-ref

With the --donot-pack-ref option, the resulting gtz file is smaller, but the corresponding FASTA must be specified when decompressing. We do not recommend this option: it improves the compression ratio only slightly, and in exchange you must specify the matching FASTA when decompressing and make sure that FASTA is stored safely.
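If --donot-pack-ref is used anyway, it helps to record which FASTA was used so that the same file can be identified at decompression time. The sketch below is one possible approach, not a gtz feature: `record_ref` and `check_ref` are hypothetical helpers that store a POSIX `cksum` value in a sidecar file.

```shell
# Write the checksum and byte count of the reference FASTA to a sidecar file.
record_ref() {
    cksum < "$1" > "$2"
}

# Succeed only if the FASTA still matches the recorded checksum.
check_ref() {
    [ "$(cksum < "$1")" = "$(cat "$2")" ]
}

# Hypothetical usage (paths are placeholders):
#   record_ref /fasta/genomic.fna /out/nova.gtz.ref.cksum
#   check_ref  /fasta/genomic.fna /out/nova.gtz.ref.cksum || echo "FASTA differs" >&2
```

`cksum` is a CRC and guards only against accidental changes; a cryptographic hash tool can be substituted if one is available.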

2. Ordinary compression

2.1 Compressing fastq/fastq.gz

gtz /data/nova.fastq.gz

When --ref is not used to specify a FASTA, GTZ compresses the FASTQ file in ordinary mode. In most cases, the compression rate of ordinary compression is much lower than that of high compression.

2.2 Compression of non-fastq/fastq.gz files

gtz /data/test.bam

GTZ can compress any file

3. Decompression

3.1 Decompress a fastq.gtz file that was compressed with a FASTA, without supplying the FASTA

gtz -d nova.fastq.gtz

By default, decompressing gtz files that were compressed with a FASTA does not require the FASTA file. This is a feature of gtz 2.x.x.

3.2 Decompress a bam.gtz file that was compressed with a FASTA, without supplying the FASTA

gtz -d nova.bam.gtz

gtz -d nova.bam.gtz --bam-to-sam

By default, bam.gtz is decompressed to BAM. To decompress to SAM instead, add the --bam-to-sam parameter.

3.3 Decompress a gtz file compressed with a FASTA when the FASTA must be specified

gtz -d nova.fastq.gtz --ref /fasta/genomic.fna(.gz)

If the file was compressed with a FASTA and the --donot-pack-ref parameter, you must specify the same FASTA when decompressing.

3.3 Decompress to terminal

gtz -d nova.fastq.gtz -c 2>/dev/null

The -c parameter writes the decompressed output to the terminal (standard output). 2>/dev/null discards other messages so that only the decompressed FASTQ content is printed.
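Because the output goes to standard output, it can be piped into other tools without writing the FASTQ to disk. As a minimal sketch, the hypothetical helper `count_fastq_reads` below counts reads in a stream, assuming the common layout of exactly four lines per record.

```shell
# Count reads in a FASTQ stream on stdin, assuming 4 lines per record.
count_fastq_reads() {
    awk 'END { print int(NR / 4) }'
}

# Usage with gtz installed:
#   gtz -d nova.fastq.gtz -c 2>/dev/null | count_fastq_reads
```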

3.4 Decompress to a specified directory

gtz -d nova.fastq.gtz -O /path/out/

The -O parameter specifies the output directory for the decompressed file. Note that this is a capital letter O.

3.5 Limit decompression threads

gtz -d nova.fastq.gtz -p 4

The -p parameter also applies to decompression; here, -p 4 means that only four threads are used for decompression.

3.6 Decompress gtz files from older versions that were compressed with a bin file

gtz -d nova.fastq.old.gtz -r Homo_sapiens_bcacac9064331276504f27c6cf40e580.rbin

The -r parameter is for compatibility with GTZ versions lower than 2.0.0. Version 2.0 can decompress any GTZ file produced by an older version. If the older version compressed the FASTQ using the -b parameter to specify a bin file, version 2.0.0 can decompress that file when -r specifies the corresponding RBIN file.

Parameter description:

--ref <string>
    @ compress : specifies Fasta corresponding to the compressed file,
    which results in a higher compression rate

    @ decompress : if compression uses Fasta and parameter
    --donot-pack-ref is used, the corresponding Fasta needs to be
    specified by this parameter when decompressing

--bam-to-sam
    @ compress : do not use

    @ decompress : decompress BAM to SAM. Valid only for BAM;
    without this option, BAM is decompressed to BAM

-z,  --fastq-to-fastq-gz
    @ compress : do not use

    @ decompress : decompress fastq to fastq.gz, it's valid only for
    FASTQ

--cache-path <string>
    @ compress : when Fasta is specified by --ref, GTZ converts the Fasta
    to the corresponding binary file and then caches it to the default
    path, so that when the same Fasta is specified for compression, GTZ
    can read data directly from the cache path, which is relatively fast.
    The default cache path is /home/user/.config/gtz; you can use this
    parameter to change it

    @ decompress : same as compress

--donot-pack-ref
    @ compress : this option is not recommended. By default, when
    compression uses Fasta, GTZ extracts data from the Fasta and packs
    it into the GTZ file, so that the resulting GTZ file no longer
    requires the Fasta when decompressed. Using this option has little
    effect on the compression rate, but if you use it, you need to
    specify the corresponding Fasta when decompressing

    @ decompress : do not use

--verify
    @ compress : after data compression, decompress the generated GTZ file
    again to ensure that it can be decompressed correctly.
    Usually it's not necessary, but if the file is used for archiving,
    you can use this parameter.

    @ decompress : do not use

-l <number>,  --level <number>
    @ compress : [1-5] is fast compress mode; at present, the compression
    algorithm for 1-5 is the same, and the range is reserved for later
    expansion. 6 is the default. [6-9] is best compress level; the
    compression algorithm is also the same, reserved for later expansion

    @ decompress : when a gtz in FASTQ format is decompressed to
    fastq.gz, the compression level of gz can be changed by -l/--level,
    ranging from 0-9; the default level is 4

-r <string>,  --rbin-path <string>
    @ compress : do not use

    @ decompress : use only for files from versions earlier than 2.0.0,
    mainly for compatibility with older versions. When compression
    specified a BIN file, you can use this parameter to specify the
    corresponding RBIN file, or the directory in which the RBIN file is
    located, to decompress.

-O <string>,  --out-dir <string>
    @ compress : do not use

    @ decompress : specify the save directory of decompression file

-f,  --force
    @ compress : force overwrite of output file

    @ decompress : same as compress

-c,  --stdout
    @ compress : do not use

    @ decompress : decompression output to terminal

-d,  --decompress
    @ compress : do not use

    @ decompress : specify the GTZ file to decompress

-p <number>,  --parallel <number>
    @ compress : specify parallel number, default is CPU logical cores

    @ decompress : same as compress

-o <string>,  --out <string>
    @ compress : specify output GTZ file name                 

    @ decompress : do not use

-e,  --no-keep
    @ compress : don't keep input files                 

    @ decompress : do not use

--version
    Displays version information and exits.

-h,  --help
    Displays usage information and exits.

<file name>
    input file name

GTX Lab Compressor
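Several options above are marked "do not use" for one of the two modes. A wrapper script can guard against mixing them up. The sketch below encodes only those markings: `gtz_flag_allowed` is a hypothetical helper, and it does not check whether an option exists at all.

```shell
# Return failure when the parameter description marks the option as
# "do not use" for the given mode ("compress" or "decompress").
gtz_flag_allowed() {
    mode=$1
    flag=$2
    case "$mode:$flag" in
        compress:--bam-to-sam|compress:-z|compress:--fastq-to-fastq-gz) return 1 ;;
        compress:-r|compress:--rbin-path|compress:-O|compress:--out-dir) return 1 ;;
        compress:-c|compress:--stdout|compress:-d|compress:--decompress) return 1 ;;
        decompress:--donot-pack-ref|decompress:--verify) return 1 ;;
        decompress:-o|decompress:--out|decompress:-e|decompress:--no-keep) return 1 ;;
        *) return 0 ;;
    esac
}
```

For example, `gtz_flag_allowed decompress --verify` fails, while `gtz_flag_allowed compress --verify` succeeds.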


Ecosystem Software

1. BWA for GTZ (supports gtz 2.x.x)

2. BCL2FASTQ for GTZ (supports gtz 3.0.1)

3. STAR for GTZ (supports gtz versions < 2.x.x)

The STAR release from its official website directly supports the GTZ format once GTZ and STAR are installed.

4. BOWTIE for GTZ (supports gtz 2.x.x)

5. BOWTIE2 for GTZ (supports gtz 2.x.x)

6. TOPHAT for GTZ (supports gtz versions < 2.x.x)

7. HISAT2 for GTZ (supports gtz versions < 2.x.x)

8. MEGAHIT for GTZ (supports gtz versions < 2.x.x)

9. FASTQC for GTZ (supports gtz versions < 2.x.x)

10. FASTP for GTZ (supports gtz 2.x.x)

11. MINIMAP2 for GTZ (supports gtz versions < 2.x.x)

12. WTDBG2 for GTZ (supports gtz versions < 2.x.x)

13. BWA-MEM2 for GTZ (supports gtz 3.0.1)



Change Log

Current latest version: gtz-3.0.1 [2021/04/06]

Historical versions: -Change Log-

FAQ

Frequently Asked Questions are intended to help newcomers to understand how we work! -Click here!-

Contact Us

If you have any questions, feel free to contact us at contact@gtz.io or create a new GitHub issue.



License

See LICENSE for details.