iGDP can work as a "positive or negative filter" to obtain target ciliate sequences from genomic sequencing data containing various contaminants by integrating homology search, telomere reads-assisted and clustering approaches.
# mmseqs2 (>=v13.45111)
$ conda install -c bioconda mmseqs2
$ conda install -c bioconda bwa
$ conda install -c bioconda samtools
$ conda install -c bioconda metabat2
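
After installing the dependencies, it can help to confirm that each one is visible on your `PATH`. The following is a minimal sketch, not part of iGDP: the helper name `missing_tools` is hypothetical, and the executable names (`mmseqs`, `bwa`, `samtools`, `metabat2`) are assumed to be the ones installed by the conda packages above.

```shell
# Print the name of every given executable that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Executable names are assumptions based on the conda packages listed above.
missing=$(missing_tools mmseqs bwa samtools metabat2)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "All dependencies found."
fi
```

If a tool is reported as missing, check that the conda environment it was installed into is activated.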
* ## iGDP
$ git clone https://github.com/GWang2022/iGDP.git
$ chmod a+x iGDP/scripts/*pl
$ echo "export PATH=$(pwd)/iGDP/scripts:\$PATH" >> ~/.bashrc
$ source ~/.bashrc
# Download NCBI NR protein database using mmseqs
$ mmseqs databases NR NRdb tmpDir
*Tip:* You can create your own database for homology search using the ```mmseqs createdb``` module. For more details, see [mmseqs](https://github.com/soedinglab/MMseqs2).
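
As a rough illustration of the tip above, the sketch below builds a small custom database from a protein FASTA file with `mmseqs createdb`. The file name `my_proteins.fasta`, the database name `myDB`, and the two sequences are placeholders. Because the homology search appears to select hits by taxon (the default is `phylum:Ciliophora`), a custom database would presumably also need taxonomy information (see `mmseqs createtaxdb` in the MMseqs2 documentation), which this sketch does not add.

```shell
# Toy protein FASTA; file name and sequences are placeholders, not iGDP data.
cat > my_proteins.fasta <<'EOF'
>protA
MKTAYIAKQRQISFVKSHFSRQ
>protB
MSDNELLKKAIEEAG
EOF

# Build the sequence database only if mmseqs is installed.
if command -v mmseqs >/dev/null 2>&1; then
    mmseqs createdb my_proteins.fasta myDB
else
    echo "mmseqs not found; skipping createdb"
fi
```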
# Usage
## Workflow
<div align=center>
<img src="https://user-images.githubusercontent.com/107245708/215422762-7d1dac72-a9cc-47d9-a1df-43a38060531e.png" width="600">
</div>
## Run iGDP
* ### Implement homology search program
$ iGDP_homology_search.pl -i
options:
-i
with a capital letter; default: phylum:Ciliophora]
-b [optional]: bin size [contig is cut to -b bp for homology search; default: 1000]
-s [optional]: mmseqs search sensitivity [1.0 faster; 4.0 fast; 7.5 sensitive; default: 5.7]
-t [optional]: number of threads used for mmseqs [default: 72]
-T [optional]: translation table of the target genome [default: 6 for ciliates]
* ### Implement telomere reads-assisted program
$ iGDP_telomere_reads.pl -i
options:
-i
* ### Implement clustering program
$ iGDP_clustering.pl -i
options:
-i
*Tip:* Run `iGDP_clustering.pl` only after both `iGDP_homology_search.pl` and `iGDP_telomere_reads.pl` have finished.
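
One way to enforce this ordering in a script is to check that the two earlier steps have written their contig lists before starting the clustering step. This is a sketch under assumptions: the helper `ready_for_clustering` is hypothetical, and the file names are taken from the example section of this README (output prefixes `homology_search` and `telomere_reads`), so adjust them to the prefixes you passed with `-o`.

```shell
# Succeed only if both files exist and are non-empty.
ready_for_clustering() {
    [ -s "$1" ] && [ -s "$2" ]
}

# File names assumed from the example section of this README.
if ready_for_clustering homology_search.homology.recall.contigs \
                        telomere_reads.telo_reads.recall.contigs; then
    echo "Both recall lists found; the clustering step can be run."
else
    echo "Run the homology search and telomere reads steps first."
fi
```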
# An example of running iGDP
## Positive filtering mode (default)
This mode directly selects ciliate sequences as the target genome.
Please enter the `iGDP/` directory after downloading iGDP and NR protein database.
You will see three files in the `example/` directory:
* The file `assemly.fa.gz` is a contaminated genome assembly.
* The files `read1.fq.gz` and `read2.fq.gz` are paired-end short-read sequencing data for the above genome.
Enter the `example/` directory and implement the following command lines:
$ iGDP_homology_search.pl -i assemly.fa.gz -o homology_search -d {path_to_NR}/NRdb
$ iGDP_telomere_reads.pl -i assemly.fa.gz -o telomere_reads -r1 read1.fq.gz -r2 read2.fq.gz
$ iGDP_clustering.pl -i assemly.fa.gz -o clustering -r1 read1.fq.gz -r2 read2.fq.gz
The following data files will then be created in the `example/` directory:
* The files `homology_search.homology.recall.contigs`, `telomere_reads.telo_reads.recall.contigs` and `clustering.contigs` contain contig IDs obtained by `iGDP_homology_search.pl`, `iGDP_telomere_reads.pl` and `iGDP_clustering.pl` programs, respectively;
* The folders `homology_search/`, `telomere_reads/` and `clustering/` contain intermediate data files generated by the above commands.
* The file `final_genome.fa` is the final genome after contamination removal.
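
To get a quick sense of how much of the assembly was retained, you can compare the number of sequences before and after filtering. The sketch below is not part of iGDP. The helper `count_fasta_records` is hypothetical and simply counts header lines, reading gzip-compressed input through `gzip -dc`.

```shell
# Count FASTA records (lines starting with ">"); .gz files are decompressed on the fly.
count_fasta_records() {
    case "$1" in
        *.gz) gzip -dc "$1" | grep -c '^>' ;;
        *)    grep -c '^>' "$1" ;;
    esac
}

# Report counts for the example input and output, if they are present.
for f in assemly.fa.gz final_genome.fa; do
    if [ -f "$f" ]; then
        echo "$f: $(count_fasta_records "$f") sequences"
    fi
done
```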
## Negative filtering mode
This mode first selects the sequences belonging to non-Ciliophora contaminants and then keeps the rest as the target genome. Compared with positive filtering, the genome obtained in this mode usually has higher completeness but lower precision.
After running `iGDP_homology_search.pl` and `iGDP_telomere_reads.pl` as above, run the following command:
$ iGDP_clustering_negative.pl -i assemly.fa.gz -o clustering_negative -r1 read1.fq.gz -r2 read2.fq.gz
* The file `final_genome.negative.fa` is the final genome after contamination removal.
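
The difference between the two modes can be illustrated with a toy example. This is only a conceptual sketch, not iGDP's actual implementation: the helper `filter_fasta`, the contig names, and the ID lists are all made up. Positive filtering keeps only contigs identified as ciliate, while negative filtering drops only contigs identified as contaminants, so a contig that appears in neither list (`c4` here) is kept only in negative mode. That is one way to see why negative filtering tends toward higher completeness but lower precision.

```shell
# filter_fasta MODE IDS FASTA
#   MODE=keep : print records whose ID is listed in IDS (positive filtering)
#   MODE=drop : print records whose ID is NOT listed in IDS (negative filtering)
filter_fasta() {
    awk -v mode="$1" '
        FILENAME == ARGV[1] { ids[$1] = 1; next }   # first file: load contig IDs
        /^>/ {
            id = substr($1, 2)                       # header word without ">"
            listed = (id in ids)
            show = (mode == "keep") ? listed : !listed
        }
        show { print }
    ' "$2" "$3"
}

# Toy data (hypothetical): four contigs, two lists of contig IDs.
demo=$(mktemp -d)
printf '>c1\nACGT\n>c2\nGGCC\n>c3\nTTAA\n>c4\nCCGG\n' > "$demo/toy.fa"
printf 'c1\nc3\n' > "$demo/ciliate.ids"       # identified as ciliate
printf 'c2\n'     > "$demo/contaminant.ids"   # identified as contaminant

echo "positive filtering:"
filter_fasta keep "$demo/ciliate.ids" "$demo/toy.fa"       # prints c1 and c3
echo "negative filtering:"
filter_fasta drop "$demo/contaminant.ids" "$demo/toy.fa"   # prints c1, c3 and c4
```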
# Update
* 2022/10/14
* integrate the clustering program into iGDP
* add the `-rank` option, allowing users to set the homology search space for the target species.
* 2023/01/25
* add `negative filtering mode` to iGDP. This mode is suitable for genomic data without contamination from other ciliates, such as single-cell sequencing data.