An integrated Genome Decontamination Pipeline for wild ciliated microeukaryotes
iGDP v1.1.0

iGDP works as a positive or negative filter to recover target ciliate sequences from genomic sequencing data that contain various contaminants. It integrates three approaches: homology search, telomere read-assisted selection, and clustering.


bwa (>=v0.7.17)

$ conda install -c bioconda bwa

samtools (>=v1.7)

$ conda install -c bioconda samtools

metabat2 (>=v2.12.1)

$ conda install -c bioconda metabat2
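
Before installing iGDP itself, it can help to confirm that the dependencies are reachable from the shell. The sketch below is not part of iGDP: `missing_tools` is a hypothetical helper that prints every command name it cannot find on `PATH`. It assumes the executables carry their usual Bioconda names.

```shell
# Hypothetical helper (not shipped with iGDP): print each command name
# from the argument list that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

For example, `missing_tools bwa samtools metabat2 mmseqs` should print nothing when all four programs are installed, and otherwise lists the ones that still need installing.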

* ## iGDP 

$ git clone

Give executable permission to all scripts in the iGDP scripts directory:

$ chmod a+x iGDP/scripts/*pl

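Add the iGDP scripts directory to your PATH environment variable. Run this from the directory that contains `iGDP/`, so that the absolute path is written to `~/.bashrc`:

$ echo 'export PATH='"$(pwd)"'/iGDP/scripts/:$PATH' >> ~/.bashrc

$ source ~/.bashrc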

# Download NCBI NR protein database using mmseqs

Usage: mmseqs databases [options]

Download the NR database into your working directory under the prefix `NRdb` with the following command:

$ mmseqs databases NR NRdb tmpDir

*Tip:* You can create your own database for homology search using the ```mmseqs createdb``` module. For more details, see [mmseqs](
# Usage
## Workflow
## Run iGDP
* ### Implement homology search program

$ -i -o -d [options]

options:

* `-i` : input assembled contigs [.gz or uncompressed]
* `-o` : output directory [e.g. homology_search]
* `-d` : database for mmseqs search
* `-rank` [optional]: target taxonomic space of homology search [format, rank:taxon; rank must be phylum/class/order/family/genus/species and taxon begins with a capital letter; default: phylum:Ciliophora]
* `-b` [optional]: bin size [contig is cut to -b bp for homology search; default: 1000]
* `-s` [optional]: mmseqs search sensitivity [1.0 faster; 4.0 fast; 7.5 sensitive; default: 5.7]
* `-t` [optional]: number of threads used for mmseqs [default: 72]
* `-T` [optional]: translation table of the target genome [default: 6 for ciliates]
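
The `-rank` format described above (rank:taxon, with one of six allowed ranks and a capitalized taxon) can be checked before a long search is started. The function below is a sketch under that stated format only. `valid_rank` is a hypothetical name, and the validation that iGDP performs internally may differ.

```shell
# Hypothetical pre-check for a -rank value (not part of iGDP).
# Succeeds only for "rank:Taxon" where rank is one of the six allowed
# ranks and the taxon starts with an uppercase letter.
valid_rank() {
    case "$1" in
        *:*) ;;                 # must contain a colon
        *) return 1 ;;
    esac
    rank=${1%%:*}               # text before the first colon
    taxon=${1#*:}               # text after the first colon
    case "$rank" in
        phylum|class|order|family|genus|species) ;;
        *) return 1 ;;
    esac
    case "$taxon" in
        [[:upper:]]*) return 0 ;;
        *) return 1 ;;
    esac
}
```

With this check, `valid_rank phylum:Ciliophora` succeeds, while `valid_rank phylum:ciliophora` and `valid_rank kingdom:Ciliophora` both fail.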

* ### Implement telomere reads-assisted program

$ -i -o -r1 -r2 [options]

options:

* `-i` : input assembled contigs [.gz or uncompressed]
* `-o` : output directory [e.g. telomere_reads]
* `-r1` : read1 input file name [.gz or uncompressed]
* `-r2` : read2 input file name [.gz or uncompressed]
* `-u` [optional]: 5' to 3' telomeric repeat unit of the target genome [default: CCCCAA for Tetrahymena species]
* `-b` [optional]: threads for bwa mem [default: 8]
* `-s` [optional]: threads for samtools view [default: 8]
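
To get a feel for what the `-u` repeat unit means in practice, the following sketch counts reads that carry telomeric repeats on either strand. It is an illustration only and not how iGDP selects reads: the helper names `revcomp` and `count_telomere_reads` are hypothetical, and the threshold of three tandem copies is an assumption chosen for the example. It also assumes four-line FASTQ records and a `gzip` that passes uncompressed input through unchanged with `-cdf`.

```shell
# Hypothetical helpers (not part of iGDP).

# Reverse-complement a DNA string, e.g. CCCCAA -> TTGGGG.
revcomp() {
    printf '%s\n' "$1" |
        awk '{ out = ""; for (i = length($0); i >= 1; i--) out = out substr($0, i, 1); print out }' |
        tr 'ACGTacgt' 'TGCAtgca'
}

# Count reads in a FASTQ file (plain or .gz) whose sequence line contains
# three tandem copies of the repeat unit, on either strand.
# Usage: count_telomere_reads reads.fq.gz CCCCAA
count_telomere_reads() {
    fwd="$2$2$2"                      # three tandem copies (assumed threshold)
    rc=$(revcomp "$fwd")              # same repeat as seen on the other strand
    gzip -cdf < "$1" |
        awk -v fwd="$fwd" -v rc="$rc" \
            'NR % 4 == 2 && (index($0, fwd) || index($0, rc)) { n++ } END { print n + 0 }'
}
```

For instance, `count_telomere_reads read1.fq.gz CCCCAA` reports how many reads in the example data contain the default Tetrahymena repeat, which gives a rough idea of whether telomere-bearing reads are present at all.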

* ### Implement clustering program

$ -i -o -r1 -r2 [options]

options:

* `-i` : input assembled contigs [.gz or uncompressed]
* `-o` : output directory [e.g. clustering]
* `-r1` : read1 input file name [.gz or uncompressed]
* `-r2` : read2 input file name [.gz or uncompressed]
* `-b` [optional]: threads for bwa mem [default: 8]
* `-s` [optional]: threads for samtools view [default: 8]

*Tip:* Running `` must be after implementing `` and `` programs.

# An example of running iGDP
## Positive filtering mode (default)
This mode directly selects ciliate sequences as the target genome.

After downloading iGDP and the NR protein database, enter the `iGDP/` directory.
You will see three files in the `example/` directory:

* The file `assemly.fa.gz` is a contaminated genome assembly.  
* The files `read1.fq.gz` and `read2.fq.gz` are paired-end short-read sequencing data for the above genome.

Enter the `example/` directory and run the following commands:

$ -i assemly.fa.gz -o homology_search -d {path_to_NR}/NRdb

$ -i assemly.fa.gz -o telomere_reads -r1 read1.fq.gz -r2 read2.fq.gz

$ -i assemly.fa.gz -o clustering -r1 read1.fq.gz -r2 read2.fq.gz

The following data files will then be created in the `example/` directory:
* The files `homology_search.homology.recall.contigs`, `telomere_reads.telo_reads.recall.contigs` and `clustering.contigs` contain the contig IDs obtained by the ``, `` and `` programs, respectively;

* The folders `homology_search/`, `telomere_reads/` and `clustering/` contain intermediate data files generated by the above commands.

* The file `final_genome.fa` is the final genome after contamination removal.
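
A quick sanity check on the result is to compare the number of contigs and the total length before and after decontamination. The function below is a minimal sketch for that purpose. `fasta_stats` is a hypothetical helper, not an iGDP script, and it expects an uncompressed FASTA file.

```shell
# Hypothetical helper (not part of iGDP): print "<contigs><TAB><total bp>"
# for an uncompressed FASTA file.
fasta_stats() {
    awk '
        /^>/ { n++; next }                               # header line: one more contig
        { gsub(/[ \t\r]/, ""); bp += length($0) }        # sequence line: add its length
        END { printf "%d\t%d\n", n, bp }
    ' "$1"
}
```

Running `fasta_stats final_genome.fa` and comparing the output with `gzip -cd assemly.fa.gz > assembly_tmp.fa; fasta_stats assembly_tmp.fa` (the temporary file name is arbitrary) shows how much of the assembly was removed as contamination.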

## Negative filtering mode
This mode first selects the sequences of all non-Ciliophora contaminants and then keeps the rest as the target genome. Compared with positive filtering, the genome obtained in this mode usually has higher completeness but lower precision.

After running `` and `` as above, run the following command:

$ -i assemly.fa.gz -o clustering_negative -r1 read1.fq.gz -r2 read2.fq.gz

* The file `final_genome.negative.fa` is the final genome after contamination removal.
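
The set logic behind negative filtering can be illustrated on plain contig ID lists: everything that is not flagged as a contaminant is kept. This is only a sketch of the idea and not iGDP's implementation. `keep_non_contaminants` and both input files are hypothetical, and each file is assumed to hold one contig ID per line.

```shell
# Hypothetical helper (not part of iGDP).
# $1: file with all contig IDs, $2: file with contaminant contig IDs.
# Prints the IDs from $1 that do not appear in $2, in their original order.
keep_non_contaminants() {
    # -F: fixed strings, -x: match whole lines only, -v: invert the match,
    # -f: read the patterns (contaminant IDs) from a file
    grep -vxF -f "$2" "$1"
}
```

Whole-line matching matters here: without `-x`, a contaminant ID such as `c1` would also remove `c10`. Positive filtering is the mirror image, keeping only the IDs that appear in the selected set.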

# Update
* 2022/10/14
   * integrate the clustering program into iGDP
   * add `-rank` option allowing user to set the homology search space for the target species.
* 2023/01/25
   * add `negative filtering mode` into iGDP. This mode is suitable for genomic data without contamination from other ciliates, such as single-cell sequencing data.