Kari-Genomics-Lab/iDeLUCS

drawing

An interactive deep-learning based tool for clustering genomic sequences.

Installation for the CLI Application:

The recommended installation process is via terminal and it requires pip installed. Follow the steps after making sure you have it installed on your machine.

(Optional) Create and activate a new virtual environment using venv

$ python3 -m venv dev_iDeLUCS
$ source dev_iDeLUCS/bin/activate

or if you are on Windows:

python3 -m venv dev_iDeLUCS
dev_iDeLUCS/Scripts/activate

Clone this reposotory

git clone https://github.com/millanp95/iDeLUCS.git iDeLUCS

Install the package

If you do not plan to use the GUI you can also install the package directly from Pypi:

pip install idelucs

or you can build it from source:

cd iDeLUCS
python setup.py build_ext --inplace
pip install -e .

Test installation
```
idelucs -h  
```

Installation for the GUI:

pip install PyQt5

If you are comfortable with executing the script from the terminal, no further steps are needed. If you do not want to execute the tool form the terminal, you may download the executable files here

Note: iDeLUCS uses PyTorch as the machine learning development framework and its latest stable release might not be compatible with your version of CUDA. The GUI is built using the Qt platform, make sure you have Qt platform plugings installed.

(Optional) Using the GUI on Apple Silicon-based Macs

The current pip distribution of PyQt5 does not support Apple Silicon. However, to extend the functionality of iDeLCUS GUI to MAC users (Thanks to the works of Shane Ding), it is possible to use the brew distribution of PyQt5 and then point to that distribution from within the virtualenv.

Ensure that homebrew is installed correctly.
brew install PyQt5
Run brew --prefix PyQt5 to see where PyQt5 is installed.
Append /lib/python3.9/site-packages/PyQt5 to the end of the string returned from the previous command.
Navigate to the root iDeLUCS folder.
Establish a symbolic link between the brew installed PyQt5 package and the virtual environment's site-packages folder.
- Example: ln -s /opt/homebrew/opt/pyqt@5/lib/python3.9/site-packages/PyQt5 /Users/shaneding/Desktop/iDeLUCS/dev_iDeLUCS/lib/python3.10/site-packages

If the above does not work, try seeing what different versions of python are inside /opt/homebrew/opt/pyqt@5/lib and try linking to a different version.

Running iDeLUCS

We provide several datasets to test the perfomance of iDeLUCS at finding meaningful genomic signatures for organisms in different kingdoms and at different taxonomic levels. But the process is similar on the user's own data.

Clustering a dataset with grouth truth:

idelucs --sequence_file=Example/Vertebrata.fas --GT_file=Example/Vertebrata_GT.tsv --n_epochs=35 --lambda=2.8 --k=6 --n_clusters=5 --n_mimics=3 --batch_sz=512

Clustering an unkown dataset:

idelucs --sequence_file=Example/Actinopterygii.fas --n_epochs=30 --k=6 --n_clusters=3 --n_mimics=8 --batch_sz=256

Running iDeLUCS GUI from terminal:

cd iDeLCUS/GUI
python GUI_iDeLUCS.py

Clustering parameters

iDeLUCS assigns a cluster identifier to all the DNA sequences present in a sigle FASTA file. The path to this file must be provided as input in both the CLI and the GUI versions of iDeLUCS. There are several hyperparameters that are required to perform the clustering. The user may use the default values or select a specific one depending on the amount of information that is available about the dataset.

Argument Name	Variable Type	Description	Options
	string	Path for single Fasta file with training sequences	Required to run the program.
--GT_file	string	Path for a tab-separated file with possible labels for the training dataset. These labels won't be used during training, just for posthoc analysis. The GT file must contain the columns: `sequence_id` with the sequence identifiers, these identifiers must correspond to the identifiers provided in the `<sequence_file>`. The column `cluster_id` with the "ground truth" assignments for each sequence. Additionally, the user may include an additional column with alternative labeling under the header `phylogeny`	Default = None.
--k	integer	k-mer length	Default = 6, Options:`[4,5,6]`
--n_clusters	integer	Expected or maximum number of clusters to find. It should be at least `n_true_clusters` when `GT_file` is provided'.Use 0 to launch iDeLCUS + HDBSCAn to automatically find the number of clusters in fine-grained clustering.	Default = 5 ; Range: 0, 2-199.
--n_epochs	integer	Number of training epochs. An epoch is defined as a training iteration over all the training pairs.'	Default: `50` ; Recommended Range: `[50,150]`
--n_mimics	integer	Number of data augmentations per sequence that will be considered during training	Default = 50; Recommended Range: 50-150
--batch_sz	integer	Number of data pairs the network will receive simultaneously during training. A larger batch may improve convergence but it may harm the accuracy. It must be a multiple of `n_mimcs`.	Default: 3; Recommended Range: 0-600. Note: This value might be limited by the capacity of your machine.
--optimizer	string	Optimization algorithm to train the neural network	`['SGD', 'Adam','RMSprop']`
--lambda	float	Hyperparameter to control cluster balance.	Default = 2.8; Recommended Range 1 (highly imbalanced dataset) - 3 (perfectly balanced dataset)
--scheduler	string	Learning rate schedule used to train the neural network	`['None', 'Plateau','Triangle']`