Isling

Isling is a tool for detecting viral or vector integration in paired-end reads. Please read our paper for full details.

Quickstart

If you have conda and snakemake installed, you can run isling locally on the included test data:

git clone https://github.com/aehrc/isling.git && cd isling
snakemake --configfile test/config/test.yml --cores <cores> --use-conda

If you have snakemake and singularity installed, you can instead use:

snakemake --configfile test/config/test.yml --cores <cores> --use-singularity

Alternatively, if you have Docker installed, on macOS you can run:

docker run --rm -it -v"$(pwd)"/out:/opt/isling/out szsctt/isling:latest snakemake --configfile test/config/test.yml --cores 1

This will use the config file and data inside the container, and the results will appear in a folder called out in your current working directory. On Linux, you will need to run this command as root, and on Windows you will need to adjust the bind-mount syntax (-v argument).

The input data (reads, and host and viral references) are specified in a config file. To run isling on your own data, modify the example config file (test/config/test.yml) so that it points to your files. See configfile.md for more information about the format of the config file.

Overview

The pipeline performs several steps in order to identify integration sites. It takes as input datasets consisting of either fastq files or bam files. It can optionally pre-process the reads (merging overlapping reads), and then aligns them to both a host and a viral sequence. Reads are first aligned to the viral sequence(s); the reads that align are then extracted and aligned to the host. These two sets of alignments are used to identify viral integrations.
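The order of the two alignment steps can be illustrated with a toy sketch. This is not isling's implementation: the hypothetical `shares_kmer` function stands in for a real aligner by checking for a shared exact k-mer, and all sequences and names are made up for illustration.

```python
def shares_kmer(read, reference, k=8):
    """Toy stand-in for alignment: True if read and reference share an exact k-mer."""
    return any(read[i:i + k] in reference for i in range(len(read) - k + 1))


def two_stage_filter(reads, virus, host, k=8):
    """Keep reads that hit the virus first, then keep those that also hit the host."""
    # Stage 1: "align" all reads to the viral reference and extract the hits.
    viral_reads = {name: seq for name, seq in reads.items()
                   if shares_kmer(seq, virus, k)}
    # Stage 2: "align" only the extracted reads to the host reference.
    return sorted(name for name, seq in viral_reads.items()
                  if shares_kmer(seq, host, k))


# Made-up references and reads (illustrative only).
host = "GGGGCCCCAAAATTTTGGCCAATT"
virus = "ACGTACGTAGCTAGCTAGGA"
reads = {
    "chimeric": host[0:10] + virus[0:10],  # part host, part virus
    "host_only": host[4:20],
    "virus_only": virus[2:18],
}
candidates = two_stage_filter(reads, virus, host)
```

Only the read containing both host and viral sequence survives both stages. Aligning to the (small) viral reference first means the host alignment only has to handle the subset of reads that could contain an integration.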

Dependencies

Isling requires snakemake and either singularity (recommended) or conda to supply dependencies. Additionally, python version 3.5 or above and pandas are required (these should be installed automatically if you install snakemake with conda).

Alternatively, use the Docker version, which contains isling and all dependencies.

Inputs

Isling requires host and viral/vector references and reads as input. All of these inputs are specified in a config file. Isling currently only works for paired-end reads.

See the file configfile.md for a description of the format of this config file.

Outputs

Isling outputs integration sites in a tab-separated format in the output directory specified in the config file.
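Because the output is tab-separated, it can be loaded with standard tools. The following is a minimal sketch using Python's built-in csv module. It assumes the file starts with a header row, and `read_integrations` is a hypothetical helper name, not part of isling.

```python
import csv


def read_integrations(path):
    """Read a tab-separated file into a list of dicts keyed by column name.

    Assumes the first line of the file is a header row.
    """
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

All values are returned as strings, so coordinate columns need to be converted (for example with `int`) before doing arithmetic on them.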

Files

Within the output folder, one folder is created for each dataset in the config file, and for each dataset, integrations can be found in the ints directory. There will be one set of output files for each sample.
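The layout described above can be walked programmatically. This sketch assumes only what is stated here (one folder per dataset, each with an ints directory); it makes no assumption about file names inside ints, and `find_integration_files` is a hypothetical helper name.

```python
from pathlib import Path


def find_integration_files(out_dir):
    """Map each dataset folder name to the files in its `ints` directory."""
    found = {}
    for dataset_dir in sorted(Path(out_dir).iterdir()):
        ints_dir = dataset_dir / "ints"
        # Skip anything in the output folder that has no `ints` directory.
        if ints_dir.is_dir():
            found[dataset_dir.name] = sorted(
                p for p in ints_dir.iterdir() if p.is_file()
            )
    return found
```

For example, `find_integration_files("out")` would list the files for every dataset after the Docker quickstart, assuming the run completed and wrote to out.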

For each sample, there are a number of output files, which may be of interest for particular use-cases.

Columns

The output files give the location of the identified integrations, and their properties. Coordinates for integration junctions are specified in terms of their ambiguous bases. That is, there is often a gap or overlap between the host and viral portions of a read.

Since the location of the integration cannot be uniquely determined in this case, isling outputs the coordinates of these bases in the host and vector/virus genome as the location of the integration.
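To make the idea of ambiguous bases concrete, the sketch below classifies a junction from the read coordinates of the host and viral portions. It is an illustration, not isling's code: it assumes 0-based, half-open read coordinates and a read whose host portion lies to the left of the viral portion, and `junction_ambiguity` is a hypothetical name.

```python
def junction_ambiguity(host_end, virus_start):
    """Classify a host/virus junction within a read.

    host_end:    read position where the host-aligned portion ends (exclusive)
    virus_start: read position where the virus-aligned portion starts
    Returns the junction type and the (start, end) of the ambiguous bases
    in read coordinates.
    """
    if virus_start < host_end:
        # Bases in [virus_start, host_end) align to both host and virus.
        return "overlap", (virus_start, host_end)
    if virus_start > host_end:
        # Bases in [host_end, virus_start) align to neither reference.
        return "gap", (host_end, virus_start)
    # The two portions abut exactly, so there are no ambiguous bases.
    return "clean", (host_end, host_end)


overlap_case = junction_ambiguity(host_end=80, virus_start=75)
gap_case = junction_ambiguity(host_end=70, virus_start=75)
```

In the first case the five bases at read positions 75 to 79 could belong to either genome, and in the second the five bases at positions 70 to 74 belong to neither, so in both cases the junction can only be placed to within that interval.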

With the exception of the merged cluster output file, all files contain these columns:

Benchmarking

To reproduce the figures in the isling manuscript, see the readme in the benchmarking directory.