Transipedia / dekupl-run

Identify differentially expressed k-mers between RNA-Seq datasets
MIT License

DE-kupl is a pipeline that finds differentially expressed k-mers between RNA-Seq datasets. It is released under the MIT License.

Dekupl-run handles the first part of the DE-kupl pipeline from raw FASTQ to the production of contigs from differentially expressed k-mers.

Usage

Dekupl-run is a pipeline built with Snakemake. It works with a configuration file that you will use to set the list of samples and their conditions as well as parameters for the test.

  1. Create a config.json with the list of your samples, their conditions and the location of their FASTQ files. See the Configuration section for a description of the parameters.

  2. Run the pipeline. Replace CONFIG_JSON with the config file you have created, NB_THREADS with the number of threads and MAX_MEMORY with the maximum memory (in megabytes) you want DE-kupl to allocate. The exact command line can vary depending on the installation (Docker, Singularity, manual, etc.):

       dekupl-run --configfile CONFIG_JSON -jNB_THREADS --resources ram=MAX_MEMORY -p

  3. Explore results. Once Dekupl-run has completed successfully, the DE contigs it produced are located under DEkupl_results/A_vs_B_kmer_counts/merged-diff-counts.tsv.gz. They can be annotated using Dekupl-annotation and visualized with Dekupl-viewer.
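When dekupl-run is launched from a wrapper script, the command in step 2 can be assembled programmatically. The sketch below only builds the argument list given above. The helper name `build_dekupl_command` and the example values (4 threads, 16000 MB) are illustrative assumptions, not part of dekupl-run.

```python
import shlex


def build_dekupl_command(config_json, nb_threads, max_memory_mb):
    """Return the dekupl-run argument list from step 2 (hypothetical helper)."""
    if nb_threads < 1:
        raise ValueError("nb_threads must be at least 1")
    if max_memory_mb < 1:
        raise ValueError("max_memory_mb must be at least 1")
    return [
        "dekupl-run",
        "--configfile", str(config_json),
        f"-j{nb_threads}",                       # number of threads
        "--resources", f"ram={max_memory_mb}",   # maximum memory, in megabytes
        "-p",
    ]


# Example values are assumptions chosen for illustration only.
cmd = build_dekupl_command("config.json", 4, 16000)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where dekupl-run is installed. With Docker or Singularity the command is usually wrapped differently, so adapt the list to your installation.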

Installation

We recommend using Singularity to install dekupl-run, but you can also use Docker or a manual installation.

Option 1: Use dekupl-run with singularity

You can create a Singularity container from the Docker image. Two methods are available; both should work.

It's advised to mount some volumes (input/output directories). To mount the "/store" volume you should use "--bind /store:/store". That way, you can access the /store directory (in your configuration file, notably). Make sure your config.json is in the same folder as dekupl-run.simg.

Option 2: Use dekupl-run with Docker

Option 3: Build and run yourself (not recommended)

Configuration

Config file structure

Here is an example of a minimal config file with only mandatory information. You can copy this base and adapt it to your needs (see following paragraphs).

The parameter samples containing the list of samples with their associated conditions can be replaced with a TSV file using the samples_tsv option (see below).

Note: even though an arbitrary config file name can be specified on the command line (using --configfile), a non-empty file named ‘config.json’ must be present in the current directory. The file specified on the command line overrides ‘config.json’.

{
  "fastq_dir": "data",

  "dekupl_counter": {
    "min_recurrence": 2,
    "min_recurrence_abundance": 5
  },

  "diff_analysis": {
    "condition" : {
      "A": "A",
      "B": "B"
    },
    "pvalue_threshold": 0.05,
    "log2fc_threshold": 2
  },

  "samples": [{
      "name": "sample1",
      "condition": "A"
    }, {
      "name" : "sample2",
      "condition" : "A"
    }, {
      "name" : "sample3",
      "condition" : "B"
    }, {
      "name" : "sample4",
      "condition" : "B"
    }
  ]
}
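Before launching a run, it can help to sanity-check the config file. The sketch below is a hypothetical helper, not part of dekupl-run: it assumes that every sample's condition should match one of the two values declared under diff_analysis.condition and that sample names should be unique. Neither rule is stated by the pipeline documentation; they are assumptions derived from the example above.

```python
import json


def check_config(config):
    """Return a list of problems found in a DE-kupl-style config dict.

    The rules checked here are assumptions made for this sketch.
    """
    problems = []
    conditions = config.get("diff_analysis", {}).get("condition", {})
    declared = {conditions.get("A"), conditions.get("B")} - {None}
    if len(declared) != 2:
        problems.append("diff_analysis.condition must define two distinct conditions")
    seen = set()
    for sample in config.get("samples", []):
        name = sample.get("name")
        if name in seen:
            problems.append(f"duplicate sample name: {name}")
        seen.add(name)
        if sample.get("condition") not in declared:
            problems.append(f"sample {name} has an undeclared condition")
    return problems


# A shortened version of the minimal config, inlined for illustration.
config = json.loads("""
{
  "fastq_dir": "data",
  "diff_analysis": {"condition": {"A": "A", "B": "B"}},
  "samples": [
    {"name": "sample1", "condition": "A"},
    {"name": "sample3", "condition": "B"}
  ]
}
""")
print(check_config(config))
```

In practice the dict would come from `json.load(open("config.json"))`. An empty list means none of the assumed rules were violated.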

Parameters FAQ

How can I use DEkupl-run with non-human data? You need to specify your own FASTA using the transcript_fasta option, as well as a file mapping transcript_id to gene_id using the transcript_to_gene option.

How can I use DEkupl-run with single-end reads? Set the parameter lib_type to "single". You can also specify the fragment length (see the section Configuration for single-end libraries).

General configuration parameters

Configuration for single-end libraries

For single-end libraries, please specify the following parameters:

Notes: The FASTQ files for single-end samples will be located using the following path: {fastq_dir}/{sample_name}.fastq.gz. If present, the parameters r1_suffix and r2_suffix will be ignored.
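The path rule for single-end samples can be checked before a run with a few lines of Python. This is a minimal sketch, assuming a config dict shaped like the minimal example in this document. The helper name `single_end_fastq_paths` is hypothetical and the code does not reproduce the pipeline's own lookup logic.

```python
from pathlib import PurePosixPath


def single_end_fastq_paths(config):
    """Map each sample name to {fastq_dir}/{sample_name}.fastq.gz."""
    fastq_dir = PurePosixPath(config["fastq_dir"])
    return {
        sample["name"]: str(fastq_dir / (sample["name"] + ".fastq.gz"))
        for sample in config["samples"]
    }


# Illustrative config fragment; only the keys used by the helper are shown.
config = {
    "fastq_dir": "data",
    "lib_type": "single",
    "samples": [
        {"name": "sample1", "condition": "A"},
        {"name": "sample3", "condition": "B"},
    ],
}
paths = single_end_fastq_paths(config)
for name, path in paths.items():
    print(name, path)
```

Comparing these expected paths with the files actually present in fastq_dir is a quick way to catch naming mismatches early.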

Output files

The output directory of a DE-kupl run will have the following content:

├── {A}_vs_{B}_kmer_counts
│   ├── diff-counts.tsv.gz
│   ├── merged-diff-counts.tsv.gz
├── gene_expression
│   ├── {A}vs{B}-DEGs.tsv
├── kmer_counts
│   ├── normalization_factors.tsv
│   ├── raw-counts.tsv.gz
│   ├── noGENCODE-counts.tsv.gz
│   ├── {sample}.jf
│   ├── {sample}.txt.gz
│   ├── ...
├── metadata
│   ├── sample_conditions.tsv
│   ├── sample_conditions_full.tsv

The following table describes the output files produced by DE-kupl:

| File name | Description |
|---|---|
| diff-counts.tsv.gz | Contains k-mer counts from noGENCODE-counts.tsv.gz that have passed the differential testing. The output is a TSV file with the following columns: kmer, pvalue, meanA, meanB, log2FC, [SAMPLES]. |
| merged-diff-counts.tsv.gz | Contains assembled k-mers from diff-counts.tsv.gz. The output is a TSV file with the following columns: nb_merged_kmers, contig, kmer, pvalue, meanA, meanB, log2FC, [SAMPLES]. |
| raw-counts.tsv.gz | Contains raw k-mer counts of all libraries that have been filtered with the recurrence filters. |
| noGENCODE-counts.tsv.gz | Contains k-mer counts from raw-counts.tsv.gz after filtering out k-mers found in the reference transcripts (GENCODE by default). |
| sample_conditions_full.tsv | Tabulated file with sample names, conditions and normalization factors. sample_conditions.tsv is the sample |

Note: when limma-voom is used as the k-mer statistical method, meanA and meanB are expressed in CPM (counts per million).
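The contig table can be explored with the Python standard library alone. The sketch below assumes that merged-diff-counts.tsv.gz starts with a header row using the column names listed above. The function name `read_contigs`, the default thresholds and the two demo rows are made up for illustration and are not real DE-kupl output.

```python
import csv
import gzip
import os
import tempfile


def read_contigs(path, max_pvalue=0.05, min_abs_log2fc=2.0):
    """Yield rows of a merged-diff-counts file that pass both thresholds."""
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            if (float(row["pvalue"]) <= max_pvalue
                    and abs(float(row["log2FC"])) >= min_abs_log2fc):
                yield row


# Toy file with invented values, written only to demonstrate the reader.
header = ["nb_merged_kmers", "contig", "kmer", "pvalue",
          "meanA", "meanB", "log2FC", "sample1", "sample2"]
rows = [
    ["3", "ACGTACGTAC", "ACGTACGT", "0.001", "1.0", "8.0", "3.0", "1", "8"],
    ["1", "TTTTGGGG", "TTTTGGGG", "0.2", "4.0", "5.0", "0.3", "4", "5"],
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "merged-diff-counts.tsv.gz")
    with gzip.open(path, "wt", newline="") as out:
        csv.writer(out, delimiter="\t").writerows([header] + rows)
    kept = list(read_contigs(path))

print([row["contig"] for row in kept])
```

Because the reader is a generator, large result files are streamed row by row rather than loaded into memory at once.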

Whole-genome data

It is now possible to run DE-kupl-style analysis on whole-genome data, i.e. without using a reference transcriptome. To do so, please change data_type to WGS in config.json.
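Switching an existing configuration to whole-genome mode amounts to setting one key. The sketch below shows this on a config dict. The helper name `with_data_type` is hypothetical, and the example dict omits the other mandatory keys for brevity.

```python
import json


def with_data_type(config, data_type="WGS"):
    """Return a copy of the config dict with data_type set (hypothetical helper)."""
    updated = dict(config)
    updated["data_type"] = data_type
    return updated


config = {"fastq_dir": "data"}  # other mandatory keys omitted for brevity
updated = with_data_type(config)
print(json.dumps(updated, indent=2))
```

Writing `updated` back with `json.dump` produces a config.json that requests the whole-genome mode.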

FAQ