DE-kupl is a pipeline that finds differentially expressed k-mers between RNA-Seq datasets. It is released under the MIT License.
Dekupl-run handles the first part of the DE-kupl pipeline from raw FASTQ to the production of contigs from differentially expressed k-mers.
Dekupl-run is a pipeline built with Snakemake. It works with a configuration file that you will use to set the list of samples and their conditions as well as parameters for the test.
Create a config.json with the list of your samples, their conditions and the location of their FASTQ files. See the next section for a description of the parameters.
Run the pipeline. Replace CONFIG_JSON with the config file you have created, NB_THREADS with the number of threads and MAX_MEMORY with the maximum memory (in megabytes) you want DE-kupl to allocate. This command line can vary depending on the installation (Docker, Singularity, manual, etc.).
dekupl-run --configfile CONFIG_JSON -jNB_THREADS --resources ram=MAX_MEMORY -p
Explore results. Once Dekupl-run has been successfully executed, the DE contigs it produced are located under DEkupl_results/A_vs_B_kmer_counts/merged-diff-counts.tsv.gz. They can be annotated using Dekupl-annotation and visualized with Dekupl-viewer.
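When the pipeline is launched from a script, the command line shown above can be assembled programmatically. The following is only a minimal sketch: the helper name `build_dekupl_command` and its argument names are hypothetical, and it uses only the options that appear in the command above.

```python
import shlex


def build_dekupl_command(config_json, nb_threads, max_memory_mb):
    """Return the dekupl-run argument list for the options documented above.

    Hypothetical helper: it only mirrors the documented command line and
    does not check that dekupl-run is installed.
    """
    if nb_threads < 1:
        raise ValueError("nb_threads must be at least 1")
    if max_memory_mb < 1:
        raise ValueError("max_memory_mb must be at least 1")
    return [
        "dekupl-run",
        "--configfile", str(config_json),
        f"-j{nb_threads}",                      # number of threads
        "--resources", f"ram={max_memory_mb}",  # maximum memory, in megabytes
        "-p",
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_dekupl_command("config.json", 8, 16000)))
```

The list form can be passed directly to `subprocess.run`, which avoids quoting problems when the config path contains spaces.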
We recommend using Singularity to install dekupl-run, but you can also use Docker or a manual installation.
One can create a Singularity container from the Docker image. Two methods are available; both should work.
singularity build dekupl-run.simg docker://transipedia/dekupl-run:1.3.5
It's advised to mount some volumes (input/output directories). To mount the "/store" volume you should use "--bind /store:/store". That way, you can access the /store directory (in your configuration file, notably). Make sure your config.json is in the same folder as dekupl-run.simg.
singularity run --bind /store:/store ./dekupl-run.simg --configfile config.json -jNB_THREADS
docker pull transipedia/dekupl-run:1.3.5
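You may need to mount some volumes:
my-config.json to /dekupl/my-config.json
the FASTQ directory (the one defined in config.json) to /dekupl/FASTQ_DIR
the output directory (the one defined in config.json) to /dekupl/OUTPUT_DIR
The following command is an example with the FASTQ directory data and the output directory results set in config.json.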
docker run --rm -v ${PWD}/my-config.json:/dekupl/my-config.json \
-v ${PWD}/data:/dekupl/data -v ${PWD}/results:/dekupl/results \
transipedia/dekupl-run:1.3.5 --configfile my-config.json \
-jNB_THREADS --resources ram=MAX_MEMORY -p
Rscript install_r_packages.R
git clone --recursive https://github.com/Transipedia/dekupl-run.git
snakemake -jNB_THREADS --resources ram=MAX_MEMORY -p
Here is an example of a minimal config file with only mandatory information. You can copy this base and adapt it to your needs (see following paragraphs).
The parameter samples, containing the list of samples with their associated conditions, can be replaced with a TSV file using the samples_tsv option (see below).
Note: even though an arbitrary config file name can be specified on the command line (using --configfile), a non-empty file named ‘config.json’ must be present in the current directory. ‘config.json’ will be overridden by the file specified on the command line.
{
"fastq_dir": "data",
"dekupl_counter": {
"min_recurrence": 2,
"min_recurrence_abundance": 5
},
"diff_analysis": {
"condition" : {
"A": "A",
"B": "B"
},
"pvalue_threshold": 0.05,
"log2fc_threshold": 2
},
"samples": [{
"name": "sample1",
"condition": "A"
}, {
"name" : "sample2",
"condition" : "A"
}, {
"name" : "sample3",
"condition" : "B"
}, {
"name" : "sample4",
"condition" : "B"
}
]
}
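Before launching a run, it can help to check that the config is consistent. The sketch below is not part of dekupl-run: `check_config` is a hypothetical helper. It assumes that the four top-level keys of the minimal example above are required, and that the values of the "condition" mapping are the condition names used by the samples. It does not handle the samples_tsv alternative.

```python
import json

# Assumption: the top-level keys of the minimal example are the mandatory ones.
MANDATORY_KEYS = ("fastq_dir", "dekupl_counter", "diff_analysis", "samples")


def check_config(config):
    """Return a list of problems found in a dekupl-run style config dict."""
    problems = [f"missing key: {key}" for key in MANDATORY_KEYS if key not in config]
    if problems:
        return problems
    # Assumption: the values of "condition" are the names used by the samples.
    declared = set(config["diff_analysis"].get("condition", {}).values())
    seen = set()
    for sample in config["samples"]:
        name = sample.get("name")
        condition = sample.get("condition")
        if not name:
            problems.append("sample without a name")
        if condition not in declared:
            problems.append(f"sample {name!r} uses undeclared condition {condition!r}")
        seen.add(condition)
    for condition in sorted(declared - seen):
        problems.append(f"condition {condition!r} has no sample")
    return problems


if __name__ == "__main__":
    with open("config.json", encoding="utf-8") as handle:
        for problem in check_config(json.load(handle)):
            print(problem)
```

An empty list means that no problem was detected by these checks; it does not guarantee that the run will succeed.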
How can I use DEkupl-run with non-human data?
You need to specify your own FASTA using the transcript_fasta option, as well as a file mapping transcript_id to gene_id using the transcript_to_gene option.
How can I use DEkupl-run with single-end reads?
Set the parameter lib_type to "single". You can also specify the fragment length (see section Configuration for single-end libraries).
lib_type: type of the library (default: rf). Specify either rf for reverse-forward strand-specific libraries, fr for forward-reverse strand-specific libraries, or unstranded for unstranded libraries.
Output directory (default: DEkupl_result).
(default: ./, aka the current directory)
r2_suffix for the second FASTQ.
samples: each sample is described by a name and a condition. The FASTQ files for a sample will be located using the following pattern: fastq_dir/sample_name_{1,2}.fastq.gz.
You can also provide a TSV file with your samples and conditions with the samples_tsv parameter (see below).
"ref_masking":transcriptome_masking.fa
"ref_kallisto":transciptome_kallisto.fa
(default: mask). Setting nomask will skip the masking step.
For single-end libraries, please specify the following parameters:
lib_type: single in the case of a single-end strand-specific library, or unstranded for single-end unstranded libraries.
Fragment length (default value: 200).
(default value: 30).
Notes:
The FASTQ files for single-end samples will be located using the following path: {fastq_dir}/{sample_name}.fastq.gz
If present, the parameters r1_suffix and r2_suffix will be ignored.
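The two file-naming patterns described above can be summarised in a short sketch. The function `expected_fastq_paths` and its `single_end` flag are hypothetical names used for illustration only. The sketch assumes the default naming and ignores any custom r1_suffix or r2_suffix.

```python
from pathlib import PurePosixPath


def expected_fastq_paths(fastq_dir, sample_name, single_end=False):
    """Return the FASTQ paths expected for one sample.

    Paired-end: fastq_dir/sample_name_{1,2}.fastq.gz
    Single-end: {fastq_dir}/{sample_name}.fastq.gz
    Assumption: default naming only, custom suffixes are not handled.
    """
    base = PurePosixPath(fastq_dir)
    if single_end:
        return [str(base / f"{sample_name}.fastq.gz")]
    return [str(base / f"{sample_name}_{mate}.fastq.gz") for mate in (1, 2)]


if __name__ == "__main__":
    for path in expected_fastq_paths("data", "sample1"):
        print(path)
```

Checking that each returned path exists before starting a run is a cheap way to catch a misnamed file early.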
The output directory of a DE-kupl run will have the following content:
├── {A}_vs_{B}_kmer_counts
│ ├── diff-counts.tsv.gz
│ ├── merged-diff-counts.tsv.gz
├── gene_expression
│ ├── {A}vs{B}-DEGs.tsv
├── kmer_counts
│ ├── normalization_factors.tsv
│ ├── raw-counts.tsv.gz
│ ├── noGENCODE-counts.tsv.gz
│ ├── {sample}.jf
│ ├── {sample}.txt.gz
│ ├── ...
├── metadata
│ ├── sample_conditions.tsv
│ ├── sample_conditions_full.tsv
The following table describes the output files produced by DE-kupl:
FileName | Description |
---|---|
diff-counts.tsv.gz | Contains k-mer counts from noGENCODE-counts.tsv.gz that have passed the differential testing. Output format is a TSV with the following columns: kmer pvalue meanA meanB log2FC [SAMPLES]. |
merged-diff-counts.tsv.gz | Contains assembled k-mers from diff-counts.tsv.gz. Output format is a TSV with the following columns: nb_merged_kmers contig kmer pvalue meanA meanB log2FC [SAMPLES]. |
raw-counts.tsv.gz | Contains raw k-mer counts of all libraries that have been filtered with the recurrence filters. |
noGENCODE-counts.tsv.gz | Contains k-mer counts from raw-counts.tsv.gz after filtering out the k-mers found in the reference transcripts (e.g. GENCODE, the default). |
sample_conditions_full.tsv | Tabulated file with sample names, conditions and normalization factors. sample_conditions.tsv is the sample |
Note: when limma-voom is used as the k-mer statistical method, meanA and meanB are in CPM (counts per million).
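As an illustration of how the result file can be consumed, here is a minimal sketch that reads merged-diff-counts.tsv.gz with the Python standard library. It assumes the file starts with a header row using the column names listed above (contig, pvalue, log2FC). The function name `top_contigs` and the cutoff argument are hypothetical.

```python
import csv
import gzip


def top_contigs(path, min_abs_log2fc=2.0):
    """Yield (contig, pvalue, log2FC) for rows whose |log2FC| passes the cutoff.

    Assumption: the TSV has a header row with the column names listed
    in the table above.
    """
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            log2fc = float(row["log2FC"])
            if abs(log2fc) >= min_abs_log2fc:
                yield row["contig"], float(row["pvalue"]), log2fc


if __name__ == "__main__":
    path = "DEkupl_results/A_vs_B_kmer_counts/merged-diff-counts.tsv.gz"
    for contig, pvalue, log2fc in top_contigs(path):
        print(contig, pvalue, log2fc, sep="\t")
```

Reading the file row by row keeps memory usage low, which matters when many contigs pass the differential test.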
It is now possible to run DE-kupl-style analysis on whole-genome data, i.e. without using a reference transcriptome.
To do so, please change data_type to WGS in config.json.
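The change can also be scripted. The sketch below assumes that data_type is a top-level key of config.json, which the text above suggests but does not state explicitly. `set_data_type_wgs` is a hypothetical helper name.

```python
import json


def set_data_type_wgs(config_path):
    """Set data_type to "WGS" in a JSON config file and return the new config.

    Assumption: data_type is a top-level key of the config.
    """
    with open(config_path, encoding="utf-8") as handle:
        config = json.load(handle)
    config["data_type"] = "WGS"
    with open(config_path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
        handle.write("\n")
    return config


if __name__ == "__main__":
    set_data_type_wgs("config.json")
```

The file is rewritten in place, so keeping a copy of the original config is advisable.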
Remove the metadata folder in order to force Snakemake to re-make all targets that depend on this file.
Run which Rscript and which R and make sure they point to the same installation of R.
Install coreutils with brew install coreutils. This package provides GNU versions of common Unix commands like "sort", "join", etc.