NMD workflow

This repository exists for reproducibility purposes. The data generated by this workflow powers the NMDtxDB. Raw data are available at the SRA under accession PRJNA1054031. RNA-seq reads need to be pre-processed and aligned before they are used as input.

Workflow description

The workflow comprises two parts. The first part is a Snakemake workflow (workflow). The second part performs CDS detection and integration.

Usage

Part 1

This part runs the workflow that generates the de novo transcriptome and computes differential gene expression (DGE) and differential transcript expression (DTE).

snakemake --jobs 10 --cores 10 --profile slurm --printshellcmds --reason --use-singularity --use-conda --use-envmodules

To produce the DAG:

snakemake --rulegraph | dot -Tsvg > rulegraph.svg

Part 2

This part covers the workflow for CDS detection. Here is an example using sequences trimmed at the Ensembl start codon:

awk '{ print $1 "\t" $7-1 "\t" $8 "\t" $4 "\t" 1 "\t" $6; }' GRCh38.102.gtf > ref_cds.bed

Rscript cds/StartATG_to_cDNA.R ref_cds.bed

perl longorf2_fwd_v2.pl --input GRCh38.102.fa --startcodon ref_cds_cDNA.bed > ensembl_longorf2.fa
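
The coordinate handling in the awk step above can be illustrated on a toy record. This is a minimal sketch, not part of the repository: the file names toy_input.tsv and toy_cds.bed and the record itself are hypothetical, and the column layout (identifier in column 4, strand in column 6, 1-based start in column 7, inclusive end in column 8) is an assumption read off the awk program rather than something the source states. BED uses 0-based, half-open intervals, so the start is decremented by one while the end is left unchanged.

```shell
# Toy input: one record in an ASSUMED layout (not confirmed by the source):
# col 1 = chromosome, col 4 = identifier, col 6 = strand,
# col 7 = 1-based start, col 8 = inclusive end.
printf 'chr1\tsrc\tstart_codon\tENST_TOY\t.\t+\t101\t103\n' > toy_input.tsv

# Same awk program as in the example above: the start becomes 0-based
# (start - 1), the end stays as-is, and the score column is fixed to 1.
awk '{ print $1 "\t" $7-1 "\t" $8 "\t" $4 "\t" 1 "\t" $6; }' toy_input.tsv > toy_cds.bed

# Show the resulting six-column BED line.
cat toy_cds.bed
```

With the toy record, the 1-based interval 101-103 becomes the BED interval 100-103, which covers the same three bases.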

See the longorf_integration_bed12 script, which details how the multiple sources are integrated.

To retrieve the other sources:

wget https://ftp.ebi.ac.uk/pub/databases/gencode/riboseq_orfs/data/Ribo-seq_ORFs.bed
wget https://api.openprot.org/api/2.0/HS/downloads/human-openprot-2_0-refprots+altprots+isoforms-uniprot2017_03_07.bed.zip

License

This project is licensed under the MIT License.

Funding

This work was supported by the DFG Research Infrastructure West German Genome Center, project 407493903, as part of the Next-Generation Sequencing Competence Network, project 423957469.
