dieterich-lab / nmd-wf

MIT License

NMD workflow

This repository exists for reproducibility purposes. The data generated by this workflow powers NMDtxDB. Raw data are available from the SRA under accession PRJNA1054031. RNA-seq reads need to be pre-processed and aligned before they are used as input.

Workflow description

The workflow comprises two parts. The first part is a Snakemake workflow (workflow). The second part performs CDS detection and integration.

Usage

Part 1

This part runs the workflow that generates the de novo transcriptome and computes differential gene expression (DGE) and differential transcript expression (DTE).

snakemake --jobs 10 --cores 10 --profile slurm --printshellcmds --reason --use-singularity --use-conda --use-envmodules

To produce the rule graph (a DAG of the workflow rules):

snakemake --rulegraph | dot -Tsvg > rulegraph.svg

Part 2

This part covers CDS detection. Here is an example using sequences trimmed at the Ensembl start codon:

awk '{ print $1 "\t" $7-1 "\t" $8 "\t" $4 "\t" 1 "\t" $6; }' GRCh38.102.gtf > ref_cds.bed

Rscript cds/StartATG_to_cDNA.R ref_cds.bed

perl longorf2_fwd_v2.pl --input GRCh38.102.fa --startcodon ref_cds_cDNA.bed > ensembl_longorf2.fa 
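
The awk step above subtracts 1 from the start field, presumably to turn a 1-based start into the 0-based start that BED uses. A minimal sketch of that conversion on a single made-up row follows. The row, the function name `to_bed6`, and the column layout (chromosome in field 1, name in field 4, strand in field 6, start in field 7, end in field 8, as implied by the awk fields) are assumptions for illustration, not a description of the actual input file.

```shell
# Hypothetical wrapper around the awk expression used above.
# Assumed input fields: 1=chromosome, 4=name, 6=strand, 7=start (1-based), 8=end.
to_bed6() {
  awk '{ print $1 "\t" $7-1 "\t" $8 "\t" $4 "\t" 1 "\t" $6; }' "$@"
}

# Made-up row, not real annotation data.
bed_line=$(printf '1\tsrc\tfeat\tENST_example\t.\t+\t101\t200\n' | to_bed6)
printf '%s\n' "$bed_line"
```

With this toy row, the start 101 becomes 100 while the end stays at 200. The fifth output column is the constant 1 that the command writes into the BED score field.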

See the longorf_integration_bed12 script, which details how the multiple sources are integrated.

To retrieve the other sources:

wget https://ftp.ebi.ac.uk/pub/databases/gencode/riboseq_orfs/data/Ribo-seq_ORFs.bed
wget https://api.openprot.org/api/2.0/HS/downloads/human-openprot-2_0-refprots+altprots+isoforms-uniprot2017_03_07.bed.zip
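
Before combining annotation sources, it can help to check how many columns each BED file actually has, since BED files vary in how many optional columns they carry. The following is a minimal sketch of such a check, not part of the repository: the helper name `bed_column_counts` is hypothetical, and the toy file stands in for a downloaded BED file.

```shell
# Hypothetical helper: print each distinct column count and the number of rows with it.
# Lines starting with "track", "browser", or "#" and empty lines are skipped.
bed_column_counts() {
  awk -F '\t' '!/^(track|browser|#)/ && NF { n[NF]++ } END { for (k in n) print k "\t" n[k] }' "$1" | sort -n
}

# Toy file standing in for a downloaded BED file; not real data.
tmp_bed=$(mktemp)
printf 'chr1\t100\t200\torf1\t0\t+\nchr1\t300\t400\torf2\t0\t-\nchr2\t10\t20\n' > "$tmp_bed"
counts=$(bed_column_counts "$tmp_bed")
printf '%s\n' "$counts"
rm -f "$tmp_bed"
```

For the toy file, this reports one row with 3 columns and two rows with 6 columns. A file with a single column count is easier to integrate than one with mixed widths.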

License

This project is licensed under the MIT License.

Funding

This work was supported by the DFG Research Infrastructure West German Genome Center, project 407493903, as part of the Next-Generation Sequencing Competence Network, project 423957469.