NCI-CGR / Gencode_microRNA-seq

microRNA-seq workflow utilizing STAR to generate a Sample-Gene read count matrix
Creative Commons Zero v1.0 Universal

Genome fasta and GTF files #2

Open komaljain3 opened 1 year ago

komaljain3 commented 1 year ago

In the pipeline developed in 2018, the genome file used is:

/fdb/igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa

And the GTF annotation file is:

ENCFF628BVT.gtf

The same GTF file can be downloaded from the ENCODE website:

https://www.encodeproject.org/files/ENCFF628BVT/

This GTF file contains 12,279 miRNA entries and three types of genomic features: gene, transcript, and exon:

[Screenshot 2023-06-07 at 3:36:59 PM]

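The feature-type breakdown described above can be reproduced with a short script. This is a minimal sketch, not the command used for the screenshot; the function name `count_features` is hypothetical, and it assumes the file has been downloaded, decompressed, and is tab-delimited as the GTF format requires.

```python
from collections import Counter


def count_features(gtf_lines):
    """Count the column-3 feature types (gene, transcript, exon, ...) in GTF lines."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed nine-column GTF record
        counts[fields[2]] += 1
    return counts


# Example usage (the local path is an assumption):
# with open("ENCFF628BVT.gtf") as handle:
#     for feature, n in count_features(handle).most_common():
#         print(feature, n, sep="\t")
```

Reading the file line by line keeps memory use flat, which also makes the same function usable on the much larger full Ensembl GTF.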
komaljain3 commented 1 year ago

Genome References

The latest GTF (v43) annotation from GENCODE is available here:

https://www.gencodegenes.org/human/

We downloaded the top-level, soft-masked fasta and the gtf file from Ensembl.

https://useast.ensembl.org/info/data/ftp/index.html

TOPLEVEL

These files contain all sequence regions flagged as toplevel in an Ensembl schema. This includes chromosomes, regions not assembled into chromosomes, and N-padded haplotype/patch regions.

EXAMPLES

Toplevel sequences unmasked: Homo_sapiens.GRCh37.dna.toplevel.fa.gz

Toplevel soft/hard masked sequences: Homo_sapiens.GRCh37.dna_sm.toplevel.fa.gz Homo_sapiens.GRCh37.dna_rm.toplevel.fa.gz

Sequence Type

Download Files

# Fasta File
wget https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.toplevel.fa.gz

# Annotations
wget https://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.chr.gtf.gz

komaljain3 commented 1 year ago

The genome assembly downloaded from Ensembl corresponds to GenBank assembly ID GCA_000001405.28, which is GRCh38.p13. However, GRCh38.p13 has been superseded by GRCh38.p14.

https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/README

[Screenshot 2023-06-07 at 4:39:21 PM] [Screenshot 2023-06-07 at 4:39:32 PM]

The GRCh38.p14 assembly can be downloaded from NCBI here (the GCF_ accession is the RefSeq copy of the GenBank assembly):

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/

The NCBI FASTA headers (RefSeq accessions) look like this:

 jaink4$ grep ">" GCF_000001405.40_GRCh38.p14_genomic.fna
>NC_000001.11 Homo sapiens chromosome 1, GRCh38.p14 Primary Assembly
>NT_187361.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG1_UNLOCALIZED
>NT_187362.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG2_UNLOCALIZED
>NT_187363.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG3_UNLOCALIZED
>NT_187364.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG4_UNLOCALIZED
>NT_187365.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG5_UNLOCALIZED
>NT_187366.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG6_UNLOCALIZED
>NT_187367.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG7_UNLOCALIZED
>NT_187368.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG8_UNLOCALIZED
>NT_187369.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG9_UNLOCALIZED
>NC_000002.12 Homo sapiens chromosome 2, GRCh38.p14 Primary Assembly
>NT_187370.1 Homo sapiens chromosome 2 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR2_RANDOM_CTG1
>NT_187371.1 Homo sapiens chromosome 2 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR2_RANDOM_CTG2
>NC_000003.12 Homo sapiens chromosome 3, GRCh38.p14 Primary Assembly
>NT_167215.1 Homo sapiens chromosome 3 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR3UN_CTG2
>NC_000004.12 Homo sapiens chromosome 4, GRCh38.p14 Primary Assembly
>NT_113793.3 Homo sapiens chromosome 4 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR4_RANDOM_CTG4
>NC_000005.10 Homo sapiens chromosome 5, GRCh38.p14 Primary Assembly

With the Ensembl genome FASTA, the headers look like this:

jaink4@nci-cgr:/DCEG/CGF/Research/RD168_Chernobyl_TN-Pairs/ANALYSIS_miR/2023-05-30-miRNA-pipeline-test/ref$ grep ">" Homo_sapiens.GRCh38.dna_sm.toplevel.fa
>1 dna_sm:chromosome chromosome:GRCh38:1:1:248956422:1 REF
>2 dna_sm:chromosome chromosome:GRCh38:2:1:242193529:1 REF
>3 dna_sm:chromosome chromosome:GRCh38:3:1:198295559:1 REF
>4 dna_sm:chromosome chromosome:GRCh38:4:1:190214555:1 REF
>5 dna_sm:chromosome chromosome:GRCh38:5:1:181538259:1 REF
>6 dna_sm:chromosome chromosome:GRCh38:6:1:170805979:1 REF
>7 dna_sm:chromosome chromosome:GRCh38:7:1:159345973:1 REF
>8 dna_sm:chromosome chromosome:GRCh38:8:1:145138636:1 REF

The GTF from ENCODE has this format:

jaink4@nci-cgr:/DCEG/CGF/Research/RD168_Chernobyl_TN-Pairs/ANALYSIS_miR/2023-05-30-miRNA-pipeline-test/ref$ head ENCFF628BVT.gtf
chr1    ENSEMBL gene    17369   17436   .   -   .   gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; level 3;
chr1    ENSEMBL transcript  17369   17436   .   -   .   gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; level 3; tag "basic"; transcript_support_level "NA";
chr1    ENSEMBL exon    17369   17436   .   -   .   gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; exon_number 1; exon_id "ENSE00003746039.1"; level 3; tag "basic"; transcript_support_level "NA";
chr1    ENSEMBL gene    30366   30503   .   +   .   gene_id "ENSG00000274890.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR1302-2"; level 3;
chr1    ENSEMBL transcript  30366   30503   .   +   .   gene_id "ENSG00000274890.1"; transcript_id "ENST00000607096.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR1302-2"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR1302-2-201"; level 3; tag "basic"; transcript_support_level "NA";

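We can do a quick check to compare the Ensembl and NCBI references; the chromosome sequences should be the same. The Ensembl chromosome names (1, 2, ...) are closer to those in the ENCODE GTF (chr1, chr2, ...) than the NCBI accessions (NC_000001.11, ...) are, although they still differ by the chr prefix.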
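Because the aligner matches annotation records to the genome by sequence name, the names in column 1 of the GTF need to agree with the FASTA headers. The sketch below shows one way to rewrite ENCODE/GENCODE-style names into Ensembl style. The function names are hypothetical, and it only handles the chr prefix and the mitochondrial genome (chrM in GENCODE, MT in Ensembl); scaffold and patch names pass through unchanged and would need to be checked separately.

```python
def gencode_to_ensembl_name(seqname):
    """Map a GENCODE/UCSC-style name (chr1, chrX, chrM) to Ensembl style (1, X, MT)."""
    if seqname == "chrM":
        return "MT"  # Ensembl names the mitochondrial genome MT
    if seqname.startswith("chr"):
        return seqname[3:]  # drop the "chr" prefix
    return seqname  # scaffolds and already-converted names are left alone


def rename_gtf_line(line):
    """Rewrite column 1 of a GTF record; comment lines are returned unchanged."""
    if line.startswith("#"):
        return line
    seqname, sep, rest = line.partition("\t")
    return gencode_to_ensembl_name(seqname) + sep + rest


# Example usage (file names are assumptions):
# with open("ENCFF628BVT.gtf") as src, open("ENCFF628BVT.ensembl_names.gtf", "w") as dst:
#     for record in src:
#         dst.write(rename_gtf_line(record))
```

The alternative is to leave the GTF alone and rename the FASTA headers instead; either works as long as both files end up using one naming scheme.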

komaljain3 commented 1 year ago

GTF format from Ensembl before and after subsetting to miRNA records:

jaink4@nci-cgr:/DCEG/CGF/Research/RD168_Chernobyl_TN-Pairs/ANALYSIS_miR/2023-05-30-miRNA-pipeline-test/ref$ head Homo_sapiens.GRCh38.109.gtf
#!genome-build GRCh38.p13
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession GCA_000001405.28
#!genebuild-last-updated 2022-11
1   ensembl_havana  gene    1471765 1497848 .   +   .   gene_id "ENSG00000160072"; gene_version "20"; gene_name "ATAD3B"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
1   ensembl_havana  transcript  1471765 1497848 .   +   .   gene_id "ENSG00000160072"; gene_version "20"; transcript_id "ENST00000673477"; transcript_version "1"; gene_name "ATAD3B"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "ATAD3B-206"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS30"; tag "basic"; tag "Ensembl_canonical"; tag "MANE_Select";
1   ensembl_havana  exon    1471765 1472089 .   +   .   gene_id "ENSG00000160072"; gene_version "20"; transcript_id "ENST00000673477"; transcript_version "1"; exon_number "1"; gene_name "ATAD3B"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "ATAD3B-206"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS30"; exon_id "ENSE00003889014"; exon_version "1"; tag "basic"; tag "Ensembl_canonical"; tag "MANE_Select";
1   ensembl_havana  CDS 1471885 1472089 .   +   0   gene_id "ENSG00000160072"; gene_version "20"; transcript_id "ENST00000673477"; transcript_version "1"; exon_number "1"; gene_name "ATAD3B"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "ATAD3B-206"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS30"; protein_id "ENSP00000500094"; protein_version "1"; tag "basic"; tag "Ensembl_canonical"; tag "MANE_Select";
1   ensembl_havana  start_codon 1471885 1471887 .   +   0   gene_id "ENSG00000160072"; gene_version "20"; transcript_id "ENST00000673477"; transcript_version "1"; exon_number "1"; gene_name "ATAD3B"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "ATAD3B-206"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS30"; tag "basic"; tag "Ensembl_canonical"; tag "MANE_Select";
jaink4@nci-cgr:/DCEG/CGF/Research/RD168_Chernobyl_TN-Pairs/ANALYSIS_miR/2023-05-30-miRNA-pipeline-test/ref$ head Homo_sapiens.GRCh38.109.miRNA.gtf
1   mirbase gene    187891  187958  .   -   .   gene_id "ENSG00000273874"; gene_version "1"; gene_name "MIR6859-2"; gene_source "mirbase"; gene_biotype "miRNA";
1   mirbase transcript  187891  187958  .   -   .   gene_id "ENSG00000273874"; gene_version "1"; transcript_id "ENST00000612080"; transcript_version "1"; gene_name "MIR6859-2"; gene_source "mirbase"; gene_biotype "miRNA"; transcript_name "MIR6859-2-201"; transcript_source "mirbase"; transcript_biotype "miRNA"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA";
1   mirbase exon    187891  187958  .   -   .   gene_id "ENSG00000273874"; gene_version "1"; transcript_id "ENST00000612080"; transcript_version "1"; exon_number "1"; gene_name "MIR6859-2"; gene_source "mirbase"; gene_biotype "miRNA"; transcript_name "MIR6859-2-201"; transcript_source "mirbase"; transcript_biotype "miRNA"; exon_id "ENSE00003737837"; exon_version "1"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA";
1   mirbase gene    5564071 5564143 .   +   .   gene_id "ENSG00000264341"; gene_version "1"; gene_source "mirbase"; gene_biotype "miRNA";
1   mirbase transcript  5564071 5564143 .   +   .   gene_id "ENSG00000264341"; gene_version "1"; transcript_id "ENST00000579887"; transcript_version "1"; gene_source "mirbase"; gene_biotype "miRNA"; transcript_source "mirbase"; transcript_biotype "miRNA"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA";
1   mirbase exon    5564071 5564143 .   +   .   gene_id "ENSG00000264341"; gene_version "1"; transcript_id "ENST00000579887"; transcript_version "1"; exon_number "1"; gene_source "mirbase"; gene_biotype "miRNA"; transcript_source "mirbase"; transcript_biotype "miRNA"; exon_id "ENSE00002721598"; exon_version "1"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA";
1   mirbase gene    5862672 5862741 .   -   .   gene_id "ENSG00000264101"; gene_version "1"; gene_name "MIR4689"; gene_source "mirbase"; gene_biotype "miRNA";
1   mirbase transcript  5862672 5862741 .   -   .   gene_id "ENSG00000264101"; gene_version "1"; transcript_id "ENST00000582517"; transcript_version "1"; gene_name "MIR4689"; gene_source "mirbase"; gene_biotype "miRNA"; transcript_name "MIR4689-201"; transcript_source "mirbase"; transcript_biotype "miRNA"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA";
1   mirbase exon    5862672 5862741 .   -   .   gene_id "ENSG00000264101"; gene_version "1"; transcript_id "ENST00000582517"; transcript_version "1"; exon_number "1"; gene_name "MIR4689"; gene_source "mirbase"; gene_biotype "miRNA"; transcript_name "MIR4689-201"; transcript_source "mirbase"; transcript_biotype "miRNA"; exon_id "ENSE00002689481"; exon_version "1"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA";
1   mirbase gene    18883202    18883275    .   -   .   gene_id "ENSG00000265606"; gene_version "1"; gene_name "MIR4695"; gene_source "mirbase"; gene_biotype "miRNA";
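The command used to produce the miRNA subset is not shown in this thread. As an illustration only, a filter along the following lines would yield output of the same shape: it keeps records whose `gene_biotype` attribute (Ensembl) or `gene_type` attribute (GENCODE/ENCODE) equals `miRNA`, and drops the `#!` header lines. The function name `subset_by_biotype` is hypothetical.

```python
import re

# Ensembl GTFs use gene_biotype; GENCODE/ENCODE GTFs use gene_type.
BIOTYPE = re.compile(r'gene_(?:biotype|type) "([^"]+)"')


def subset_by_biotype(gtf_lines, biotype="miRNA"):
    """Yield GTF records whose gene biotype attribute equals `biotype`."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header lines are not carried over
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip anything that is not a nine-column record
        match = BIOTYPE.search(fields[8])
        if match and match.group(1) == biotype:
            yield line


# Example usage (file names follow the transcript above; paths are assumptions):
# with open("Homo_sapiens.GRCh38.109.gtf") as src, \
#         open("Homo_sapiens.GRCh38.109.miRNA.gtf", "w") as dst:
#     dst.writelines(subset_by_biotype(src))
```

Matching on the attribute column rather than on the source column (`mirbase`) avoids missing miRNA genes that were annotated by another source.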
komaljain3 commented 1 year ago

NCBI reference vs. Ensembl reference:

The NCBI reference uses uppercase Ns and the Ensembl reference uses lowercase Ns (n), as the first lines of each sequence show. However, both appear to be soft-masked genomic DNA, i.e. repeats and low-complexity regions are written in lowercase rather than removed (screenshots below).

(base) NCI-02225697-ML:microRNA jaink4$ head GCF_000001405.40_GRCh38.p14_genomic.fna
>NC_000001.11 Homo sapiens chromosome 1, GRCh38.p14 Primary Assembly
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Ensembl reference:

jaink4@nci-cgr:/DCEG/CGF/Research/RD168_Chernobyl_TN-Pairs/ANALYSIS_miR/2023-05-30-miRNA-pipeline-test/ref$ head  Homo_sapiens.GRCh38.dna_sm.toplevel.fa
>1 dna_sm:chromosome chromosome:GRCh38:1:1:248956422:1 REF
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn

ENSEMBL

[Screenshot 2023-06-07 at 6:18:43 PM]

GenBank

[Screenshot 2023-06-07 at 6:19:41 PM]
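Both observations above (soft-masking, and whether the two references carry the same sequence) can be checked without screenshots. The following is a rough sketch under stated assumptions: the function name `fasta_stats` is hypothetical, the FASTA files are decompressed, and "same sequence" is taken to mean identical after uppercasing. For each record it reports the length, the number of lowercase characters, the number of N/n characters, and a SHA-256 digest of the uppercased sequence, which does not depend on line width (80 vs. 60 columns) or on masking.

```python
import hashlib


def fasta_stats(lines):
    """Yield (name, length, n_lowercase, n_unknown, sha256_of_uppercased) per record."""
    name, length, lower, unknown, digest = None, 0, 0, 0, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length, lower, unknown, digest.hexdigest()
            name = line[1:].split()[0]  # first token of the header, e.g. "1" or "NC_000001.11"
            length = lower = unknown = 0
            digest = hashlib.sha256()
        elif line and name is not None:
            length += len(line)
            lower += sum(1 for base in line if base.islower())  # includes lowercase n
            unknown += line.count("N") + line.count("n")
            digest.update(line.upper().encode("ascii"))
    if name is not None:
        yield name, length, lower, unknown, digest.hexdigest()


# Example usage (file names follow the transcripts above; paths are assumptions):
# with open("Homo_sapiens.GRCh38.dna_sm.toplevel.fa") as handle:
#     for record in fasta_stats(handle):
#         print(*record, sep="\t")
```

If chromosome 1 from each file yields the same length and digest, the two references agree on that sequence; a nonzero lowercase count beyond the n padding indicates soft-masking.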