how to make a useful .txdb for quantification

Shiywa commented 10 months ago

hello Roleren, recently, I was adding some virus genes in human genome, I did it like add fasta sequence in human fasta and make some simple annotation info at the tail of gtf files, just like:

GFP unknown gene    1   720 .   +   .   gene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";
GFP unknown transcript  1   720 .   +   .   gene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";
GFP unknown exon    1   720 .   +   .   gene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";
GFP unknown CDS 1   720 .   +   .   gene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";
EGFP    unknown gene    1   732 .   +   .   gene_id "EGFP"; transcript_id "EGFP"; gene_name "EGFP"; gene_biotype "protein_coding";
EGFP    unknown transcript  1   732 .   +   .   gene_id "EGFP"; transcript_id "EGFP"; gene_name "EGFP"; gene_biotype "protein_coding";
EGFP    unknown exon    1   732 .   +   .   gene_id "EGFP"; transcript_id "EGFP"; gene_name "EGFP"; gene_biotype "protein_coding";
EGFP    unknown CDS 1   732 .   +   .   gene_id "EGFP"; transcript_id "EGFP"; gene_name "EGFP"; gene_biotype "protein_coding";
NS2A    unknown gene    1   711 .   +   .   gene_id "NS2A"; transcript_id "NS2A"; gene_name "NS2A"; gene_biotype "protein_coding";
NS2A    unknown transcript  1   711 .   +   .   gene_id "NS2A"; transcript_id "NS2A"; gene_name "NS2A"; gene_biotype "protein_coding";
NS2A    unknown exon    1   711 .   +   .   gene_id "NS2A"; transcript_id "NS2A"; gene_name "NS2A"; gene_biotype "protein_coding";
NS2A    unknown CDS 1   711 .   +   .   gene_id "NS2A"; transcript_id "NS2A"; gene_name "NS2A"; gene_biotype "protein_coding";
NS4B    unknown gene    1   855 .   +   .   gene_id "NS4B"; transcript_id "NS4B"; gene_name "NS4B"; gene_biotype "protein_coding";
NS4B    unknown transcript  1   855 .   +   .   gene_id "NS4B"; transcript_id "NS4B"; gene_name "NS4B"; gene_biotype "protein_coding";
NS4B    unknown exon    1   855 .   +   .   gene_id "NS4B"; transcript_id "NS4B"; gene_name "NS4B"; gene_biotype "protein_coding";
NS4B    unknown CDS 1   855 .   +   .   gene_id "NS4B"; transcript_id "NS4B"; gene_name "NS4B"; gene_biotype "protein_coding";

then I noticed that ORFik quantified transcripts number in the step ORFikQC(df.rfp) and need to use the gtf.txdb file. I aimed to get a new txdb file which include the added information like GFP, NS2A et.

Then I tried to make txdb by myself, like:

library(GenomicFeatures)
txdb <- makeTxDbFromGFF("/home/bio/biosoft/riboseq_data/references/Homo_sapiens_GRCh38_101_GFP/Homo_sapiens.GRCh38.110._ensembl.gtf",format = "gtf",organism = "Homo sapiens",
                taxonomyId = 9606, chrominfo = new_chrome_info)
saveDb(txdb,file = "/home/bio/biosoft/riboseq_data/references/Homo_sapiens_GRCh38_101_GFP/Homo_sapiens.GRCh38.110._ensembl.gtf.db")

the txdb file is successfully conducted at ["tx"] ["mrna"] ["cds"] but lost ["leaders"]. maybe because I didn't provide 5'UTR and 3'UTR information for these four genes ?

["mrna] info is like:

a2 <- loadRegion(txdb, "mrna")
> a2[111077:111080]
 GRangesList object of length 4:
 $GFP
 GRanges object with 1 range and 3 metadata columns:
      seqnames    ranges strand |   exon_id   exon_name exon_rank
         <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
  [1]      GFP     1-720      + |    798109        <NA>         1
  -------
  seqinfo: 198 sequences (1 circular) from an unspecified genome

 $EGFP
GRanges object with 1 range and 3 metadata columns:
      seqnames    ranges strand |   exon_id   exon_name exon_rank
         <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
  [1]     EGFP     1-732      + |    798110        <NA>         1
  -------
  seqinfo: 198 sequences (1 circular) from an unspecified genome

 $NS2A
GRanges object with 1 range and 3 metadata columns:
      seqnames    ranges strand |   exon_id   exon_name exon_rank
         <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
  [1]     NS2A     1-711      + |    798111        <NA>         1
  -------
  seqinfo: 198 sequences (1 circular) from an unspecified genome

 $NS4B
GRanges object with 1 range and 3 metadata columns:
      seqnames    ranges strand |   exon_id   exon_name exon_rank
         <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
  [1]     NS4B     1-855      + |    798112        <NA>         1
  -------
  seqinfo: 198 sequences (1 circular) from an unspecified genome

> a2[111076:111080]
GRangesList object of length 5:
 $ENST00000612315
GRanges object with 1 range and 3 metadata columns:
        seqnames      ranges strand |   exon_id       exon_name exon_rank
           <Rle>   <IRanges>  <Rle> | <integer>     <character> <integer>
  [1] KI270713.1 31698-32528      - |    798108 ENSE00003739404         1
  -------
  seqinfo: 198 sequences (1 circular) from an unspecified genome

 $GFP
GRanges object with 1 range and 3 metadata columns:
      seqnames    ranges strand |   exon_id   exon_name exon_rank
         <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
  [1]      GFP     1-720      + |    798109        <NA>         1
  -------
  seqinfo: 198 sequences (1 circular) from an unspecified genome

 $EGFP
GRanges object with 1 range and 3 metadata columns:
      seqnames    ranges strand |   exon_id   exon_name exon_rank
         <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
  [1]     EGFP     1-732      + |    798110        <NA>         1
  -------
  seqinfo: 198 sequences (1 circular) from an unspecified genome

 $NS2A
GRanges object with 1 range and 3 metadata columns:
      seqnames    ranges strand |   exon_id   exon_name exon_rank
         <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
  [1]     NS2A     1-711      + |    798111        <NA>         1
  -------
  seqinfo: 198 sequences (1 circular) from an unspecified genome

 $NS4B
GRanges object with 1 range and 3 metadata columns:
      seqnames    ranges strand |   exon_id   exon_name exon_rank
         <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
  [1]     NS4B     1-855      + |    798112        <NA>         1
  -------
  seqinfo: 198 sequences (1 circular) from an unspecified genome

Then, I confirmed that there are reads mapped into these genes from my files like Ribo-Seq_A2_1_Aligned.sortedByCoord.out.bam et. however, after quantification by countTable_regions, all of there genes are 0, like :

> res[which(res$id %in% c("GFP","EGFP","NS2A","NS4B")),]
             contrast Regulation   id rna rfp te rna.padj rfp.padj te.padj rna.meanCounts rfp.meanCounts
1: Comparison: A vs B  No change  GFP  NA  NA NA       NA       NA      NA              0              0
2: Comparison: A vs B  No change EGFP  NA  NA NA       NA       NA      NA              0              0
3: Comparison: A vs B  No change NS2A  NA  NA NA       NA       NA      NA              0              0
4: Comparison: A vs B  No change NS4B  NA  NA NA       NA       NA      NA              0              0

I'm really confused. Can you give me some advice?

notebly, the added gtf files is not same with real genes but cds region which binding to plasmid, is it related to the result?

sincerely !

Roleren commented 10 months ago

First question, about leaders and trailers. Yes, as long as gtf specifies an exon and CDS starts at same coordinate as first exon, there is no leader or trailer for those transcripts. To make fake, just assign in the gtf some coordinates before and after your cds. This is of course not possible if sequences are from single orf sequences (first nucleotide is start site of cds etc). But then there would not make any sense to have leaders anyway.
Second question, so from the count table it finds the transcripts, but the count is 0.

Have your verified that this gets counts ? countOverlaps(a2[111077:111080], rfp) # rfp is riboseq data loaded (mapped against genome with the viral sequences)

If it does not, then it means no reads map there. Let me know the output of countOverlaps.

Shiywa commented 10 months ago

no, there no reads mapped.

> countOverlaps(a2[111077:111080], RFP_A_r1)
 GFP EGFP NS2A NS4B 
   0    0    0    0 
> countOverlaps(a2[111077:111080], RFP_A_r2)
 GFP EGFP NS2A NS4B 
   0    0    0    0 
> countOverlaps(a2[111077:111080], RFP_B_r1)
 GFP EGFP NS2A NS4B 
   0    0    0    0 
> countOverlaps(a2[111077:111080], RFP_B_r2)
 GFP EGFP NS2A NS4B 
   0    0    0    0

however, I found reads mapped these four genes in .bam files.

what happened ?

sincerely !

Roleren commented 10 months ago

These are old ribo-seq files then, so you must have made new bam files, but not new of the other file formats ?

First Delete your "QC_STATS", "ofst" and "pshifted" folders of the experiment (inside the aligned folder), if you had things here before it will load this instead of using the "new" bam files.

Then do this:

rm(list=ls())
df <- read.experiment("your_exp")
QCreport(df) # Will now make report not using old ofst files or count tables

Shiywa commented 10 months ago

yes, you are right. we have to delete ofst files. after delete it, reads number is right

Roleren commented 10 months ago

Good, any more questions on that? else I close the issue :)

Shiywa commented 10 months ago

no

Roleren / ORFik

how to make a useful .txdb for quantification #153