estepi / ASpli

Analysis of alternative splicing using RNAseq
7 stars 1 forks source link

Issues on BinGenome #14

Open lucaskbobadilla opened 4 years ago

lucaskbobadilla commented 4 years ago

Hello,

I am having some weird results after running binGenome. I am following the example analysis.

First I build the txDb as described using genomicFeature package:

# build a TxDb of the genome (uses genomeFeatures package)

TxDb <- makeTxDbFromGFF( file="WH_augustus_chr_only.genes.gff3", format="gff3")

Import genomic features from the file as a GRanges object ... OK Prepare the 'metadata' data frame ... OK Make the TxDb object ... OK

No problems with that. But when I try to generate the features using:

features <- binGenome(TxDb) # extract features from annotation

I am getting this output:

Why I am not getting any AS bins? Is there something missing in the gff file? or is there any step that I missed in the txDb step?

Thank you!

estepi commented 4 years ago

Hi, Yes. It seems there is something weird in the annotation Can you extract exons from your TxDb object? You can check like this:

EX <- exons(TxDb) head(EX)

thanks a lot

lucaskbobadilla commented 4 years ago

Hi,

Here is the result for it:

`> EX <- exons(TxDb)

EX GRanges object with 168476 ranges and 1 metadata column: seqnames ranges strand | exon_id

| [1] asm.contigs_Scaffold_1 24476-24691 + | 1 [2] asm.contigs_Scaffold_1 25311-25377 + | 2 [3] asm.contigs_Scaffold_1 26086-26114 + | 3 [4] asm.contigs_Scaffold_1 26210-26418 + | 4 [5] asm.contigs_Scaffold_1 28111-28221 + | 5 ... ... ... ... . ... [168472] asm.contigs_Scaffold_9 33870715-33871005 - | 168472 [168473] asm.contigs_Scaffold_9 33873762-33873832 - | 168473 [168474] asm.contigs_Scaffold_9 33873975-33874063 - | 168474 [168475] asm.contigs_Scaffold_9 33874443-33874517 - | 168475 [168476] asm.contigs_Scaffold_9 33874599-33874942 - | 168476 ------- seqinfo: 16 sequences from an unspecified genome; no seqlengths`
estepi commented 4 years ago

Hi, Yes, it seems very complicated how genes and exons are annotated. do you think is any way to simplify? Specially for characteres like "-", "|", etc... For example, remove " asm.contigsScaffold" ,

thanks Estefi

lucaskbobadilla commented 4 years ago

I remove it still getting the same output.

estepi commented 4 years ago

I see. We should simplify in some way, because it is causing troubles for txDb objects. If you want, you can send me part of your gtf and I can try to explore a bit more your problem. thanks

Rohit-Satyam commented 3 years ago

Hi @estepi

I am facing the same issue.

* Number of extracted Genes = 42031
* Number of extracted Exon Bins = 158249
* Number of extracted intron bins = 129751
* Number of extracted trascripts = 42031
* Number of extracted junctions = 129751
* Number of AS bins (not include external) = 0
* Number of AS bins (include external) = 0
* Classified as:
        ES bins = 0     (NaN%)
        IR bins = 0     (NaN%)
        Alt5'ss bins = 0        (NaN%)
        Alt3'ss bins = 0        (NaN%)
        Multiple AS bins = 0    (NaN%)
        classified as:
                        ES bins = 0     (NaN%)
                        IR bins = 0     (NaN%)
                        Alt5'ss bins = 0        (NaN%)
                        Alt3'ss bins = 0        (NaN%)
EX <- exons(TxDb)
> EX
GRanges object with 171782 ranges and 1 metadata column:
                 seqnames        ranges strand |   exon_id
                    <Rle>     <IRanges>  <Rle> | <integer>
       [1]     CH398230.1     5010-5088      + |         1
       [2]     CH398230.1     6709-6764      + |         2
       [3]     CH398230.1     8466-8665      + |         3
       [4]     CH398230.1     8856-8898      + |         4
       [5]     CH398230.1 127667-128488      + |         5
       ...            ...           ...    ... .       ...
  [171778] AAAA02050198.1       21-1187      + |    171778
  [171779] AAAA02050204.1      733-1101      - |    171779
  [171780] AAAA02050217.1       467-598      + |    171780
  [171781] AAAA02050217.1       685-840      + |    171781
  [171782] AAAA02050218.1        49-813      - |    171782

I obtained my gtf from here: ftp://ftp.ensemblgenomes.org/pub/plants/release-48/gtf/oryza_indica/

#!genome-build ASM465v1
#!genome-version ASM465v1
#!genome-date 2011-10
#!genome-build-accession GCA_000004655.2
#!genebuild-last-updated 2010-07
1       bgi     gene    18113   20165   .       +       .       gene_id "BGIOSGA002568"; gene_source "bgi"; gene_biotype "protein_coding";
1       bgi     transcript      18113   20165   .       +       .       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding";
1       bgi     exon    18113   19150   .       +       .       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "1"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding"; exon_id "BGIOSGA002568-TA.1";
1       bgi     CDS     18113   19150   .       +       0       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "1"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding"; protein_id "BGIOSGA002568-PA";
1       bgi     start_codon     18113   18115   .       +       0       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "1"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding";
1       bgi     exon    19344   20165   .       +       .       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "2"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding"; exon_id "BGIOSGA002568-TA.2";
1       bgi     CDS     19344   20162   .       +       0       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "2"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding"; protein_id "BGIOSGA002568-PA";
estepi commented 3 years ago

Dear all,

Can you run this code and tell me what do you find?

library(GenomicFeatures) library(ASpli)

genomeTxDb <- makeTxDbFromGFF("Oryza_indica.ASM465v1.48.gtf")

features <- binGenome(genomeTxDb) head(features@bins) length(features@bins)

I checked Oryza annotation and I have the bins.

thanks a lot for your patience

Rohit-Satyam commented 3 years ago

Hi @estepi

I am getting genomic bins. However, I want to study Intron retention events and other AS events too. The output of your code is as follows

> genomeTxDb <- makeTxDbFromGFF("Oryza_indica.ASM465v1.48.gtf")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.
> features <- binGenome(genomeTxDb)
* Number of extracted Genes = 42031
* Number of extracted Exon Bins = 158249
* Number of extracted intron bins = 129751
* Number of extracted trascripts = 42031
* Number of extracted junctions = 129751
* Number of AS bins (not include external) = 0
* Number of AS bins (include external) = 0
* Classified as:
        ES bins = 0     (NaN%)
        IR bins = 0     (NaN%)
        Alt5'ss bins = 0        (NaN%)
        Alt3'ss bins = 0        (NaN%)
        Multiple AS bins = 0    (NaN%)
        classified as:
                        ES bins = 0     (NaN%)
                        IR bins = 0     (NaN%)
                        Alt5'ss bins = 0        (NaN%)
                        Alt3'ss bins = 0        (NaN%)

> head(features@bins)
GRanges object with 6 ranges and 8 metadata columns:
                       seqnames    ranges strand |         locus         bin
                          <Rle> <IRanges>  <Rle> |   <character> <character>
  BGIOSGA040744:E001 CH398230.1 5010-5088      + | BGIOSGA040744        E001
  BGIOSGA040744:I001 CH398230.1 5089-6708      + | BGIOSGA040744        I001
  BGIOSGA040744:E002 CH398230.1 6709-6764      + | BGIOSGA040744        E002
  BGIOSGA040744:I002 CH398230.1 6765-8465      + | BGIOSGA040744        I002
  BGIOSGA040744:E003 CH398230.1 8466-8665      + | BGIOSGA040744        E003
  BGIOSGA040744:I003 CH398230.1 8666-8855      + | BGIOSGA040744        I003
                         feature        symbol locus_overlap       class
                     <character>   <character>   <character> <character>
  BGIOSGA040744:E001           E BGIOSGA040744             -    external
  BGIOSGA040744:I001           I BGIOSGA040744             -           -
  BGIOSGA040744:E002           E BGIOSGA040744             -           -
  BGIOSGA040744:I002           I BGIOSGA040744             -           -
  BGIOSGA040744:E003           E BGIOSGA040744             -           -
  BGIOSGA040744:I003           I BGIOSGA040744             -           -
                           event      eventJ
                     <character> <character>
  BGIOSGA040744:E001    external    external
  BGIOSGA040744:I001           -           -
  BGIOSGA040744:E002           -           -
  BGIOSGA040744:I002           -           -
  BGIOSGA040744:E003           -           -
  BGIOSGA040744:I003           -           -
  -------
  seqinfo: 2140 sequences from an unspecified genome; no seqlengths
> length(features@bins)
[1] 288000

My concern is IR, ES and other bins show zero percentage. Is it normal to have these bins zero? I ran the examplary gtf and while binning I found these bins to be non-zero

> annFile <- aspliExampleGTF()
> aTxDb <- makeTxDbFromGFF(annFile)
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
>  features <- binGenome( aTxDb )
* Number of extracted Genes = 10
* Number of extracted Exon Bins = 38
* Number of extracted intron bins = 34
* Number of extracted trascripts = 25
* Number of extracted junctions = 23
* Number of AS bins (not include external) = 17
* Number of AS bins (include external) = 17
* Classified as:
        ES bins = 1     (6%)
        IR bins = 3     (18%)
        Alt5'ss bins = 2        (12%)
        Alt3'ss bins = 2        (12%)
        Multiple AS bins = 9    (53%)
        classified as:
                        ES bins = 1     (11%)
                        IR bins = 4     (44%)
                        Alt5'ss bins = 2        (22%)
                        Alt3'ss bins = 1        (11%)

I am trying to understand if none of the bins were assigned as IR or ES or any other AS types, then will I get any IR or ES event being reported?

estepi commented 3 years ago

Hi @Rohit-Satyam, thanks for your reply

I see. No, it is not normal. If there is more than 1 isoform, we assume there are some bins that are supossed to be "alternative" This is our approach. I'm not familiar with your annotation, I should explore more and try to figure out why we are not finding AS bins.

What I would try is to continue with the pipeline and see if you can detect intron retention, even though introns are not labelled as IR. Can you try this in the meantime?

Thanks

Rohit-Satyam commented 3 years ago

Sure. I will give it a shot. Thanks for your prompt response. I really appreciate your time.

Rohit-Satyam commented 3 years ago

Hi @estepi

I had this feeling that such behavior could only happen if for every gene there is exactly a single transcript in the gtf file. I went back to check if that's the issue. And I think in Osativa indica gtf, that is the actual issue. I also tried running the same steps with the Osativa japonica gtf and it runs fine. Can you confirm, if it's the gtf that deserves all the blame

> library(ASpli)
> library(GenomicFeatures)
> genomeTxDb <- makeTxDbFromGFF("Oryza_sativa.IRGSP-1.0.48.gtf")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type stop_codon. This
  information was ignored.
> features <- binGenome(genomeTxDb)
* Number of extracted Genes = 38978
* Number of extracted Exon Bins = 167945
* Number of extracted intron bins = 128442
* Number of extracted trascripts = 45772
* Number of extracted junctions = 126775
* Number of AS bins (not include external) = 3424
* Number of AS bins (include external) = 3463
* Classified as: 
    ES bins = 283   (8%)
    IR bins = 2039  (60%)
    Alt5'ss bins = 345  (10%)
    Alt3'ss bins = 592  (17%)
    Multiple AS bins = 165  (5%)
    classified as:
            ES bins = 29    (18%)
            IR bins = 57    (35%)
            Alt5'ss bins = 31   (19%)
            Alt3'ss bins = 40   (24%)
estepi commented 3 years ago

Hi @Rohit-Satyam, great news ! :)

It can happen because annotation is very young, isnt it?

Anyway, you can still analyze IR despite this miss-annotation and see what happens...And maybe you will be the firts to systematically describe AS events in this specie ;)

thanks for your post