comprna / SUPPA

SUPPA: Fast quantification of splicing and differential splicing
MIT License
247 stars 59 forks source link

generateEvents option returning empty .ioe files #178

Open ArnavBharti opened 6 months ago

ArnavBharti commented 6 months ago

For input:

  1. I have got a GFF file from PlasmoDB. "PlasmoDB-66_PvivaxP01.gff"
  2. I converted it to GTF file using agat agat_convert_sp_gff2gtf.pl --gff ../PlasmoDB-66_PvivaxP01.gff -o PlasmoDB-66_PvivaxP01.gtf
  3. suppa.py generateEvents -i ../PlasmoDB-66_PvivaxP01.gtf -o AlternativeSplicing/localAS/local -f ioe -e {SE,SS,MX,RI,FL}

Output: empty .ioe files eg.

seqname gene_id event_id    alternative_transcripts total_transcripts

GTF file:

PvP01_API_v2    VEuPathDB   ncRNA_gene  9   63  .   +   .   gene_id "PVP01_API00100"; ID "PVP01_API00100"; description "tRNA Threonine"; ebi_biotype "tRNA";
PvP01_API_v2    VEuPathDB   tRNA    9   63  .   +   .   gene_id "PVP01_API00100"; transcript_id "PVP01_API00100.1"; ID "PVP01_API00100.1"; Parent "PVP01_API00100"; description "tRNA Threonine"; gene_ebi_biotype "tRNA";
PvP01_API_v2    VEuPathDB   exon    9   63  .   +   .   gene_id "PVP01_API00100"; transcript_id "PVP01_API00100.1"; ID "exon_PVP01_API00100.1-E1"; Parent "PVP01_API00100.1";
PvP01_API_v2    VEuPathDB   protein_coding_gene 90  704 .   +   .   gene_id "PVP01_API00200"; ID "PVP01_API00200"; Name "RPS4"; description "apicoplast ribosomal protein S4, putative"; ebi_biotype "protein_coding";
PvP01_API_v2    VEuPathDB   mRNA    90  704 .   +   .   gene_id "PVP01_API00200"; transcript_id "PVP01_API00200.1"; ID "PVP01_API00200.1"; Parent "PVP01_API00200"; description "apicoplast ribosomal protein S4, putative"; gene_ebi_biotype "protein_coding";
PvP01_API_v2    VEuPathDB   exon    90  704 .   +   .   gene_id "PVP01_API00200"; transcript_id "PVP01_API00200.1"; ID "exon_PVP01_API00200.1-E1"; Parent "PVP01_API00200.1";
PvP01_API_v2    VEuPathDB   CDS 90  704 .   +   0   gene_id "PVP01_API00200"; transcript_id "PVP01_API00200.1"; ID "PVP01_API00200.1-p1-CDS1"; Parent "PVP01_API00200.1"; protein_source_id "PVP01_API00200.1-p1";
PvP01_API_v2    VEuPathDB   ncRNA_gene  716 787 .   +   .   gene_id "PVP01_API00300"; ID "PVP01_API00300"; description "tRNA Histidine"; ebi_biotype "tRNA";
EduEyras commented 6 months ago

Hi Arnav,

could you please send me the GTF here: @.*** ?

In principle the format looks fine. But it was generated from a GFF, which has a single ID in column 9. So I am wondering whether your GTF contains multiple transcript IDs associated to the same Gene ID. Without that, SUPPA cannot calculate the AS events.

Thanks

Eduardo

On Wed, 3 Jan 2024 at 14:26, Arnav Bharti @.***> wrote:

For input:

  1. I have got a GFF file from PlasmoDB. "PlasmoDB-66_PvivaxP01.gff"
  2. I converted it to GTF file using agat agat_convert_sp_gff2gtf.pl --gff ../PlasmoDB-66_PvivaxP01.gff -o PlasmoDB-66_PvivaxP01.gtf
  3. suppa.py generateEvents -i ../PlasmoDB-66_PvivaxP01.gtf -o AlternativeSplicing/localAS/local -f ioe -e {SE,SS,MX,RI,FL}

Output: empty .ioe files eg.

seqname gene_id event_id alternative_transcripts total_transcripts

GTF file:

PvP01_API_v2 VEuPathDB ncRNA_gene 9 63 . + . gene_id "PVP01_API00100"; ID "PVP01_API00100"; description "tRNA Threonine"; ebi_biotype "tRNA"; PvP01_API_v2 VEuPathDB tRNA 9 63 . + . gene_id "PVP01_API00100"; transcript_id "PVP01_API00100.1"; ID "PVP01_API00100.1"; Parent "PVP01_API00100"; description "tRNA Threonine"; gene_ebi_biotype "tRNA"; PvP01_API_v2 VEuPathDB exon 9 63 . + . gene_id "PVP01_API00100"; transcript_id "PVP01_API00100.1"; ID "exon_PVP01_API00100.1-E1"; Parent "PVP01_API00100.1"; PvP01_API_v2 VEuPathDB protein_coding_gene 90 704 . + . gene_id "PVP01_API00200"; ID "PVP01_API00200"; Name "RPS4"; description "apicoplast ribosomal protein S4, putative"; ebi_biotype "protein_coding"; PvP01_API_v2 VEuPathDB mRNA 90 704 . + . gene_id "PVP01_API00200"; transcript_id "PVP01_API00200.1"; ID "PVP01_API00200.1"; Parent "PVP01_API00200"; description "apicoplast ribosomal protein S4, putative"; gene_ebi_biotype "protein_coding"; PvP01_API_v2 VEuPathDB exon 90 704 . + . gene_id "PVP01_API00200"; transcript_id "PVP01_API00200.1"; ID "exon_PVP01_API00200.1-E1"; Parent "PVP01_API00200.1"; PvP01_API_v2 VEuPathDB CDS 90 704 . + 0 gene_id "PVP01_API00200"; transcript_id "PVP01_API00200.1"; ID "PVP01_API00200.1-p1-CDS1"; Parent "PVP01_API00200.1"; protein_source_id "PVP01_API00200.1-p1"; PvP01_API_v2 VEuPathDB ncRNA_gene 716 787 . + . gene_id "PVP01_API00300"; ID "PVP01_API00300"; description "tRNA Histidine"; ebi_biotype "tRNA";

— Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/178, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCZKB4MYI243INRJNOFAHLYMTFU7AVCNFSM6AAAAABBK3FNF6VHI2DSMVQWIX3LMV43ASLTON2WKOZSGA3DGMJZGI2TMOA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

ArnavBharti commented 6 months ago

I managed to get the GTF file (instead of converting from GFF).

A few lines were like this: NC_009911.1 RefSeq exon 441005 441169 . - . gene_id "PVX_110843"; transcript_id "XR_003001228.1"; db_xref "GeneID:5471288"; locus_tag "PVX_110843"; note "LSU 5.8S rRNA; O-type"; orig_transcript_id "gnl|WGS:AAKM|mrna.PVX_110843-RA"; product "5.8S ribosomal RNA"; transcript_biotype "rRNA"; exon_number "1"; where the ; in LSU 5.8S rRNA; O-type was split into two causing an IndexError while parsing. I was wondering if replacing ; with ; (removing space) where it occurs inside quotation marks would affect the output. 'Cause by doing this IndexError goes away.

BUT the .ioe file generated is still empty.

genomic_data.gtf.zip

att_dict = dict(map(lambda x: (x[0], x[1].strip('"')), attributes)) <- this line is causing the index error

EduEyras commented 6 months ago

Thanks for sending the GTF file.

The issue is what I suspected. The GTF does not contain multiple transcripts per gene.

Using only the "exon" lines (used by SUPPA), I counted how many genes "gene_id" contain more than 1 "transcript_id", but there are none:

awk '$3=="exon"' genomic_data.gtf | cut -f9 | cut -f1-4 -d" " | sort -u | cut -f1,2 -d" " | sort | uniq -c | awk '$1>1' | wc -l

   0

In these cases, you can use the option -p | --pool-genes when running generateEvents This option will first cluster the transcripts into genes according to exon overlaps in the same strand. It uses a simple definition of gene to be transcripts with exons that overlap on the same strand and share at least one splice-site.

This will create new gene IDs based on the gene_id IDs that are overlapping in genomic loci.

Please try this to see whether there are overlapping transcripts in the annotated genomic loci.

I hope this helps

Best

Eduardo

On Fri, 5 Jan 2024 at 03:04, Arnav Bharti @.***> wrote:

I managed to get the GTF file (instead of converting from GFF).

A few lines were like this: NC_009911.1 RefSeq exon 441005 441169 . - . gene_id "PVX_110843"; transcript_id "XR_003001228.1"; db_xref "GeneID:5471288"; locus_tag "PVX_110843"; note "LSU 5.8S rRNA; O-type"; orig_transcript_id "gnl|WGS:AAKM|mrna.PVX_110843-RA"; product "5.8S ribosomal RNA"; transcript_biotype "rRNA"; exon_number "1"; where the ; in LSU 5.8S rRNA; O-type was split into two causing an IndexError while parsing. I was wondering if replacing ; with ; (removing space) where it occurs inside quotation marks would affect the output. 'Cause by doing this IndexError goes away.

BUT the .ioe file generated is empty.

genomic_data.gtf.zip https://github.com/comprna/SUPPA/files/13832434/genomic_data.gtf.zip

att_dict = dict(map(lambda x: (x[0], x[1].strip('"')), attributes)) <- this line is causing the index error

— Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/178#issuecomment-1877351898, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCZKB6VT4PGWOCB4PDTY43YM3HIZAVCNFSM6AAAAABBK3FNF6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZXGM2TCOBZHA . You are receiving this because you commented.Message ID: @.***>

ArnavBharti commented 6 months ago

Even with pool genes flag the output is still empty.

$ suppa.py generateEvents -i ~/Downloads/genomic.gtf -o local -f ioe -e {SE,SS,MX,RI,FL} --pool-genes

INFO:eventGenerator:Reading input data.
INFO:eventGenerator:Pooling genes
INFO:eventGenerator:Calculating events
INFO:eventGenerator:Done

$ cat local*.ioe > biglocal.ioe $ cat biglocal.ioe

seqname gene_id event_id    alternative_transcripts total_transcripts
seqname gene_id event_id    alternative_transcripts total_transcripts
seqname gene_id event_id    alternative_transcripts total_transcripts
seqname gene_id event_id    alternative_transcripts total_transcripts
seqname gene_id event_id    alternative_transcripts total_transcripts
seqname gene_id event_id    alternative_transcripts total_transcripts
seqname gene_id event_id    alternative_transcripts total_transcripts
EduEyras commented 6 months ago

Hi Arnav,

this means that the GTF annotation that you're using does not contain any alternative transcripts in the annotated genes, or that the variation of those transcripts cannot be expressed in terms of splicing events.

I checked, and there are no alternative transcripts; that is, the annotation is 1-gene:1-transcript. SUPPA needs an annotation with overlapping transcripts to calculate the events.

You could either enhance the annotation with novel transcripts using RNA-seq reads with a tool like StringTie (or StringTie2 or FLAIR if you have long reads), or directly use RNA-seq reads with e.g. rMATs or MAJIQ to directly calculate alternative splicing events.

Another option is if Ensembl has other annotations that may be more transcript rich (https://protists.ensembl.org/info/data/ftp/index.html)

I hope this helps

Eduardo

On Fri, 5 Jan 2024 at 15:55, Arnav Bharti @.***> wrote:

Even with pool genes flag the output is still empty.

$ suppa.py generateEvents -i ~/Downloads/genomic.gtf -o local -f ioe -e {SE,SS,MX,RI,FL} --pool-genes

INFO:eventGenerator:Reading input data. INFO:eventGenerator:Pooling genes INFO:eventGenerator:Calculating events INFO:eventGenerator:Done

$ cat local*.ioe > biglocal.ioe $ cat biglocal.ioe

seqname gene_id event_id alternative_transcripts total_transcripts seqname gene_id event_id alternative_transcripts total_transcripts seqname gene_id event_id alternative_transcripts total_transcripts seqname gene_id event_id alternative_transcripts total_transcripts seqname gene_id event_id alternative_transcripts total_transcripts seqname gene_id event_id alternative_transcripts total_transcripts seqname gene_id event_id alternative_transcripts total_transcripts

— Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/178#issuecomment-1878123857, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCZKB6BPP6N7LEW5CVLSKTYM6BULAVCNFSM6AAAAABBK3FNF6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZYGEZDGOBVG4 . You are receiving this because you commented.Message ID: @.***>