Pool-genes option - Githubissues

marypiper commented 3 years ago

Hi,

Thank you for your work on this tool and the nice tutorials detailing its use. I have a question regarding the use of the --pool-genes option when building the ioi/ioe files from loci-based annotations. In the documentation it seems this option is clearly necessary when using non-loci-based annotations (e.g. RefSeq and UCSC genes), while the manual indicates that this is not a severe problem for loci-based annotations. However, the manual also states "Using the --pool-genes option is also advisable to use with Ensembl and Gencode."

I am using a Wormbase annotation, which is loci-based. Do you always advise using the --pool-genes option? If not, under what circumstances would you suggest using this option when your annotation file is loci-based?

EduEyras commented 3 years ago

Hi,

thanks for your email.

In human Gencode, there used to be overlapping transcripts on the same strand that share exons but are annotated as different genes, because their transcription start sites differ, because they are annotated as processed pseudogenes, or because they produce different proteins even though they overlap. If you want to consider these possible variations in RNA processing as well, it could be interesting to include them.

I did a quick check with Gencode v27. Of the 182938 events calculated in total, 79258 would involve more than one gene if you used the pool-genes option. And you find them in all event types: A3: 5580, A5: 5390, AF: 36343, AL: 11721, MX: 2473, RI: 2112, SE: 15639.

Even after removing pseudogenes there still seem to be many: A3: 4938, A5: 4740, AF: 32489, AL: 10088, MX: 2144, RI: 1935, SE: 13334,

so it might be worth trying, to see whether there are RNA processing events happening that you might otherwise miss.
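As a quick sanity check on the figures quoted above, the per-type counts can be summed and compared with the totals. This is only an arithmetic sketch in Python; the variable names are made up for illustration, and the numbers are copied from the comment, not recomputed from Gencode.

```python
# Per-event-type counts quoted in the comment (Gencode v27, with pooling).
EVENT_TYPES = ["A3", "A5", "AF", "AL", "MX", "RI", "SE"]
multi_gene = dict(zip(EVENT_TYPES, [5580, 5390, 36343, 11721, 2473, 2112, 15639]))
multi_gene_no_pseudo = dict(zip(EVENT_TYPES, [4938, 4740, 32489, 10088, 2144, 1935, 13334]))

total_events = 182938  # all events calculated, as quoted

# Sum the per-type counts and express the multi-gene events as a share of all events.
n_multi = sum(multi_gene.values())
n_multi_no_pseudo = sum(multi_gene_no_pseudo.values())
share_multi = n_multi / total_events

print(f"multi-gene events: {n_multi} ({share_multi:.1%} of {total_events})")
print(f"multi-gene events without pseudogenes: {n_multi_no_pseudo}")
for etype in EVENT_TYPES:
    kept = multi_gene_no_pseudo[etype] / multi_gene[etype]
    print(f"{etype}: {kept:.1%} remain after removing pseudogenes")
```

The per-type counts add up to the quoted 79258, which is a bit over 40% of all events.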

I hope this helps

E.


-- Prof. E Eyras EMBL Australia Group Leader The John Curtin School of Medical Research - Australian National University https://github.com/comprna http://scholar.google.com/citations?user=LiojlGoAAAAJ

marypiper commented 3 years ago

Thank you @EduEyras for the quick and detailed response! I appreciate you taking the time to investigate.

danphillips28 commented 3 years ago

Since this thread is still open I thought I'd drop in a question about pool-genes. I've been wondering, and I expect I just haven't thought hard enough about this, but how does one report pooled genes? E.g., say I find 200 significant DAS events. I'd want to say something like "... found 200 DAS events from x genes". Is each gene in the pool counted individually? Or is it the collective that's counted?

EduEyras commented 3 years ago

Hi Dan,

thanks for the question.

The pool-genes function simply implements the automatic definition of genes from transcript units as it's done in Ensembl (without taking into account exceptions that the manual curators might add ad hoc).

It takes all transcripts, checks which ones overlap on the same strand and share at least one splice site, and gives them a common gene_id label, so SUPPA will consider all those transcripts as part of the same gene locus and calculate the local AS events that they define.
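The pooling rule described above could be sketched roughly as follows. This is an illustrative Python sketch, not SUPPA's actual code: the function names, the input layout (a dict of transcript id to chromosome, strand and exon list) and the `locus_N` labels are all assumptions. It relies on the fact that two transcripts sharing a splice site on the same chromosome and strand necessarily overlap.

```python
from collections import defaultdict

def splice_sites(exons):
    """Return the internal exon boundaries of one transcript.

    exons: list of (start, end) pairs. The two boundaries flanking each
    intron are kept, tagged by which side of the intron they are on.
    """
    exons = sorted(exons)
    sites = set()
    for (_, upstream_end), (downstream_start, _) in zip(exons, exons[1:]):
        sites.add(("exon_end", upstream_end))
        sites.add(("exon_start", downstream_start))
    return sites

def pool_transcripts(transcripts):
    """Group transcripts that share a splice site on the same strand.

    transcripts: dict of id -> (chrom, strand, exons)  (hypothetical layout)
    Returns a dict of id -> locus label such as "locus_1".
    """
    parent = {tid: tid for tid in transcripts}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Index transcripts by (chrom, strand, splice site).
    by_site = defaultdict(list)
    for tid, (chrom, strand, exons) in transcripts.items():
        for site in splice_sites(exons):
            by_site[(chrom, strand, site)].append(tid)

    # Transcripts listed under the same key share that splice site: merge them.
    for tids in by_site.values():
        for other in tids[1:]:
            parent[find(other)] = find(tids[0])

    # Give every connected group one label, in sorted id order.
    labels, result = {}, {}
    for tid in sorted(transcripts):
        root = find(tid)
        labels.setdefault(root, f"locus_{len(labels) + 1}")
        result[tid] = labels[root]
    return result

example = {
    "t1": ("chr1", "+", [(100, 200), (300, 400)]),
    "t2": ("chr1", "+", [(150, 200), (350, 450)]),  # shares exon end 200 with t1
    "t3": ("chr1", "-", [(100, 200), (300, 400)]),  # same coordinates, other strand
    "t4": ("chr1", "+", [(500, 600), (700, 800)]),  # no shared splice site
}
pooled = pool_transcripts(example)
print(pooled)
```

In this toy input only t1 and t2 end up in the same locus. Single-exon transcripts have no splice sites, so under this sketch they always stay on their own.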

We added this so that one can work with, e.g., the RefSeq annotation, which only defines transcript units (NM_...), or with results from mapping cDNAs or long reads to the genome.

So one could say, "... we identified N alternative splicing events from M gene loci" or something like that.

The code is available, so you could also state that "gene loci were defined using the SUPPA --pool-genes function".
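To get the N and M for a sentence like the one suggested above, one could count distinct events and distinct locus labels. The Python sketch below assumes that each event ID starts with the gene (locus) label followed by a semicolon; the helper name and the example IDs are made up for illustration.

```python
def summarize_events(event_ids):
    """Count distinct events and the distinct gene loci they come from.

    Assumes each event ID has the form "<locus label>;<rest of the event>",
    so the locus label is everything before the first semicolon.
    """
    events = set(event_ids)
    loci = {event_id.split(";", 1)[0] for event_id in events}
    return len(events), len(loci)

# Hypothetical significant events; labels and coordinates are invented.
significant = [
    "locusA;SE:chr1:100-200:300-400:+",
    "locusA;A5:chr1:100-200:150-200:+",
    "locusB;RI:chr2:500:600-700:800:-",
]
n_events, n_loci = summarize_events(significant)
print(f"we identified {n_events} alternative splicing events from {n_loci} gene loci")
```

Counted this way, a pooled locus contributes once to M no matter how many annotated genes were merged into it, which matches the "N events from M gene loci" wording.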

Is this what you were asking about?

Please let me know if you have any further questions

best

Eduardo


EduEyras commented 2 years ago

Closing issue, as no more questions were raised on this topic.

comprna / SUPPA

Pool-genes option #117