comprna / SUPPA

SUPPA: Fast quantification of splicing and differential splicing
MIT License

DTU #43

Closed wong15w closed 5 years ago

wong15w commented 5 years ago

While trying to run psiPerIsoform, I received an output file in which every isoform event is marked as either 1.0 or NaN. I am using a RefSeq annotation file for this command; could the RefSeq annotation be the cause?

EduEyras commented 5 years ago

Hi,

Is this the RefSeq GTF? We have seen that the IDs RefSeq uses in its GTF and other annotations are not unique per locus: different loci in the genome may carry the same RefSeq transcript ID. In addition, gene loci may not be properly defined when there are multiple transcript variants. The Ensembl GTF is completely different in this respect, since Ensembl annotates gene loci and provides unique transcript and gene IDs.
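A minimal sketch of why such an annotation can produce only 1.0 or NaN. It assumes that psiPerIsoform reports each transcript's abundance divided by the total abundance of all transcripts of its gene; that formula, the function name `psi_per_isoform`, and the toy IDs and values are assumptions for illustration, not SUPPA's code. If every transcript ends up alone in its "gene", the ratio can only be 1.0 (expressed) or NaN (0/0).

```python
import math
from collections import defaultdict


def psi_per_isoform(tpm, gene_of):
    """Relative abundance of each transcript within its gene (assumed formula).

    tpm:     {transcript_id: expression value}
    gene_of: {transcript_id: gene_id}
    """
    gene_total = defaultdict(float)
    for tx, value in tpm.items():
        gene_total[gene_of[tx]] += value

    psi = {}
    for tx, value in tpm.items():
        total = gene_total[gene_of[tx]]
        # An unexpressed gene gives 0/0, reported here as NaN.
        psi[tx] = value / total if total > 0 else math.nan
    return psi


# Hypothetical toy data: three transcripts, two of them from the same locus.
tpm = {"NM_A": 12.0, "NM_B": 4.0, "NM_C": 0.0}

# Badly defined loci: every transcript is its own gene.
one_gene_per_tx = {"NM_A": "gA", "NM_B": "gB", "NM_C": "gC"}
# Transcripts pooled into loci.
pooled = {"NM_A": "locus1", "NM_B": "locus1", "NM_C": "locus2"}

broken = psi_per_isoform(tpm, one_gene_per_tx)
fixed = psi_per_isoform(tpm, pooled)
```

With the one-gene-per-transcript mapping, the two expressed transcripts get 1.0 and the unexpressed one gets NaN; with pooled loci, NM_A and NM_B get 0.75 and 0.25.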

You can still use RefSeq, but you need to run generateEvents with the option -p | --pool-genes.

This reclusters the RefSeq transcripts into genes and labels each gene uniquely per locus (cluster). In addition, transcript IDs that appear multiple times in the genome get the suffixes .1, .2, etc.
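The ID suffixing can be pictured with a short sketch. This is not SUPPA's code: the function name `uniquify_transcript_ids` is hypothetical, and numbering the copies in input order is an assumption, since the order is not specified here.

```python
from collections import Counter


def uniquify_transcript_ids(ids):
    """Append .1, .2, ... to IDs that occur more than once.

    ids: one transcript ID per genomic placement, in input order.
    IDs that occur only once are left unchanged.
    """
    totals = Counter(ids)   # how often each ID occurs overall
    seen = Counter()        # how many copies have been numbered so far
    unique = []
    for tid in ids:
        if totals[tid] > 1:
            seen[tid] += 1
            unique.append(f"{tid}.{seen[tid]}")
        else:
            unique.append(tid)
    return unique


# Hypothetical IDs: NM_000001 is placed at two different loci.
renamed = uniquify_transcript_ids(["NM_000001", "NM_000002", "NM_000001"])
```

Here `renamed` is `["NM_000001.1", "NM_000002", "NM_000001.2"]`.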

The method is a depth-first search over the connections between transcripts that overlap in genomic extent on the same strand and share at least one splice site. That puts RefSeq in a "meaningful" format.

I hope this helps

Eduardo

-- Dr E Eyras

ICREA Research Professor Universitat Pompeu Fabra PRBB, Dr Aiguader 88 Tel: +34 93 316 0502 E08003 Barcelona, Spain Fax: +34 93 316 0550

http://scholar.google.com/citations?user=LiojlGoAAAAJ http://www.researcherid.com/rid/L-1053-2014 http://regulatorygenomics.upf.edu/

wong15w commented 5 years ago

Hi,

So I can use the generateEvents --pool-genes option to get a reformatted GTF file to feed into the psiPerIsoform command?

Wilfred

EduEyras commented 5 years ago

Yes, that's right

Let me know if you have any trouble.

There should be a way to get the modified GTF. If not, we should provide one

Best

E

Dr. Eyras ICREA Research Professor Universitat Pompeu Fabra Barcelona Spain

wong15w commented 5 years ago

Hi there,

I'm still a bit stuck trying to get this to work. I used generateEvents with --pool-genes enabled to obtain the ioi file, but I am unsure how to get a properly reformatted GTF file for use with the psiPerIsoform command.

Wilfred

EduEyras commented 5 years ago

Hi,

I am sending you a standalone script (in Perl) that implements the algorithm to build genes from transcripts when genes are not (well) defined (e.g. in the GTF from UCSC).

This is what --pool-genes implements.

There are small operations related to the handling of GTF/GFF formats, but overall the algorithm does this:

1) Cluster transcripts whose genomic extents overlap on the same strand. This part is quite fast: transcripts are sorted per chromosome and strand, and once sorted, the clustering runs in time linear in the number of transcripts.

2) Take each transcript cluster and link transcripts that share at least one exon splice site. This creates a graph (stored as an adjacency matrix), whose connected components are recovered with a depth-first search. Each connected component of a transcript cluster is defined as a gene. This implementation allows nested genes, genes within introns of other genes, and overlapping genes on opposite strands. This part is also fast, as depth-first search is only run per cluster. The depth-first implementation is non-recursive, to avoid uncontrolled memory growth.
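The two steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Perl script or the --pool-genes code: the input layout (`{id: (chrom, strand, exons)}`), all function names, and the toy coordinates are hypothetical, and adjacency lists are used here instead of an adjacency matrix.

```python
from collections import defaultdict


def splice_sites(exons):
    """Internal exon boundaries only; transcript start and end are excluded."""
    exons = sorted(exons)
    sites = set()
    for (_, up_end), (down_start, _) in zip(exons, exons[1:]):
        sites.add(("end", up_end))        # end of the upstream exon
        sites.add(("start", down_start))  # start of the downstream exon
    return sites


def overlap_clusters(transcripts):
    """Step 1: sort per (chrom, strand), then sweep once to merge overlaps."""
    by_key = defaultdict(list)
    for tid, (chrom, strand, exons) in transcripts.items():
        start = min(s for s, _ in exons)
        end = max(e for _, e in exons)
        by_key[(chrom, strand)].append((start, end, tid))

    clusters = []
    for items in by_key.values():
        items.sort()
        current, current_end = [], None
        for start, end, tid in items:
            if current and start > current_end:  # gap: close the cluster
                clusters.append(current)
                current, current_end = [], None
            current.append(tid)
            current_end = end if current_end is None else max(current_end, end)
        if current:
            clusters.append(current)
    return clusters


def genes_from_cluster(cluster, transcripts):
    """Step 2: connected components by shared splice site, iterative DFS."""
    sites = {tid: splice_sites(transcripts[tid][2]) for tid in cluster}
    adjacency = {
        tid: [o for o in cluster if o != tid and sites[tid] & sites[o]]
        for tid in cluster
    }
    genes, visited = [], set()
    for seed in cluster:
        if seed in visited:
            continue
        stack, component = [seed], []
        visited.add(seed)
        while stack:  # explicit stack instead of recursion
            tid = stack.pop()
            component.append(tid)
            for neighbour in adjacency[tid]:
                if neighbour not in visited:
                    visited.add(neighbour)
                    stack.append(neighbour)
        genes.append(sorted(component))
    return genes


# Hypothetical toy annotation.
transcripts = {
    "tx1": ("chr1", "+", [(100, 200), (300, 400)]),
    "tx2": ("chr1", "+", [(150, 200), (300, 450)]),  # shares sites with tx1
    "tx3": ("chr1", "+", [(220, 240), (260, 280)]),  # nested in tx1's intron
    "tx4": ("chr1", "-", [(100, 200), (300, 400)]),  # opposite strand
}
genes = [
    gene
    for cluster in overlap_clusters(transcripts)
    for gene in genes_from_cluster(cluster, transcripts)
]
```

In this toy case tx1 and tx2 form one gene, while tx3 (nested in an intron, no shared splice site) and tx4 (opposite strand) each form their own. A single-exon transcript has no splice sites in this sketch, so it always ends up as its own gene.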

It does not use any library. Sorry that it is in Perl. The hardest parts to read are probably those that handle the specifics of GTFs and the Perl implementation of references to dictionaries and lists. The code is documented, but let me know if you have any questions.

The program reads its arguments directly, in this order:

perl cluster_GTF_into_gene_loci.pl input.gtf GTF GTF

We should make the --pool-genes algorithm available as a standalone tool in SUPPA, so it can solve cases like yours.

I hope this helps

E.

wong15w commented 5 years ago

Hi there,

Thanks for the help, but I do not know where you sent the .pl file. Was it sent to my email, or is it supposed to be linked in the comment?

Thanks,

Wilfred

EduEyras commented 5 years ago

Hi,

Sorry, perhaps attachments do not go through the system.

Can you send me an email (eduardo.eyras at upf.edu)? I will then send the script to you.

cheers

E.
