Add alternative analysis of our spliced vs. ensembl

jashapiro commented 3 years ago

I did some quick alternative analysis of the gtf and fasta files that we generated, and got some very different looking results, so I wanted to add them here for reference & comparison.

Rather than dealing with the rtracklayer & biostrings directly, I did the simplest thing I could think, which was to just extract the transcript ids from each file and work from there.

The top line result is that the gtf files contain all of the same transcript IDs, but there are some transcripts in the Ensembl fasta file that are not in the Ensembl gtf. I have not yet looked to see what the characteristics of the extra transcripts are, but that would seem the next logical step.

I am filing this as a PR to @allyhawkins branch because that seemed the most appropriate.

jashapiro commented 3 years ago

It looks like the ensembl transcript fasta file includes data from the patch and haplotype data. I think the gtf file that includes these is here, if we want to use it to generate other reference files: ftp://ftp.ensembl.org//pub/release-103/gtf/homo_sapiens/Homo_sapiens.GRCh38.103.chr_patch_hapl_scaff.gtf.gz

I think we would need to include the extended genome reference to do this though... I think we were using the primary assembly, so we might need to sub in ftp://ftp.ensembl.org//pub/release-103/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz which I think includes all the extra goodness.

jashapiro commented 3 years ago

What I am having trouble understanding is how the transcript ID's can be the same across the gtf files, but then there are still unique transcript ID's in the spliced fasta and ensembl fasta?

The "unique" transcript IDs in the spliced fasta are those that are not in the Ensembl fasta, but they are in both gtf files. The part I don't understand is where the transcripts that are in the Ensembl gtf but missing from the Ensembl fasta went... But that is a question for Ensembl, I guess?

As for using toplevel vs primary assembly; I don't think it should make any difference for the non-decoy alignment; the difference will be if we want to include transcripts that are only present on alternate haplotypes or patches. It appears that the ensembl fasta already does that, for what it is worth. It may well be that if we build a transcriptome fasta from the toplevel fasta + chr_patch gtf we will get something that is closer to the downloaded Ensembl fasta.

If not for all those pesky SnRNA-Seq samples, I wouldn't care at all!

allyhawkins commented 3 years ago

Based on our discussions today that we are going to move forward with using the primary assembly fasta to generate our index's rather than try and load and deal with the top level fasta, @jashapiro I think we should go ahead and merge this analysis in so we have it for reference and close #54 for now.

AlexsLemonade / alsf-scpca

Add alternative analysis of our spliced vs. ensembl #74