AlexsLemonade / alsf-scpca

Management and analysis tools for ALSF Single-cell Pediatric Cancer Atlas data.
BSD 3-Clause "New" or "Revised" License

Comparison of Reference File Generation #73

Closed allyhawkins closed 3 years ago

allyhawkins commented 3 years ago

I'm separating out the analysis started in #68 into its own PR to bring attention to a final decision: should we change how we generate the reference files used to build indexes in the scpca workflows? Currently we download the cdna.fa and ncrna.fa files directly from Ensembl and concatenate them to create a transcriptome.fa file, which is used to generate indexes for Alevin and Kallisto for scRNA-seq pre-processing.
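For reference, the concatenation step described above can be sketched in a few lines of Python. This is only an illustration, not the script used in the repository: the function name `concatenate_fastas` is hypothetical, and it assumes the Ensembl files have already been downloaded and decompressed.

```python
def concatenate_fastas(input_paths, output_path):
    """Concatenate FASTA files, in order, into a single output file.

    Returns the number of records (header lines) written.
    Assumes the inputs are plain-text (uncompressed) FASTA files.
    """
    n_records = 0
    with open(output_path, "w") as out:
        for path in input_paths:
            with open(path) as src:
                for line in src:
                    line = line.rstrip("\n")
                    if not line:
                        # Skip blank lines so records stay contiguous.
                        continue
                    if line.startswith(">"):
                        n_records += 1
                    # Always write a newline, so a file that lacks a
                    # trailing newline cannot fuse with the next header.
                    out.write(line + "\n")
    return n_records


if __name__ == "__main__":
    # File names follow the ones mentioned in this PR description.
    total = concatenate_fastas(["cdna.fa", "ncrna.fa"], "transcriptome.fa")
    print(f"wrote {total} transcripts to transcriptome.fa")
```

Writing line by line, rather than copying raw bytes, guards against an input that does not end with a newline.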

To generate the snRNA-seq reference files for Alevin and Kallisto, we now need to grab the primary_assembly.fa and gtf, subset the spliced mRNA and flanking introns using eisaR::getFeatureRanges, and then output a corresponding fasta, gtf, and transcript-to-gene mapping. The question is whether we can keep our snRNA-seq and scRNA-seq references consistent by generating the main transcriptome.fa the same way, rather than concatenating the Ensembl files.
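To make the transcript-to-gene mapping output concrete, here is a minimal Python sketch that builds one from a GTF file. It is not the eisaR-based workflow from #68 and does not handle the intron features that eisaR adds. The function name `transcript_to_gene` is hypothetical, and the sketch assumes an Ensembl-style GTF with `gene_id` and `transcript_id` attributes on `transcript` rows.

```python
import re

# Matches GTF attribute pairs of the form: key "value"
_ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def transcript_to_gene(gtf_path):
    """Return a dict mapping transcript_id -> gene_id from a GTF file.

    Only rows whose feature type (column 3) is "transcript" are used.
    """
    mapping = {}
    with open(gtf_path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # header / comment lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9 or fields[2] != "transcript":
                continue
            attributes = dict(_ATTRIBUTE.findall(fields[8]))
            transcript_id = attributes.get("transcript_id")
            gene_id = attributes.get("gene_id")
            if transcript_id and gene_id:
                mapping[transcript_id] = gene_id
    return mapping
```

The resulting dictionary can be written out as a two-column, tab-separated file, the general shape of mapping that Alevin and Kallisto-based pipelines take.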

To determine this, I compared the fasta and gtf files that come directly from Ensembl with those produced by running eisaR::getFeatureRanges on the primary_assembly.fa and gtf files. The two sets of files are not the same, and some transcripts that correspond to protein-coding regions are missing. Because of these discrepancies, I think it is in our best interest to keep using our original reference files for scRNA-seq and to apply the eisaR::getFeatureRanges method only to generate the reference files needed for snRNA-seq. That appears to be the best approach right now. If there are no objections, the reference file scripts in #68 will be used to generate only the files needed for snRNA-seq.
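A comparison of this kind can be sketched as a set difference over the transcript IDs in the two fasta files. The snippet below is a simplified illustration, not the notebook from this PR. The function names are hypothetical, and it assumes the transcript ID is the first whitespace-delimited token of each header, with any trailing `.version` suffix stripped so the two files compare on the same key.

```python
def read_transcript_ids(fasta_path):
    """Collect transcript IDs from the header lines of a FASTA file."""
    ids = set()
    with open(fasta_path) as handle:
        for line in handle:
            if not line.startswith(">"):
                continue
            tokens = line[1:].split()
            if not tokens:
                continue  # malformed header with no ID
            # Assumption: drop a trailing version, e.g. "ENST0001.2" -> "ENST0001"
            ids.add(tokens[0].split(".")[0])
    return ids


def compare_transcript_sets(ensembl_fasta, derived_fasta):
    """Report which transcript IDs are unique to each file or shared."""
    ensembl_ids = read_transcript_ids(ensembl_fasta)
    derived_ids = read_transcript_ids(derived_fasta)
    return {
        "only_ensembl": ensembl_ids - derived_ids,
        "only_derived": derived_ids - ensembl_ids,
        "shared": ensembl_ids & derived_ids,
    }
```

Any IDs in `only_ensembl` or `only_derived` could then be looked up in the gtf to check their biotype, for example whether they are protein coding.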

allyhawkins commented 3 years ago

To avoid keeping duplicate analyses, I removed the first version of the notebook, so there is now just one notebook containing the reference file comparisons to merge, and we can close #54.