Update reference scripts to make splici.fa

To start addressing #95, we need to create the necessary fasta and tx2gene files for the splici index. I started by adding in just the addition of making the tx2gene_3col.tsv mapping that includes a third column containing information whether or not each transcript is categorized as spliced, unspliced, or ambiguous and then realized that the additional steps indicated in the alevin-fry splici tutorial altered the fasta output. Based on following the suggestions from the tutorial, they include a step to obtain unique sequences only before outputting the fasta. This was something we were not doing prior when creating our spliced_intron.fa. After looking at the output, it appears that about half of the sequences that are input into the spliced_intron.fa are duplicates and able to align to multiple transcripts. When looking at the sequences, they are duplicate, however the transcript ID's are unique. Additionally, all of the transcript ID's associated with duplicate reads are from intronic regions of transcripts.

This could potentially be a problem when using the spliced_intron.fa with any of the tools that would throw out reads that are multi-mapping. Since we have decided to use this index with snRNA-seq samples with alevin-fry right now, this could impact how reads are assigned to each gene, as the full resolution is using an expectation maximization algorithm, while the cr-like mode assigns the read to the UMI with the highest count for that gene. However, it's unclear to me if we are losing sequences and potential information by adding this step. I reached out to Rob for some clarification and have not heard back yet.

More information on alevin-fry quant can be found here for reference.

Since we are interested in testing the splici index regardless, I went ahead and modified the scripts to generate the splici.fa in addition to the index files we already are generating. I added in a short comparison of the transcript ID's in the compare_spliced_ensembl.Rmd notebook we had, but think the real test is going to be does this alter the downstream results, which we will know after our next round of benchmarking. Happy to hear other thoughts on this!

Thanks so much for the comments @rob-p.

@jashapiro I consolidated the steps so we only have 1 spliced_intron.fa (or splici.fa) but kept the naming consistent with how we had previously had it. So the overall changes to the script here is the addition of the tx2gene_3col.tsv output that we will need.

Since we are now modifying and updating how we create our spliced_intron index, we shouldn't need to make any more code changes, but will need to just re-generate the index with this new fasta. I went ahead and created a subfolder within reference/homo_sapiens/fasta/spliced_intron_no_trim and moved the spliced_intron.fa that we had previously generated to that location. I am planning on doing the same for the index's for now.

AlexsLemonade / alsf-scpca

Update reference scripts to make splici.fa #96