Gene sets in Salmon output differ between experiments

Context

I am attempting to download the raw (unnormalized) Salmon counts for a set of human RNA-seq samples that span multiple studies (experiments) and compile them into a single read count matrix. Using the API, I have created a dataset with two samples from each of two experiments (3b4441fe-1522-4db7-951e-1db9fa9c82bc), with aggregation set to "EXPERIMENT" and quantile_normalize set to False.

Problem or idea

The problem is that the read count tables for the two experiments differ in their gene sets (annotation), and thus the tables cannot simply be joined. Of course, a trivial workaround is to subset to the intersection of the gene sets. But it is surprising that the gene sets differ, given that both were processed by Salmon with the same genome build, and there is no mention of a different version of gene annotation being used. When I check the details of the processing of these experiments, it does appear that versions of Salmon used for the two experiments are different (0.9.1 and 0.13.1), but I would not expect that to change the gene set.

Solution or next step

Ideally, all samples would be processed with the same version of Salmon and the same annotation. If a different annotation was used for different experiments, this should be documented in the processing details for that experiment. If the annotation was actually the same for these two experiments, then some investigation into why the outputs have different sets of genes should be done.

Hi @cndewey, thanks for filing this! I definitely agree that it would be ideal to process all samples with the same version of Salmon.

Between 0.9.1 and 0.13.1, we were required to rebuild all Salmon indices. The indices used for the two experiments in your dataset (ERP108370 and ERP114365) are listed here: https://api.refine.bio/v1/transcriptome_indices/?organism_name=HOMO_SAPIENS&length=LONG

Although I can imagine a situation where differences that we see at the gene-level could arise from changes in Salmon indexing behavior between versions, it's more likely that the difference you're reporting is due to a change in annotation.

The genome build is the same between the two indices (GRCh38) as you noted, but the Ensembl releases used are different as indicated by the source_version field. The 0.9.1 index used Ensembl release 93; the 0.13.1 index used Ensembl release 96. I found that the 693 genes present in the experiment processed with 0.9.1 (ERP108370) only were not present in the genes to transcript mapping in the newer index (0.13.1). A similar pattern is evident at the transcript level (comparing the quant.sf files from the different experiments to GTF files in AnnotationHub). This leads me to conclude that the reported differences come from the change in release.

source_version is also reported as part of the sample results: https://api.refine.bio/v1/samples/?accession_codes=ERR2534073

I think there are a number of areas that we can improve on our side.

First and foremost, the name source_version is less useful/helpful than it could be for discovering this information. Also, the sample modal on the web interface only reports the genome build, not the Ensembl release:

(From a UX point of view - I assume that you did not look at the UI at all when you requested this dataset but please let me know if that is not the case.)

I'm also assuming default behavior for building indices is to use the most recent version of Ensembl, which we should put a bit more thought into.

In any case, information about differences in gene sets should go into an FAQ in our documentation in the meantime. I've filed https://github.com/AlexsLemonade/refinebio-docs/issues/136 to track that documentation change.

Thank you again for bringing this up!

AlexsLemonade / refinebio