czbiohub-sf / tabula-muris-senis

Tabula Muris Senis
http://tabula-muris-senis.ds.czbiohub.org
BSD 3-Clause "New" or "Revised" License

mm10plus #30

Open hanhyebin opened 3 years ago

hanhyebin commented 3 years ago

Hi,

I see that Tabula Muris Senis used "mm10plus" as the genome reference. I assume it is a modified version of mm10. If so, could you tell me what modifications or adjustments were made?

Thanks!

aopisco commented 3 years ago

@hanhyebin the reference genome is available at s3://czb-tabula-muris-senis/reference-genome/

hanhyebin commented 3 years ago

Thanks, but I wanted to know more specifically how it differs from mm10.

The reason I ask is that I am trying to integrate this dataset with other datasets. If mm10plus differs substantially from mm10, I will need to realign the data to mm10 (which I can do), but if the two differ very little, I can continue to use it as is.

Thank you in advance.

txemaheredia commented 3 years ago

I am also having issues with this.

I have downloaded the .h5ad files for all datasets, and the matrices contain genes that are not present in this reference release.

For example, the droplet-Liver dataset contains the gene "Fam150a". I have just downloaded the reference .tgz from AWS, and "Fam150a" does not appear in gencode.vM19/genes/genes.gtf (neither do the partial strings "Fam150" or "am150").

Out of 20138 genes in the object matrix, there are 2081 genes that do not exist in the gencodeM19 gtf file.

> length(rownames(seu)[!rownames(seu) %in% gencodeM19_genes$gene_name])
[1] 2081
> head(rownames(seu)[!rownames(seu) %in% gencodeM19_genes$gene_name], 20)
 [1] "Fam150a"       "3110035E14Rik" "6030422M02Rik" "4932411L15"    "Gm106"         "Tceb1"        
 [7] "1110058L19Rik" "Bai3"          "Fam123c"       "4632411B12Rik" "6330578E17Rik" "D1Bwg0212e"   
[13] "2610017I09Rik" "2900092D14Rik" "A530098C11Rik" "1700029F09Rik" "4832428D23Rik" "Dnahc7b"      
[19] "Sdpr"          "Obfc2a" 
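For anyone who wants to reproduce this check outside R, here is a minimal Python sketch of the same comparison, using only the standard library. The helper names (`gtf_gene_names`, `missing_genes`) and the toy GTF records are hypothetical and only illustrate the idea. It assumes the GTF stores a `gene_name "..."` attribute in its ninth column, as GENCODE files do.

```python
import re

# Matches the gene_name attribute in the 9th GTF column, e.g. gene_name "Xkr4";
_GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def gtf_gene_names(lines):
    """Collect the distinct gene_name values from an iterable of GTF lines."""
    names = set()
    for line in lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:  # skip malformed or blank lines
            continue
        match = _GENE_NAME.search(fields[8])
        if match:
            names.add(match.group(1))
    return names


def missing_genes(matrix_genes, annotation_genes):
    """Return the matrix genes absent from the annotation, in matrix order."""
    return [g for g in matrix_genes if g not in annotation_genes]


# Toy input (hypothetical records, NOT copied from the real gencode.vM19 file).
toy_gtf = [
    "#description: toy annotation\n",
    'chr1\tHAVANA\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_name "Xkr4";\n',
    'chr1\tHAVANA\tgene\t200\t300\t.\t-\t.\tgene_id "G2"; gene_name "Sox17";\n',
]
toy_matrix_genes = ["Fam150a", "Sox17", "Tceb1"]

annotation = gtf_gene_names(toy_gtf)
print(missing_genes(toy_matrix_genes, annotation))
```

With the real files, the same function could be fed an open file handle, e.g. `with open("gencode.vM19/genes/genes.gtf") as fh: annotation = gtf_gene_names(fh)`, and the matrix gene names would come from the .h5ad object.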

I found a GTF file on the internet when googling for mm10plus ( http://waxmanlabvm.bu.edu/kkarri/G171/ref/updated-usethis-mm10plus-pcg-ercc-lnc-nodups-mcherry/genes/ ). This file does include the genes "Fam150a" and "Tceb1". It does not cover all of the genes present in the object matrix, but it does contain 426 of the 2081 matrix genes that are missing from the gencodeM19 file.

> sum(rownames(seu)[!rownames(seu) %in% gencodeM19_genes$gene_name] %in% mm10plus_genes$gene_name)
[1] 426
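The count above implies that 2081 - 426 = 1655 matrix genes are found in neither file. A small sketch of how the missing genes could be split into the two groups so the unexplained ones can be inspected directly is shown here. The function name and the toy sets are hypothetical. Only "Fam150a" and "Tceb1" are reported in this thread as present in the second GTF; the other two names are placeholders for genes that might still be unmatched.

```python
def split_by_second_annotation(missing, second_annotation_genes):
    """Partition missing genes into those found in a second annotation and the rest."""
    second = set(second_annotation_genes)
    recovered = [g for g in missing if g in second]
    unexplained = [g for g in missing if g not in second]
    return recovered, unexplained


# Toy data (assumption): a few genes missing from the first annotation, and a
# second annotation that is only known to contain the first two of them.
toy_missing = ["Fam150a", "Tceb1", "Sdpr", "Gm106"]
toy_second_annotation = {"Fam150a", "Tceb1"}

recovered, unexplained = split_by_second_annotation(toy_missing, toy_second_annotation)
print(len(recovered), len(unexplained))
```

Listing the `unexplained` group, rather than only counting it, makes it easier to see whether the remaining names look like outdated symbols, ERCC spike-ins, or something else.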

Which annotation was used to create the matrices for these datasets? Am I messing this up big time, or is there a serious mismatch between the data matrices and the gtf files provided?

Thanks in advance