cistrome / MIRA

Python package for analysis of multiomic single cell RNA-seq and ATAC-seq.
52 stars 7 forks source link

model.get_enriched_TFs returns no TFs for any topics despite motif hits in scanning #16

Closed alonmillet closed 1 year ago

alonmillet commented 1 year ago

Thanks a lot for MIRA, it's been very intuitive and a real pleasure to use so far. Unfortunately I've run into an issue and am not entirely sure about the cause. After motif scanning (~35M hits, model trained on ~125k frags) I generate the expected sparse matrix (I checked, and the sparsity is 0.88, so it's certainly sparse but not all zero). When I then run model_ATAC.get_enriched_TFs(data_ATAC, topic_num=2, top_quantile=0.2), for any permutation of topic number or top_quantile cutoff, it looks like it's finding no peaks, and sure enough, the resulting dataframe of enrichments is empty: image

I've had no issues at all with identifying genes or pathways/processes that are activated by RNA topics, so this is unique to the ATAC model. I'm also pretty confident there should be some differentially expressed peaks, given that the topics are very well defined: image

Any ideas what might be going wrong here? Thanks again for the great package!

alonmillet commented 1 year ago

Update - found the reason! The issue was in the innocuous-looking filtering step, mira.utils.subset_factors(data_ATAC, use_factors=[factor for factor in data_RNA.var_names]). The parsed names called by get_enriched_TFs are all-caps, human style, even though the input fasta was from mouse, so all of the TFs got filtered out since there was no match with my RNA data. Rather than finding homologs and filtering that way, I'll just proceed without filtering at all, but any suggestions on how to do this without homolog calling would be appreciated. (Note that the same issue crops up with training the RP model - even though I use mm10, tl.get_distance_to_TSS populates the metadata with human-formatted gene names which obviously mismatch with the highly variant genes I identify from my mouse transcriptome data.)