atarashansky / SAMap

SAMap: Mapping single-cell RNA sequencing datasets from evolutionarily distant organisms.
MIT License
66 stars 19 forks source link

Functional Enrichment with GO terms not easily working #84

Closed philoel closed 2 years ago

philoel commented 2 years ago

Hi again, now that I can get much further through the analysis I'm bumping into an issue with the Functional Enrichment section.

(I'm using SAMap 1.0.2, and I'm working with the vignette's example data and with my own data, in parallel, to make sure it isn't a problem with my own data again)

First: I think the vignette needs to be updated, the columns in your example emapper tables have names which do not match the current emapper annotation column names as noted here: https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.8#annotations-file . This might not be such a problem, except for example the vignette command for KOG enrichment is

FE = FunctionalEnrichment(sm,EGGs, col_key='KOG', keys=keys, align_thr=0.05)

and there is no such column in the (new?) emapper columns. The vignette command works for the example data but not for my own data. For my own data, I selected "COG_category" which seems to have entries similar to the column from the example emapper annotation file, but it gives an error. That error is the same as what comes up in the next issue, which I have copied and pasted the command and the outputs for into a .txt file to be attached.

Second: for me, the GO term enrichment was more attractive than the KOG (COG?) enrichment, but the example data in the vignette did not successfully give a nice result. The attached .txt file shows the command I used to generate the FE object, and its (normal-looking) output, and then the subsequent command

ENRICHMENT_SCORES,NUM_ENRICHED_GENES,ENRICHED_GENES = FE.calculate_enrichment(verbose=True)

which threw some error I can't solve on my own. That's the same error that comes up in the first issue I mentioned, above. Just to be clear, that happens with the example data as well as with my own data.

Does it have to do with the fact that the GO terms are all concatenated and separated by a comma? Do I need to manipulate the annotation table somehow?

I've attached also the top 50 genes from my emapper annotations table, generated with the latest emapper as a tab-delimited .txt file. This is in case you have any questions about the structure of the emapper annotations file for my own data, for example how the GO term or COG_categories columns look.

Thanks again! Phil head_emapper_outs.txt SAMap_GOtermFE_exampledata.txt

philoel commented 2 years ago

Update: I added in the delimiter parameter, and it still failed in the same way.

So, I wrote: FE = FunctionalEnrichment(sm, EGGs, col_key='GOs', keys=keys, align_thr=0.05, delimiter = ',')

And that computed something since it listed off all the pairs of cell types it calculated gene pairs for, but then gave teh same error traceback as I shared earlier.

atarashansky commented 2 years ago

Looking into this... Will report back. So far, I can't see what exactly is going wrong.

The FunctionalEnrichment code is something I wrote as a bespoke pipeline to analyze my own data and so unfortunately it's super brittle. I'm sorry for all the issues you're uncovering! I graduated more than a year ago and unfortunately my former lab has not found another person to help me continue maintaining this package, and as there are a lot of moving parts, things tend to break easily. I appreciate your patience!

atarashansky commented 2 years ago

This is unorthodox, but could you install from the latest source on main? I added a couple print statements outputting the problem variables that I'd like to see from your end.

(installing from source)

git clone https://github.com/atarashansky/SAMap
cd SAMap
pip install .
philoel commented 2 years ago

Not a problem at all, it's a cool tool and I'm just happy to be making progress with it. I guess this is one way to smoothen things out!

Before I re-install, can you clarify by "on main"? I would be happy to re-install in a fresh conda environment using exactly what you specify, or are you asking me to install it outside of conda?

Thanks, Phil

edit I'll just install in a fresh conda for now, so I can get you the error messages you are looking for, and I'll install outside of conda if that's what you were really asking, if you insist ;)

philoel commented 2 years ago

Alright, I installed a new SAMap exactly the way you described, in a new conda environment but from the yaml file of the previous environment since I found it was a bit tricky to track down all the dependencies one-by-one.

I re-ran the following code:

keys = {'ad':'adjusted_annotations_20220408','jp':'adjusted_annotation_20220408', 'am':'adjusted_annotations_20220316'}

FE = FunctionalEnrichment(sm, EGGs, col_key='GOs', keys=keys, align_thr=0.05, delimiter = ',')

and it gave what feels like reasonable output; here is the top lines:

Finding enriched gene pairs...
Finding cluster-specific markers in ad:adjusted_annotations_20220408.
Finding cluster-specific markers in am:adjusted_annotations_20220316.
Finding cluster-specific markers in jp:adjusted_annotation_20220408.
Calculating gene pairs for the mapping: ad;AC __ Cholinergic __ CLDN19+ to am;Onecut-_VACht+_CholAC4-like
Calculating gene pairs for the mapping: ad;AC __ Cholinergic __ CLDN19+ to jp;RGC-like_Onecut+
Calculating gene pairs for the mapping: ad;AC __ Cholinergic __ CLDN19- to am;Onecut-_VACht+_CholAC4-like
Calculating gene pairs for the mapping: ad;AC __ Cholinergic __ CLDN19- to jp;RGC-like_Onecut+
Calculating gene pairs for the mapping: ad;AC __ GABAergic __ <indistinct> to am;ALK+_Opn4+_HC-like
Calculating gene pairs for the mapping: ad;AC __ GABAergic __ <indistinct> to am;Onecut+_VACht+_HC-like
Calculating gene pairs for the mapping: ad;AC __ GABAergic __ <indistinct> to jp;RGC-like_Onecut+
Calculating gene pairs for the mapping: ad;AC __ GABAergic __ FHL2+ to am;ALK+_Opn4+_HC-like

Then, I ran the troublesome next line:

ENRICHMENT_SCORES,NUM_ENRICHED_GENES,ENRICHED_GENES = FE.calculate_enrichment(verbose=True)

and the results I got were the same thing iterated over each of my cell types in the samap object; here is the top 10 lines:

Calculating functional enrichment for cell type ad_AC __ Cholinergic __ CLDN19+
DEBUGGING_LINE
['G' 'O']
[2. 1.]
Calculating functional enrichment for cell type ad_AC __ Cholinergic __ CLDN19-
DEBUGGING_LINE
['G' 'O']
[2. 1.]
Calculating functional enrichment for cell type ad_AC __ GABAergic __ <indistinct>
DEBUGGING_LINE
['G' 'O']
[2. 1.]
Calculating functional enrichment for cell type ad_AC __ GABAergic __ FHL2+
DEBUGGING_LINE
['G' 'O']
[2. 1.]
Calculating functional enrichment for cell type ad_AC __ GABAergic __ Likely_subtypes
DEBUGGING_LINE
['G' 'O']
[2. 1.]
Calculating functional enrichment for cell type ad_AC __ GABAergic __ Melanopsin+
DEBUGGING_LINE
['G' 'O']
[2. 1.]
Calculating functional enrichment for cell type ad_AC __ GABAergic __ NOS.a+
DEBUGGING_LINE
['G' 'O']
[2. 1.]
Calculating functional enrichment for cell type ad_AC __ GABAergic __ NOSb+
DEBUGGING_LINE
['G' 'O']
[2. 1.]
Calculating functional enrichment for cell type ad_AC __ GABAergic __ PTH1R+_likely_subtypes_includingCRH+
DEBUGGING_LINE
['G' 'O']
[2. 1.]
Calculating functional enrichment for cell type ad_AC __ GABAergic __ SIX6+
DEBUGGING_LINE
['G' 'O']
[2. 1.]

I also saved out the three objects generated by this, using

ENRICHMENT_SCORES.to_csv('./adamjp_FunctionalEnrichment_goterms_ENRICHMENT_SCORES.csv')
NUM_ENRICHED_GENES.to_csv('./adamjp_FunctionalEnrichment_goterms_NUM_ENRICHED_GENES.csv')
ENRICHED_GENES.to_csv('./adamjp_FunctionalEnrichment_goterms_ENRICHED_GENES.csv')

And I have attached those outputs here. I can already get a sense for what the code was trying to do, though I don't understand enough to suggest a debugging strategy.

Best, Phil

adamjp_FunctionalEnrichment_goterms_NUM_ENRICHED_GENES.csv adamjp_FunctionalEnrichment_goterms_ENRICHMENT_SCORES.csv adamjp_FunctionalEnrichment_goterms_ENRICHED_GENES.csv

atarashansky commented 2 years ago

To be clear, did this latest attempt fail with the same error? If so, could you post the last 10 lines or so leading up to the failure?

philoel commented 2 years ago

The latest attempt had a different output, it did not list an error this time.

So, the first time (which started this git issue) gave the output found on lines 225 and below of the attached SAMap_GOtermFE_exampledata.txt file (attached in first post, at the top).

This new time, I loaded the same FE object from pickle, then ran the command

ENRICHMENT_SCORES,NUM_ENRICHED_GENES,ENRICHED_GENES = FE.calculate_enrichment(verbose=True)

which then gave the output which I now attach in its entirety to this message. It doesn't seem to be an error message exactly, although looking at the outputs (which I attached to the last message) it's clear that something still went wrong.

Hope that's clearer! Phil

calculate_enrichment_outputs_1Aug2022.txt

atarashansky commented 2 years ago

Ok, now try running this after reinstalling from github (just use the same environment you used last):

keys = {'ad':'adjusted_annotations_20220408','jp':'adjusted_annotation_20220408', 'am':'adjusted_annotations_20220316'}

FE = FunctionalEnrichment(sm, EGGs, col_key='GOs', keys=keys, align_thr=0.05, delimiter = ',')

It should print something out... It seems like it's incorrectly parsing the tables.

philoel commented 2 years ago

Alright, so the last change seems to have enabled it to read the table, based on the output (attached, functionalenrichment_output.txt).

The next step, with code

ENRICHMENT_SCORES,NUM_ENRICHED_GENES,ENRICHED_GENES = FE.calculate_enrichment(verbose=True)

, seems to hang for a long time. It generates the following single line,

Calculating functional enrichment for cell type ad_AC __ Cholinergic __ CLDN19+

and then pauses there for so far up to an hour; I restarted twice in case it was just hanging off a bad connection, but it really seems to be consistently taking forever at this first step. Is this duration normal for the calculations per cell type, or should it proceed much quicker? There's no more informative output for me to copy, however. In the mean time, I'll let it run over night, maybe we get more progress...?

functionalenrichment_output.txt

atarashansky commented 2 years ago

Ok, so it looks like the right go table is getting loaded in...

Update the code again from github, and when it hangs, press the stop (square) button on your jupyter notebook and then show me the output & error traceback. I'd like to see where it's getting stuck.

philoel commented 2 years ago

This time, the outputs generated before I hit stop looked a little different, so I let it run for 5 minutes before deciding if it was hanging again.

This is what I had after 5 minutes:

Calculating functional enrichment for cell type ad_AC __ Cholinergic __ CLDN19+
['-' 'GO:0000003' 'GO:0000015' ... 'GO:2001275' 'GO:2001293' 'GO:2001295']
6775
6775

And after I pressed stop, this is the output and traceback that was printed:

Calculating functional enrichment for cell type ad_AC __ Cholinergic __ CLDN19+
['-' 'GO:0000003' 'GO:0000015' ... 'GO:2001275' 'GO:2001293' 'GO:2001295']
6775
6775
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
/scratch/jobs/46526462/ipykernel_43606/3470833824.py in <module>
----> 1 ENRICHMENT_SCORES,NUM_ENRICHED_GENES,ENRICHED_GENES = FE.calculate_enrichment(verbose=True)
      2 
      3 # Pickle these afterward so I can use them more quickly next time
      4 import pickle
      5 

/path-to-samap/samap_102_1Aug2022/SAMap/samap/analysis.py in calculate_enrichment(self, verbose)
    338                     print(goterms)
    339                     print(goterms.size)
--> 340                     result = GOEA(gi,GENE_SETS,goterms=goterms,fdr_thresh=100,p_thresh=100)
    341 
    342                     lens = np.array([len(np.unique(x.split(';'))) for x in result['genes'].values])

/path-to-samap/samap_102_1Aug2022/SAMap/samap/analysis.py in GOEA(target_genes, GENE_SETS, df_key, goterms, fdr_thresh, p_thresh)
    106             num_iter = min(n,B)
    107             rng = np.arange(b,num_iter+1)
--> 108             probs.append(sum([np.exp(_log_binomial(n,i)+_log_binomial(N-n,B-i) - _log_binomial(N,B)) for i in rng]))
    109         else:
    110             probs.append(1.0)

/path-to-samap/samap_102_1Aug2022/SAMap/samap/analysis.py in <listcomp>(.0)
    106             num_iter = min(n,B)
    107             rng = np.arange(b,num_iter+1)
--> 108             probs.append(sum([np.exp(_log_binomial(n,i)+_log_binomial(N-n,B-i) - _log_binomial(N,B)) for i in rng]))
    109         else:
    110             probs.append(1.0)

/path-to-samap/samap_102_1Aug2022/SAMap/samap/analysis.py in _log_binomial(n, k)
      9     return np.log(np.arange(1,n+1)).sum()
     10 def _log_binomial(n,k):
---> 11     return _log_factorial(n) - (_log_factorial(k) + _log_factorial(n-k))
     12 
     13 def GOEA(target_genes,GENE_SETS,df_key='GO',goterms=None,fdr_thresh=0.25,p_thresh=1e-3):

/path-to-samap/samap_102_1Aug2022/SAMap/samap/analysis.py in _log_factorial(n)
      7 
      8 def _log_factorial(n):
----> 9     return np.log(np.arange(1,n+1)).sum()
     10 def _log_binomial(n,k):
     11     return _log_factorial(n) - (_log_factorial(k) + _log_factorial(n-k))

KeyboardInterrupt: 

Thanks!

philoel commented 2 years ago

Hi Alec! Just checking in on this front.

atarashansky commented 2 years ago

Sorry this hasn't been fixed. It's hard to reproduce without being able to run and debug on my end.

I'm planning on deprecating this functionality anyway to streamline the SAMap package for easier maintenance. There are a few methods outside of SAMap's core functionality that I wrote ad hoc for my own work and chucked in analysis.py, which have been difficult to maintain solo in my spare time, FunctionalEnrichment being one of them.