aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.
Other
162 stars 27 forks source link

Error when overlapping scRNA-seq annotations to cistopic_obj (argmax of an empty sequence) #416

Open leahyekim opened 2 weeks ago

leahyekim commented 2 weeks ago

Hello! Thank you so much for developing & maintaining a great tool :)

Could you please help me figure out why I had to replace NaN values for "Seurat_cell_type" to "Unknown" ?

Following the pycisTopic documentation, I tried annotating each cluster of cistopic_obj based on scRNA-seq annotations.

I ran this:

`annot_dict = {}
for resolution in [0.6, 1.2, 3]:
    annot_dict[f"pycisTopic_leiden_10_{resolution}"] = {}
    for cluster in set(cistopic_obj.cell_data[f"pycisTopic_leiden_10_{resolution}"]):
        counts = cistopic_obj.cell_data.loc[
            cistopic_obj.cell_data.loc[cistopic_obj.cell_data[f"pycisTopic_leiden_10_{resolution}"] == cluster].index,
            "Seurat_cell_type"].value_counts()
        annot_dict[f"pycisTopic_leiden_10_{resolution}"][cluster] = f"{counts.index[counts.argmax()]}({cluster})"`

Then I got this error:

ValueError                                Traceback (most recent call last)
Cell In[60], line 8
      4 for cluster in set(cistopic_obj.cell_data[f"pycisTopic_leiden_10_{resolution}"]):
      5     counts = cistopic_obj.cell_data.loc[
      6         cistopic_obj.cell_data.loc[cistopic_obj.cell_data[f"pycisTopic_leiden_10_{resolution}"] == cluster].index,
      7         "Seurat_cell_type"].value_counts()
----> 8     annot_dict[f"pycisTopic_leiden_10_{resolution}"][cluster] = f"{counts.index[counts.argmax()]}({cluster})"

File [~/.conda/envs/scenicplus/lib/python3.11/site-packages/pandas/core/base.py:655](https://ondemand.carc.usc.edu/node/b10-04.hpc.usc.edu/2774/lab/tree/scplus_P1coc_tryagain/~/.conda/envs/scenicplus/lib/python3.11/site-packages/pandas/core/base.py#line=654), in IndexOpsMixin.argmax(self, axis, skipna, *args, **kwargs)
    651         return delegate.argmax()
    652 else:
    653     # error: Incompatible return value type (got "Union[int, ndarray]", expected
    654     # "int")
--> 655     return nanops.nanargmax(  # type: ignore[return-value]
    656         delegate, skipna=skipna
    657     )

File [~/.conda/envs/scenicplus/lib/python3.11/site-packages/pandas/core/nanops.py:93](https://ondemand.carc.usc.edu/node/b10-04.hpc.usc.edu/2774/lab/tree/scplus_P1coc_tryagain/~/.conda/envs/scenicplus/lib/python3.11/site-packages/pandas/core/nanops.py#line=92), in disallow.__call__.<locals>._f(*args, **kwargs)
     91 try:
     92     with np.errstate(invalid="ignore"):
---> 93         return f(*args, **kwargs)
     94 except ValueError as e:
     95     # we want to transform an object array
     96     # ValueError message to the more typical TypeError
     97     # e.g. this is normally a disallowed function on
     98     # object arrays that contain strings
     99     if is_object_dtype(args[0]):

File [~/.conda/envs/scenicplus/lib/python3.11/site-packages/pandas/core/nanops.py:1104](https://ondemand.carc.usc.edu/node/b10-04.hpc.usc.edu/2774/lab/tree/scplus_P1coc_tryagain/~/.conda/envs/scenicplus/lib/python3.11/site-packages/pandas/core/nanops.py#line=1103), in nanargmax(values, axis, skipna, mask)
   1102 values, mask, _, _, _ = _get_values(values, True, fill_value_typ="-inf", mask=mask)
   1103 # error: Need type annotation for 'result'
-> 1104 result = values.argmax(axis)  # type: ignore[var-annotated]
   1105 result = _maybe_arg_null_out(result, axis, mask, skipna)
   1106 return result

ValueError: attempt to get argmax of an empty sequence

I thought this was due to having NaN values in my cistopic_obj. However, from your SCENIC+ seminar (timestamp included), it was mentioned that having NaN values should not be a problem.

When I checked "annot_dict", I noticed that the clusters from pycisTopic were not labelled with my scRNA-seq annotations as shown below:

{'pycisTopic_leiden_10_0.6': {'2': 'P1-Ko(2)',
  '8': 'P1-SC(8)',
  '3': 'P1-Roof(3)'}}

I was expecting to see the whole list but this is what I only got.

Then I manually replaced NaN values in the "sample_id" column just like other cells (I only have one sample), but that gave me the same error as above.

_Alternatively, I also tried replacing the NaN values in "Seurat_cell_type' with "Unknown" by using this code:_

annot_dict = {}
for resolution in [0.6, 1.2, 3]:
    annot_dict[f"pycisTopic_leiden_10_{resolution}"] = {}
    for cluster in set(cistopic_obj.cell_data[f"pycisTopic_leiden_10_{resolution}"]):
        counts = cistopic_obj.cell_data.loc[
            cistopic_obj.cell_data.loc[cistopic_obj.cell_data[f"pycisTopic_leiden_10_{resolution}"] == cluster].index,
            "Seurat_cell_type"
        ].value_counts()

        if not counts.empty:
            annot_dict[f"pycisTopic_leiden_10_{resolution}"][cluster] = f"{counts.index[counts.argmax()]}({cluster})"
        else:
            # Handle cases where counts is empty
            annot_dict[f"pycisTopic_leiden_10_{resolution}"][cluster] = f"Unknown({cluster})"

This worked, and some pycisTopic clusters was labelled as "Unknown." However, I noticed that I did not have anything in the "Seurat_cell_type" column" for some topic in "topic_annot."

topic_annot = topic_annotation(
    cistopic_obj,
    annot_var='Seurat_cell_type',
    binarized_cell_topic=binarized_cell_topic,
    general_topic_thr = 0.2
)

Is this normal? OR is this due to setting some clusters as "Unknown?"

Also, when I printed "Number of DARs found," I didn't see any "Unknown" clusters like below:

Number of DARs found:
---------------------
  P1-HC: 8767
  P1-Ko: 11348
  P1-Roof: 12662
  P1-SC: 16089
  P1-SL: 8229
  P1-SP: 9786

This goes same as "Number of DAGs found."

Despite several hiccups, I was able to run pycisTopic all the way. _Would this be okay for the downstream steps? What could be the reason why I had to replace the NaN values from the "Seurat_celltype" column to "Unknown?"

Thank you so much!

Best, Leah

Version (please complete the following information): Python: 3.11.4 pycisTopic: 2.0a0 SCENIC+: 1.0a1

SeppeDeWinter commented 2 weeks ago

Hi Leah

Thank you for the kind words! I'm not sure what is going on here. Could you please show cistopic_obj.cell_data

Best,

Seppe

leahyekim commented 1 week ago

Hi Seppe,

Sorry for the delayed response!

My modified cistopic_obj.cell_data looks like this below (I only screenshot the part of it) :

Screen Shot 2024-06-20 at 11 33 00 PM

In the "Seurat_cell_type" column, I had to replace "NaN" to "Unknown." In the "sample_id" column, I also had to replace "NaN" to "P1-coc" to match other cells. Notice that some "Unknown" in the "Seurat_cell_type" column have "Unknown" values in the "pycisTopic_leiden10_0.x" column - some do not.

Only after making these adjustments, I was finally able to run this:

`annot_dict = {}
for resolution in [0.6, 1.2, 3]:
    annot_dict[f"pycisTopic_leiden_10_{resolution}"] = {}
    for cluster in set(cistopic_obj.cell_data[f"pycisTopic_leiden_10_{resolution}"]):
        counts = cistopic_obj.cell_data.loc[
            cistopic_obj.cell_data.loc[cistopic_obj.cell_data[f"pycisTopic_leiden_10_{resolution}"] == cluster].index,
            "Seurat_cell_type"].value_counts()
        annot_dict[f"pycisTopic_leiden_10_{resolution}"][cluster] = f"{counts.index[counts.argmax()]}({cluster})"`

At the end, this led me to have "Unknown.bed" file in "outs/region_sets/DARs_cell_type" directory.

However, I did not save this "modified" cistopic.obj. So the snakemake pipeline will get the original cistopic_obj that has "NaN" values instead of "Unknown."

Would this be okay with running the snakemake pipeline? OR is there any other way to run the code without replacing "NaN" values?