aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.
Other
186 stars 29 forks source link

Error when overlapping scRNA-seq annotations to cistopic_obj (argmax of an empty sequence) #416

Open leahyekim opened 5 months ago

leahyekim commented 5 months ago

Hello! Thank you so much for developing & maintaining a great tool :)

Could you please help me figure out why I had to replace NaN values for "Seurat_cell_type" to "Unknown" ?

Following the pycisTopic documentation, I tried annotating each cluster of cistopic_obj based on scRNA-seq annotations.

I ran this:

`annot_dict = {}
for resolution in [0.6, 1.2, 3]:
    annot_dict[f"pycisTopic_leiden_10_{resolution}"] = {}
    for cluster in set(cistopic_obj.cell_data[f"pycisTopic_leiden_10_{resolution}"]):
        counts = cistopic_obj.cell_data.loc[
            cistopic_obj.cell_data.loc[cistopic_obj.cell_data[f"pycisTopic_leiden_10_{resolution}"] == cluster].index,
            "Seurat_cell_type"].value_counts()
        annot_dict[f"pycisTopic_leiden_10_{resolution}"][cluster] = f"{counts.index[counts.argmax()]}({cluster})"`

Then I got this error:

ValueError                                Traceback (most recent call last)
Cell In[60], line 8
      4 for cluster in set(cistopic_obj.cell_data[f"pycisTopic_leiden_10_{resolution}"]):
      5     counts = cistopic_obj.cell_data.loc[
      6         cistopic_obj.cell_data.loc[cistopic_obj.cell_data[f"pycisTopic_leiden_10_{resolution}"] == cluster].index,
      7         "Seurat_cell_type"].value_counts()
----> 8     annot_dict[f"pycisTopic_leiden_10_{resolution}"][cluster] = f"{counts.index[counts.argmax()]}({cluster})"

File [~/.conda/envs/scenicplus/lib/python3.11/site-packages/pandas/core/base.py:655](https://ondemand.carc.usc.edu/node/b10-04.hpc.usc.edu/2774/lab/tree/scplus_P1coc_tryagain/~/.conda/envs/scenicplus/lib/python3.11/site-packages/pandas/core/base.py#line=654), in IndexOpsMixin.argmax(self, axis, skipna, *args, **kwargs)
    651         return delegate.argmax()
    652 else:
    653     # error: Incompatible return value type (got "Union[int, ndarray]", expected
    654     # "int")
--> 655     return nanops.nanargmax(  # type: ignore[return-value]
    656         delegate, skipna=skipna
    657     )

File [~/.conda/envs/scenicplus/lib/python3.11/site-packages/pandas/core/nanops.py:93](https://ondemand.carc.usc.edu/node/b10-04.hpc.usc.edu/2774/lab/tree/scplus_P1coc_tryagain/~/.conda/envs/scenicplus/lib/python3.11/site-packages/pandas/core/nanops.py#line=92), in disallow.__call__.<locals>._f(*args, **kwargs)
     91 try:
     92     with np.errstate(invalid="ignore"):
---> 93         return f(*args, **kwargs)
     94 except ValueError as e:
     95     # we want to transform an object array
     96     # ValueError message to the more typical TypeError
     97     # e.g. this is normally a disallowed function on
     98     # object arrays that contain strings
     99     if is_object_dtype(args[0]):

File [~/.conda/envs/scenicplus/lib/python3.11/site-packages/pandas/core/nanops.py:1104](https://ondemand.carc.usc.edu/node/b10-04.hpc.usc.edu/2774/lab/tree/scplus_P1coc_tryagain/~/.conda/envs/scenicplus/lib/python3.11/site-packages/pandas/core/nanops.py#line=1103), in nanargmax(values, axis, skipna, mask)
   1102 values, mask, _, _, _ = _get_values(values, True, fill_value_typ="-inf", mask=mask)
   1103 # error: Need type annotation for 'result'
-> 1104 result = values.argmax(axis)  # type: ignore[var-annotated]
   1105 result = _maybe_arg_null_out(result, axis, mask, skipna)
   1106 return result

ValueError: attempt to get argmax of an empty sequence

I thought this was due to having NaN values in my cistopic_obj. However, from your SCENIC+ seminar (timestamp included), it was mentioned that having NaN values should not be a problem.

When I checked "annot_dict", I noticed that the clusters from pycisTopic were not labelled with my scRNA-seq annotations as shown below:

{'pycisTopic_leiden_10_0.6': {'2': 'P1-Ko(2)',
  '8': 'P1-SC(8)',
  '3': 'P1-Roof(3)'}}

I was expecting to see the whole list but this is what I only got.

Then I manually replaced NaN values in the "sample_id" column just like other cells (I only have one sample), but that gave me the same error as above.

_Alternatively, I also tried replacing the NaN values in "Seurat_cell_type' with "Unknown" by using this code:_

annot_dict = {}
for resolution in [0.6, 1.2, 3]:
    annot_dict[f"pycisTopic_leiden_10_{resolution}"] = {}
    for cluster in set(cistopic_obj.cell_data[f"pycisTopic_leiden_10_{resolution}"]):
        counts = cistopic_obj.cell_data.loc[
            cistopic_obj.cell_data.loc[cistopic_obj.cell_data[f"pycisTopic_leiden_10_{resolution}"] == cluster].index,
            "Seurat_cell_type"
        ].value_counts()

        if not counts.empty:
            annot_dict[f"pycisTopic_leiden_10_{resolution}"][cluster] = f"{counts.index[counts.argmax()]}({cluster})"
        else:
            # Handle cases where counts is empty
            annot_dict[f"pycisTopic_leiden_10_{resolution}"][cluster] = f"Unknown({cluster})"

This worked, and some pycisTopic clusters was labelled as "Unknown." However, I noticed that I did not have anything in the "Seurat_cell_type" column" for some topic in "topic_annot."

topic_annot = topic_annotation(
    cistopic_obj,
    annot_var='Seurat_cell_type',
    binarized_cell_topic=binarized_cell_topic,
    general_topic_thr = 0.2
)

Is this normal? OR is this due to setting some clusters as "Unknown?"

Also, when I printed "Number of DARs found," I didn't see any "Unknown" clusters like below:

Number of DARs found:
---------------------
  P1-HC: 8767
  P1-Ko: 11348
  P1-Roof: 12662
  P1-SC: 16089
  P1-SL: 8229
  P1-SP: 9786

This goes same as "Number of DAGs found."

Despite several hiccups, I was able to run pycisTopic all the way. _Would this be okay for the downstream steps? What could be the reason why I had to replace the NaN values from the "Seurat_celltype" column to "Unknown?"

Thank you so much!

Best, Leah

Version (please complete the following information): Python: 3.11.4 pycisTopic: 2.0a0 SCENIC+: 1.0a1

SeppeDeWinter commented 5 months ago

Hi Leah

Thank you for the kind words! I'm not sure what is going on here. Could you please show cistopic_obj.cell_data

Best,

Seppe

leahyekim commented 5 months ago

Hi Seppe,

Sorry for the delayed response!

My modified cistopic_obj.cell_data looks like this below (I only screenshot the part of it) :

Screen Shot 2024-06-20 at 11 33 00 PM

In the "Seurat_cell_type" column, I had to replace "NaN" to "Unknown." In the "sample_id" column, I also had to replace "NaN" to "P1-coc" to match other cells. Notice that some "Unknown" in the "Seurat_cell_type" column have "Unknown" values in the "pycisTopic_leiden10_0.x" column - some do not.

Only after making these adjustments, I was finally able to run this:

`annot_dict = {}
for resolution in [0.6, 1.2, 3]:
    annot_dict[f"pycisTopic_leiden_10_{resolution}"] = {}
    for cluster in set(cistopic_obj.cell_data[f"pycisTopic_leiden_10_{resolution}"]):
        counts = cistopic_obj.cell_data.loc[
            cistopic_obj.cell_data.loc[cistopic_obj.cell_data[f"pycisTopic_leiden_10_{resolution}"] == cluster].index,
            "Seurat_cell_type"].value_counts()
        annot_dict[f"pycisTopic_leiden_10_{resolution}"][cluster] = f"{counts.index[counts.argmax()]}({cluster})"`

At the end, this led me to have "Unknown.bed" file in "outs/region_sets/DARs_cell_type" directory.

However, I did not save this "modified" cistopic.obj. So the snakemake pipeline will get the original cistopic_obj that has "NaN" values instead of "Unknown."

Would this be okay with running the snakemake pipeline? OR is there any other way to run the code without replacing "NaN" values?

SeppeDeWinter commented 4 months ago

Hi!

I think this should be OK with the Snakemake pipeline.

Best,

Seppe

leahyekim commented 3 months ago

Hi Seppe,

Sorry for the delayed response and thank you so much for your reply. I was able to run through the Snakemake pipeline :)

I'm wondering if I could avoid having those "NaN" values. My current workflow looks like this: Analyze 10x multiome data in Seurat --> identify each cluster in Seurat (because I'm more familiar with Seurat than Scanpy & your sample "cell_data.tsv" file had "Seurat_cell_type" as a column name!) --> Export the cell barcodes and their corresponding cluster ID as "cell_data.tsv" --> Use this "cell_data.tsv" to annotate my cistopic object.

Instead of using Seurat, would identifying each cluster with Scanpy - then exporting "cell_data.tsv" from the Scanpy object help me avoid having "NaN" values in the cistopic object?

Or am I overcomplicating this and having "NaN" values is not a problem?

Best, Leah

uqnsarke commented 3 months ago

Follwoing. I cant filter these "NaN" values.

SeppeDeWinter commented 3 months ago

Hi @leahyekim

I'm not entirely sure where you are getting NaN's. I would suggest manually inspecting all steps to see where these are generated. I guess some cell barcodes in your cistopic object and seurat object are not matching.

All the best,

Seppe