EBISPOT / scxa_2_cxg

Error when processing E-MTAB-8698: “Grouper for 'cell_type' not 1-dimensional” #35

Closed: gouttegd closed this issue 2 months ago

gouttegd commented 3 months ago

I ran into the following error when trying to process the E-MTAB-8698 dataset (one of the fly datasets):

$ poetry run python src/bulk_experiments.py --study_filter E-MTAB-8698 --chunk_size 10 --download
INFO:root:Processing study: E-MTAB-8698
INFO:root:Downloading files...
[… OUTPUT TRUNCATED FOR BREVITY …]
Traceback (most recent call last):
  File "/Users/dpg44/Development/Python/scxa_2_cxg/src/bulk_experiments.py", line 90, in <module>
    bulk_process(args.study_filter, args.chunk_size, args.download, args.modified)
  File "/Users/dpg44/Development/Python/scxa_2_cxg/src/bulk_experiments.py", line 75, in bulk_process
    generate_rdf(study_path, ["cluster_nb"], output_path)
  File "/Users/dpg44/Development/Python/scxa_2_cxg/src/bulk_experiments.py", line 30, in generate_rdf
    aea.analyzer_manager.co_annotation_report()
  File "/Users/dpg44/Library/Caches/pypoetry/virtualenvs/scxa-kg-yaacZmVL-py3.12/lib/python3.12/site-packages/pandasaurus_cxg/anndata_analyzer.py", line 138, in co_annotation_report
    AnndataAnalyzer._assign_predicate_column(co_oc, field_name_1, field_name_2)
  File "/Users/dpg44/Library/Caches/pypoetry/virtualenvs/scxa-kg-yaacZmVL-py3.12/lib/python3.12/site-packages/pandasaurus_cxg/anndata_analyzer.py", line 214, in _assign_predicate_column
    co_oc.groupby(field_name_1, observed=True)[field_name_2].apply(list).to_dict()
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dpg44/Library/Caches/pypoetry/virtualenvs/scxa-kg-yaacZmVL-py3.12/lib/python3.12/site-packages/pandas/core/frame.py", line 9183, in groupby
    return DataFrameGroupBy(
           ^^^^^^^^^^^^^^^^^
  File "/Users/dpg44/Library/Caches/pypoetry/virtualenvs/scxa-kg-yaacZmVL-py3.12/lib/python3.12/site-packages/pandas/core/groupby/groupby.py", line 1329, in __init__
    grouper, exclusions, obj = get_grouper(
                               ^^^^^^^^^^^^
  File "/Users/dpg44/Library/Caches/pypoetry/virtualenvs/scxa-kg-yaacZmVL-py3.12/lib/python3.12/site-packages/pandas/core/groupby/grouper.py", line 1038, in get_grouper
    raise ValueError(f"Grouper for '{name}' not 1-dimensional")
ValueError: Grouper for 'cell_type' not 1-dimensional

This happens during the RDF conversion step. The step that converts the SCXA H5AD file to the CxG schema appears to succeed (no error is reported), but the resulting …_modified.project.h5ad file does not look entirely correct. For example, there are two cell_type columns in the obs section (which I suspect is the cause of the “Grouper for 'cell_type' not 1-dimensional” error above), as well as two cell_type_ontology_term_id columns:

> data = anndata.read_h5ad('E-MTAB-8698_modified.project.h5ad')
> print(data.obs.columns)
Index(['age', 'cell_type', 'development_stage', 'genotype', 'individual',
       'tissue', 'organism', 'sex', 'stimulus', 'strain', 'cell_type',
       'authors_cell_type', 'infect', 'age_ontology',
       'cell_type_ontology_term_id', 'development_stage_ontology_term_id',
       'genotype_ontology', 'individual_ontology', 'tissue_ontology_term_id',
       'organism_ontology_term_id', 'sex_ontology_term_id',
       'stimulus_ontology', 'strain_ontology', 'cell_type_ontology_term_id',
       'authors_cell_type_ontology', 'infect_ontology', 'doublet_score',
       'predicted_doublet', 'n_genes_by_counts', 'log1p_n_genes_by_counts',
       'total_counts', 'log1p_total_counts', 'total_counts_mito',
       'log1p_total_counts_mito', 'pct_counts_mito', 'n_counts', 'n_genes',
       'louvain_resolution_0.1', 'louvain_resolution_0.3',
       'louvain_resolution_0.5', 'louvain_resolution_0.7',
       'louvain_resolution_1.0', 'louvain_resolution_2.0',
       'louvain_resolution_3.0', 'louvain_resolution_4.0',
       'louvain_resolution_5.0', 'assay_ontology_term_id', 'cluster_nb'],
      dtype='object')

And the contents of those columns seem bogus as well:

> print(data.obs['cell_type'])
                             cell_type  cell_type
ERR3833766-GAATAAGGTTCAACCA  cell_type  cell_type
ERR3833766-GATTCAGAGTGACATA  cell_type  cell_type
ERR3833766-GAGTCCGGTCAATGTC  cell_type  cell_type
ERR3833766-AACGTTGGTGTAAGTA  cell_type  cell_type
ERR3833766-GATTCAGTCATTTGGG  cell_type  cell_type
...                                ...        ...
ERR3833765-GTCGGGTAGTACCGGA  cell_type  cell_type
ERR3833765-GGCGTGTAGAAACCGC  cell_type  cell_type
ERR3833765-CTACACGACGCTCTTC  cell_type  cell_type
ERR3833765-GACTCGAAGGTCATCT  cell_type  cell_type
ERR3833765-AGCTCGAAGGTCATCT  cell_type  cell_type

[15040 rows x 2 columns]
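
The suspected link between the duplicated column and the error can be checked in isolation. The following is a minimal sketch on a hypothetical toy DataFrame (the values and the `obs` variable are invented for illustration, not taken from the dataset): when a column label appears twice, selecting it yields a two-column DataFrame rather than a Series, and grouping by that label raises a ValueError.

```python
import pandas as pd

# Hypothetical toy frame that mimics an obs table with a duplicated
# 'cell_type' column; the values are made up for illustration.
obs = pd.DataFrame(
    [["neuron", "neuron", "c1"], ["glia", "glia", "c2"]],
    columns=["cell_type", "cell_type", "cluster_nb"],
)

# Selecting a duplicated label returns a DataFrame, not a Series.
selected = obs["cell_type"]
print(type(selected).__name__, selected.shape)

# Grouping by the duplicated label fails because the grouper is 2-dimensional.
try:
    obs.groupby("cell_type", observed=True)["cluster_nb"].apply(list)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Listing the duplicated labels is a quick sanity check on a converted file.
duplicated = obs.columns[obs.columns.duplicated()].tolist()
print(duplicated)
```

Running the same `columns.duplicated()` check on `data.obs` would be one way to spot this kind of problem right after the conversion step.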

anitacaron commented 3 months ago

This issue was related to issue #32. Previously, I wasn't checking whether a cell_type column already existed before renaming author_cell_type_ontology to cell_type. I have fixed that. Could you double-check it, please?
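
For readers following along, the kind of guard described here might look like the sketch below. It is only an illustration and not the actual change made in scxa_2_cxg: the helper name `rename_if_absent` is hypothetical, and the column names in the usage lines are simply the ones mentioned in this comment.

```python
import pandas as pd


def rename_if_absent(obs: pd.DataFrame, source: str, target: str) -> pd.DataFrame:
    """Rename `source` to `target` only if `target` is not already a column.

    Hypothetical helper: it avoids creating two columns with the same label.
    """
    if source not in obs.columns:
        # Nothing to rename.
        return obs
    if target in obs.columns:
        # Target already exists: keep both columns untouched rather than
        # producing a duplicated label.
        return obs
    return obs.rename(columns={source: target})


# Case 1: the target column already exists, so no rename happens.
with_target = pd.DataFrame(
    {"cell_type": ["neuron"], "author_cell_type_ontology": ["x"]}
)
kept = rename_if_absent(with_target, "author_cell_type_ontology", "cell_type")

# Case 2: the target column is absent, so the rename goes ahead.
without_target = pd.DataFrame({"author_cell_type_ontology": ["x"]})
renamed = rename_if_absent(without_target, "author_cell_type_ontology", "cell_type")
```

Returning the frame unchanged when the target exists is one possible policy; raising an error or logging a warning would be equally reasonable, depending on how the converter should treat such studies.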

gouttegd commented 2 months ago

I just re-tested against E-MTAB-8698. The conversion ended without errors and the cell_type column was populated correctly. Looks all good to me.

Thanks!