atarashansky / SAMap

SAMap: Mapping single-cell RNA sequencing datasets from evolutionarily distant organisms.
MIT License

"IndexError: boolean index did not match indexed array along dimension 0" in SAMAP call #102

Closed lb15 closed 1 year ago

lb15 commented 1 year ago

Hi,

I'm mapping with the names parameter in SAMAP but running into an error. I think it comes from matching the FASTA IDs in the maps output of the BLAST script against the IDs in the sams and/or names objects. I'm new to Python, though, so I'm struggling to pinpoint which file or object is incorrect. I would appreciate any help. Thank you!

sm = SAMAP(
        sams,
        f_maps = 'prot_comp/maps/',
        names = { 'sc' : map1, 'mm' : map2}
)
Not updating the manifold...
Not updating the manifold...
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/var/folders/ms/pw8bch5j6d7d_rs1pbmm07pc0000gn/T/ipykernel_62031/2773541456.py in <module>
      2         sams,
      3         f_maps = 'prot_comp/maps/',
----> 4         names = { 'sc' : map1, 'mm' : map2}
      5 )

~/anaconda3/envs/SAMap/lib/python3.7/site-packages/samap/mapping.py in __init__(self, sams, f_maps, names, keys, resolutions, gnnm, save_processed, eval_thr)
    144             if names is not None:
    145                 gnnm, gns_dict, gns  = _coarsen_blast_graph(
--> 146                     gnnm, gns, names
    147                 )
    148 

~/anaconda3/envs/SAMap/lib/python3.7/site-packages/samap/mapping.py in _coarsen_blast_graph(gnnm, gns, names)
   1200 
   1201     DF=pd.DataFrame(data=xgyg[filt][:,None],columns=['key'])
-> 1202     DF['val']=da[filt]
   1203 
   1204     dic = df_to_dict(DF,key_key='key')

IndexError: boolean index did not match indexed array along dimension 0; dimension is 2739934 but corresponding boolean dimension is 2739952

I process the SAM objects like this. One is already preprocessed (converted from a Seurat object); the other is raw data from 10x files.

sam1=SAM()
sam1.load_data(filename1)
sam1.run(batch_key="orig.ident")

sam2=SAM()
sam2.load_data(filename2)
sam2.preprocess_data()
sam2.run(batch_key="batch")

sams = {'sc':sam2,'mm':sam1}

And here is what the names and maps files look like:

map1
[('SCA00001', 'Unchar_1'),
 ('SCA00002', 'Unchar_2'),
 ('SCA00003', 'Unchar_3'),
 ('SCA00004', 'Jhamt-1'),
 ('SCA00005', 'Unchar_4'),
 ('SCA00006', 'Unchar_5'),
 ('SCA00007', 'Jhamt-2'),
 ('SCA00008', 'Rtbs-1'),
map2
[('ENSMUSP00000080991', 'mt-Nd1'),
 ('ENSMUSP00000080992', 'mt-Nd2'),
 ('ENSMUSP00000080993', 'mt-Co1'),
 ('ENSMUSP00000080994', 'mt-Co2'),
 ('ENSMUSP00000080995', 'mt-Atp8'),
 ('ENSMUSP00000080996', 'mt-Atp6'),
 ('ENSMUSP00000080997', 'mt-Co3'),
head sc_to_mm.txt
SCA00031    ENSMUSP00000030257  21.824  307 179 11  1   257 855 1150    1.82e-09    59.7
SCA00031    ENSMUSP00000095568  21.824  307 179 11  1   257 855 1150    2.12e-09    59.3
SCA00033    ENSMUSP00000108078  32.653  686 391 19  352 982 257 926 2.79e-92    317
SCA00033    ENSMUSP00000028259  32.653  686 391 19  352 982 306 975 3.11e-92    317
SCA00033    ENSMUSP00000042433  30.257  661 403 18  356 982 245 881 3.54e-79    279
head mm_to_sc.txt
ENSMUSP00000143313  SCA32111    27.027  111 74  2   6   114 9   114 5.19e-07    47.4
ENSMUSP00000143313  SCA32111    27.027  111 74  2   6   114 9   114 5.41e-07    47.4
ENSMUSP00000143313  SCA32111    27.027  111 74  2   6   114 9   114 5.85e-07    47.4

and the adata objects:

adata1.var_names
Index(['Xkr4', 'Gm37381', 'Rp1', 'Mrpl15', 'Lypla1', 'Gm37988', 'Tcea1',
       'Rgs20', 'Gm16041', 'Atp6v1h',
       ...
       'Pcdha11.1', 'Gm17732', '1700120E14Rik', 'Gal3st3', 'Olfr1489', 'Dkk1',
       'Lipo2', 'Cyp2c55', 'Golga7b', 'Pax2'],
      dtype='object', length=20792)

adata2.var_names

Index(['Unchar_1', 'Unchar_2', 'Unchar_3', 'Jhamt-1', 'Unchar_4', 'Unchar_5',
       'Jhamt-2', 'Rtbs-1', 'Rtbs-2', 'Rtbs-3',
       ...
       'Ska1-2', 'Unchar_16811', 'Agre1-11', 'Agre1-12', 'Nxpe2-5',
       'Unchar_16812', 'Unchar_16813', 'Unchar_16814', 'Unchar_16815',
       'Tdh-4'],
      dtype='object', length=40411)

atarashansky commented 1 year ago

Thanks for reporting this. I think I know what the issue is. I uploaded an attempt at a fix. Can you try installing from the source repo directly and try again?

git clone https://github.com/atarashansky/samap
cd samap
pip install .

Or, if you already have the repo cloned:

cd samap
git pull origin main
pip install .

(Make sure to restart your Python kernel after installing the update so that the imports update properly.)

Please let me know if that didn't resolve your issue.

lb15 commented 1 year ago

Thanks for your quick reply. I created a new conda environment and reinstalled via your instructions, but unfortunately the same error pops up.

atarashansky commented 1 year ago

@lb15 To unblock you as fast as possible, would you mind sharing the two mapping tables, sc_to_mm.txt and mm_to_sc.txt? That's all I need to debug the issue. If you're okay with that, you can email them to me (or send a Google Drive link) at tarashan@stanford.edu.

atarashansky commented 1 year ago

Found the bug! It looks like SAMap doesn't like it when you have duplicate genes provided in the names argument. Fixing it now and I'll update you when to try again!
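
For anyone hitting the same error on an older version, a quick way to check a names list for repeats before calling SAMAP is sketched below. It assumes names entries are (ID, gene) tuples, as in map1 and map2 above. The thread does not say which kind of duplicate triggered the bug, so the sketch reports repeated IDs and also drops exact repeated pairs. The helper names and the toy data are hypothetical and are not part of SAMap.

```python
from collections import Counter


def find_duplicate_ids(pairs):
    """Return IDs (first tuple element) that occur more than once, in first-seen order."""
    counts = Counter(pid for pid, _ in pairs)
    return [pid for pid, n in counts.items() if n > 1]


def drop_exact_duplicates(pairs):
    """Remove repeated (ID, gene) pairs, keeping the first occurrence and the original order."""
    return list(dict.fromkeys(pairs))


# Toy data in the same shape as map1; the repeated row is made up for illustration.
example = [
    ("SCA00001", "Unchar_1"),
    ("SCA00002", "Unchar_2"),
    ("SCA00001", "Unchar_1"),
]

dup_ids = find_duplicate_ids(example)
cleaned = drop_exact_duplicates(example)
print("repeated IDs:", dup_ids)
print("rows before/after de-duplication:", len(example), len(cleaned))
```

If the list of repeated IDs is not empty, inspecting those rows before passing the list as names may help narrow down the problem.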

atarashansky commented 1 year ago

@lb15 Please reinstall from GitHub (or run pip install samap==1.0.13) and try again!

lb15 commented 1 year ago

The issue is fixed, thank you!