calico / solo

software to detect doublets
MIT License

Dataset10x Failing to Load #51

Closed drneavin closed 3 years ago

drneavin commented 3 years ago

Hello,

I'm not sure of the best place to file this issue. It arises when using the solo package, but I'm fairly certain the problem lies in the scvi package. I am getting the following output, ending in an error:


[2020-11-16 15:18:18,195] INFO - scvi._settings | 'scvi' logger already has a StreamHandler, set its level to 10.
Cuda is not available, switching to cpu running!
[2020-11-16 15:18:18,202] DEBUG - scvi.dataset.dataset10X | Loading extracted local 10X dataset with custom filename
[2020-11-16 15:18:18,202] INFO - scvi.dataset.dataset10X | Preprocessing dataset
/opt/conda/envs/py36/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /opt/conda/conda-bld/pytorch_1603729021865/work/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
Traceback (most recent call last):
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2891, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1032, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1039, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 1

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/envs/py36/bin/solo", line 33, in <module>
    sys.exit(load_entry_point('solo-sc', 'console_scripts', 'solo')())
  File "/opt/solo/solo/solo.py", line 123, in main
    dense=True)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/scvi/dataset/dataset10X.py", line 156, in __init__
    delayed_populating=delayed_populating,
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/scvi/dataset/dataset.py", line 2026, in __init__
    self.populate()
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/scvi/dataset/dataset10X.py", line 196, in populate
    gene_names = measurements_info[self.measurement_names_column].astype(np.str)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/pandas/core/frame.py", line 2902, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2893, in get_loc
    raise KeyError(key) from err
KeyError: 1

I can reproduce this error in Python by importing scvi and loading the dataset with Dataset10X directly, but I can't identify why it occurs.

The barcodes in this dataset are letters followed by a dash, a number, and sometimes another letter, e.g.:

AACCGCGGTTGGTTTG-16
AACTCAGCACGGTAAG-16
ACCTTTACAACAACCT-5
ACGCAGCCAATGAAAC-9
ACGGAGAGTCAGATAA-9
ACGGGCTGTTTACTCT-14
ACTGATGTCTTGCAAG-4
ACTTTCAGTCTCTTTA-9
AGGTCATCAAACAACA-4D

and the files are produced with umitools_to_mtx from the R scrunchy package. I have a feeling the main problem is somehow related to the fact that these files were not produced directly by the 10x cellranger pipeline. Here's the top of the matrix.mtx file:

%%MatrixMarket matrix coordinate integer general
%
20469 15266 4765018
1 1 1
90 1 1
129 1 1
169 1 1
170 1 13
245 1 1

I'll do some more digging but would love any recommendations or input you have.

Thanks!
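A quick sanity check for files like the ones above is to compare the size line of matrix.mtx with the line counts of the accompanying feature and barcode files. The sketch below assumes the 10x convention that matrix rows are features and columns are barcodes; the function names and the file paths passed in are hypothetical and do not come from solo or scvi.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def mtx_shape(mtx_path):
    """Return (n_rows, n_cols, n_entries) from a MatrixMarket coordinate header."""
    with open_text(mtx_path) as handle:
        for line in handle:
            if line.startswith("%"):
                continue  # skip the banner and any comment lines
            n_rows, n_cols, n_entries = (int(tok) for tok in line.split())
            return n_rows, n_cols, n_entries
    raise ValueError(f"no size line found in {mtx_path}")


def count_lines(path):
    """Count non-empty lines in a text file."""
    with open_text(path) as handle:
        return sum(1 for line in handle if line.strip())


def check_10x_dir(mtx_path, features_path, barcodes_path):
    """Compare matrix dimensions with the feature and barcode line counts."""
    n_rows, n_cols, _ = mtx_shape(mtx_path)
    return {
        "features_match": n_rows == count_lines(features_path),
        "barcodes_match": n_cols == count_lines(barcodes_path),
    }
```

For the header shown above (20469 15266 4765018), this check would expect 20469 feature lines and 15266 barcode lines. A mismatch would point at the files themselves, not at the loader.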

njbernstein commented 3 years ago

Hi @drneavin ,

I'm not sure what the issue is exactly, and you're right that it seems to be an scvi issue. However, they are currently updating their code, so if it's a true bug on their end, things might get tough. They might still have some suggestions about things to try.

Another option is to read your files into Python using scanpy, write an AnnData file, and then run solo on that file. Sorry I'm not of more help.

Taking a quick look at some files I have, the ones you posted seem normal:

nicholas@sci-pvm-nicholas:~$ head matrix.mtx 
%%MatrixMarket matrix coordinate integer general
%metadata_json: {"format_version": 2, "software_version": "3.1.0"}
33646 4745 4759983
33574 1 30
33567 1 69
33566 1 28
33559 1 522
33558 1 50
33551 1 788
33509 1 45
nicholas@sci-pvm-nicholas:~$ head barcodes.tsv 
AAACCCAGTAAGATCA-1
AAACCCATCAGAGCAG-1
AAACGAACACAAATAG-1
AAACGAACAGATTAAG-1
AAACGAAGTTGCCATA-1
AAACGAATCAGGTGTT-1
AAACGCTAGAGTCTTC-1
AAACGCTAGATGTAGT-1
AAACGCTGTCAAGTTC-1
AAACGCTGTGACTCGC-1
nicholas@sci-pvm-nicholas:~$ head features.tsv 
ENSG00000243485 MIR1302-2HG Gene Expression
ENSG00000237613 FAM138A Gene Expression
ENSG00000186092 OR4F5   Gene Expression
ENSG00000238009 AL627309.1  Gene Expression
ENSG00000239945 AL627309.3  Gene Expression
ENSG00000239906 AL627309.2  Gene Expression
ENSG00000241599 AL627309.4  Gene Expression
ENSG00000236601 AL732372.1  Gene Expression
ENSG00000284733 OR4F29  Gene Expression
ENSG00000235146 AC114498.1  Gene Expression

drneavin commented 3 years ago

Thanks for the recommendation @njbernstein!

I just finished checking scanpy loading, and that seems to fail as well. However, some of my other code that uses just scipy to read the matrix works fine, so I think it probably has to do with assumptions about the 10x file structure that are built into these functions.
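The traceback is consistent with one such assumption. The failing line indexes measurements_info[self.measurement_names_column], and KeyError: 1 means the parsed gene/feature table has no column labelled 1. One plausible cause, which this thread does not confirm, is a gene file with a single tab-separated column where the loader expects at least two (ID and name). A minimal sketch for checking this, where column_counts is a hypothetical helper and not part of scvi or scanpy:

```python
import csv
import io


def column_counts(lines, delimiter="\t"):
    """Return the set of field counts seen across the non-empty rows."""
    reader = csv.reader(lines, delimiter=delimiter)
    return {len(row) for row in reader if row}


# Cell Ranger style features.tsv row: ID, name, feature type.
three_columns = io.StringIO("ENSG00000243485\tMIR1302-2HG\tGene Expression\n")

# A file listing only one identifier per line (assumed shape, for illustration).
one_column = io.StringIO("ENSG00000243485\nENSG00000237613\n")

print(column_counts(three_columns))  # {3}
print(column_counts(one_column))     # {1}
```

To check a real file, pass an open file handle instead of the StringIO objects. A result of {1} would mean that any lookup of column 1 has nothing to find.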

I'll let you know if I find a good solution.

drneavin commented 3 years ago

Solved per your recommendation: I created an AnnData object, saved it as a standard h5ad file, and used that as input for solo. Seems to be working now. Thanks for the help!