Closed nagendraKU closed 1 year ago
Please set ontology_folder in Process_Query to the ontology folder of the PopV GitHub repository, https://github.com/czbiohub/PopV/blob/main/ontology (it will otherwise fail for some organs, as our ontology is newer and contains recently added cell types). The remaining issue is on the obonet side (I don't think one can fix it there; it is an unusual setting for Python to have a non-UTF-8 locale while the file you are trying to read is stored UTF-8 encoded). Closing for now. Please reopen if you run into similar issues with our ontology files.
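A minimal sketch of what that could look like in a Colab cell (the keyword name follows the comment above, the import path and the other arguments are placeholders from the tutorial — check the Process_Query signature of your installed PopV version):

```python
# Clone the repository so the bundled ontology folder is available locally.
!git clone https://github.com/czbiohub/PopV.git

from popv.preprocessing import Process_Query  # import path may differ between PopV versions

adata = Process_Query(
    query_adata,                      # placeholder: your query AnnData
    ref_adata,                        # placeholder: the Tabula Sapiens reference
    ontology_folder="PopV/ontology",  # ontology files shipped with the cloned repository
    # ... remaining tutorial arguments unchanged
).adata
```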
Hi @cane11, thanks for the fast reply! I set the ontology_folder as per your recommendation and I am running into the same problem again.
The obo file loads correctly if I run it before annotate_data:

```python
from popv.annotation import _utils

G = _utils.make_ontology_dag(adata.uns["_cl_obo_file"])
```
However, once I run annotate_data (which crashes with the unicode error), the above snippet also crashes with the same unicode error. So something weird is happening when running annotate_data? I am using a Colab pro account with a standard GPU + high RAM backend.
I understand if this is still an external package (obonet) issue.
I guess you started in a fresh runtime(?). Can you print the output of adata.uns["_cl_obo_file"] before and after running annotate_data? Can you check the md5sum of the cl_obo file before and after running the script? The script shouldn't change the downloaded ontology files, so I am indeed a bit puzzled. Can you replace the cl_obo file with a backup version after it fails in annotate_data? The other thing that could cause the failure is that the encoding switches while the script is running; I need to figure out how to query the current encoding.
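One way to compare the file from within Python (a minimal sketch using only the standard library, so it works even if the shell's locale is already broken):

```python
import hashlib

def md5(path):
    """Return the MD5 hex digest of a file, read in binary mode."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

obo_path = adata.uns["_cl_obo_file"]
print(obo_path, md5(obo_path))   # before annotate_data
# ... run annotate_data ...
print(obo_path, md5(obo_path))   # after annotate_data
```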
Hi @cane11, yes, I started with a fresh runtime.
I have narrowed down the issue to the scVI (and, as a consequence, scANVI) method ("knn_on_scvi"). I was able to run annotate_data successfully when I chose knn_on_scanorama, celltypist, and onclass (I did not try bbknn, rf, and svm). adata.uns["_cl_obo_file"] is the same before and after running annotate_data, and the md5 hashes are identical as well. Only scVI (and scANVI) produces the error raised in this issue.
In one instance, after annotate_data failed with scVI (on a GPU instance), I immediately ran md5sum on the obo file and that failed with an error:

NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968
After the annotate_data failure with the scVI method, I tried setting the locale to en_US.UTF-8 and reading the obo file directly, but the file loading issue persists once scVI training has run in a Colab session. So my guess is that training the scVI model (on the GPU) somehow messes with the locale encoding? Any chance it has something to do with the pretrained model?
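A quick diagnostic along these lines (a generic Python sketch, not part of PopV) is to print what Python reports as the preferred encoding before and after training:

```python
import locale

# Reports 'UTF-8' in a healthy runtime; the failing state in this thread
# reports 'ANSI_X3.4-1968' instead.
print(locale.getpreferredencoding())
```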
I wondered if this was a GPU issue and tried to set adata.uns['_use_gpu'] = 1 to train using the CPU, but annotate_data fails to run with the following error message:

MisconfigurationException: No supported gpu backend found!
Hope this helps with the diagnosis. Thanks again for the quick support!
My initial thought was that the encoding switches (from UTF-8 to ANSI) after some import in scVI or during model loading. Can you verify that importing scVI alone doesn't make it fail? If possible, set mode='retrain' in Process_Query to train all classifiers from scratch. If that works, I will have a look next week at the pretrained models; for now the best option is to train from scratch (it should take about 40 minutes with the GPU enabled). The argument to enable the GPU is use_gpu='0', 'True', or, best case, 'cuda:0'; '1' would require two local GPUs, which you don't have in Colab.
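A hedged sketch of those two settings added to the tutorial's Process_Query call (the keyword names are taken from the comment above and the other arguments are placeholders; verify them against your installed PopV version):

```python
from popv.preprocessing import Process_Query  # import path may differ between PopV versions

adata = Process_Query(
    query_adata,       # placeholder: your query AnnData
    ref_adata,         # placeholder: the Tabula Sapiens reference
    mode="retrain",    # train all classifiers from scratch instead of loading pretrained ones
    use_gpu="cuda:0",  # first (and only) Colab GPU; '1' would address a second GPU
    # ... remaining tutorial arguments unchanged
).adata
```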
I tried debugging this. However, when I run PopV ten times, it happens maybe once, so in this fashion it is not possible to debug. I guess it is a Colab problem and I'm not sure what is causing it. Is it more reproducible in your hands, @nagendraKU?
@cane11 Yes, it happens every time when I choose scVI or scANVI. I also think this is a Colab issue, as I have encountered the A UTF-8 locale is required. Got ANSI_X3.4-1968 error in a Colab notebook when working with scVI (nothing related to PopV), although there I haven't been able to reproduce it.
When I get time, I will try PopV in a local environment. Thanks for all your efforts!
It should be fixed in the newest version of PopV (as it is not fully reproducible for me, rerunning it on your side would be great). Obonet released a new version that allows setting the text encoding, and the current master uses it. Of note, the encoding is still changed to ANSI, which is a well-described Colab issue - https://github.com/deepmind/alphafold/issues/483. As reproducibility is low when rerunning the same script multiple times in the same folder (most likely it is the data download and unzipping before actually running PopV that triggers it), I would recommend using local data and unzipped models if the change to ANSI breaks other things you want to run in the notebook. With local models and data, this behavior did not happen for me.
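If a runtime has already fallen back to ANSI, two generic workarounds are possible (a hedged sketch; neither is an official PopV API):

```python
import locale
import obonet

# 1) Force Python's preferred encoding back to UTF-8 (the monkey-patch discussed
#    in the linked alphafold issue).
locale.getpreferredencoding = lambda do_setlocale=True: "UTF-8"

# 2) Open the ontology with an explicit encoding and pass the file handle to
#    obonet, so reading no longer depends on the locale default.
with open(adata.uns["_cl_obo_file"], encoding="utf-8") as handle:
    graph = obonet.read_obo(handle)
```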
Hi!
Thanks for the recent updates to the repo! I was able to get through the preprocessing steps, but I am running into an issue at the annotate_data step. I am running the updated Tabula Sapiens tutorial on Colab with a high-RAM backend.
I downloaded the ontology files from https://figshare.com/articles/dataset/OnClass_data_minimal/14776281.
Here's the preprocessing setup:
With the anndata from the above step, I run the following code:
Error:
Any help in fixing this error is appreciated!