Closed nagendraKU closed 1 year ago
Please set ontology_folder in Process_Query to the ontology folder of the PopV GitHub repository, https://github.com/czbiohub/PopV/blob/main/ontology (it will otherwise fail for some organs, as our ontology is newer and contains recently added cell types). The remaining issue is on the obonet side (I don't think one can fix it there; it is an unusual setting for Python to have a non-UTF-8 locale while the file you are trying to read is stored UTF-8 encoded). Closing for now. Please reopen if you run into similar issues with our ontology files.
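A minimal sketch of what that could look like in a Colab cell (the keyword name follows the comment above, the import path and the other arguments are placeholders from the tutorial — check the Process_Query signature of your installed PopV version):

```python
# Clone the repository so the bundled ontology folder is available locally.
!git clone https://github.com/czbiohub/PopV.git

from popv.preprocessing import Process_Query  # import path may differ between PopV versions

adata = Process_Query(
    query_adata,                      # placeholder: your query AnnData
    ref_adata,                        # placeholder: the Tabula Sapiens reference
    ontology_folder="PopV/ontology",  # ontology files shipped with the cloned repository
    # ... remaining tutorial arguments unchanged
).adata
```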
Hi @cane11, thanks for the fast reply! I set the ontology_folder as per your recommendation and I am running into the same problem again.
The obo file loads correctly if I run it before annotate_data:

```python
from popv.annotation import _utils

G = _utils.make_ontology_dag(adata.uns["_cl_obo_file"])
```
However, once I run annotate_data (which crashes with the unicode error), the above snippet also crashes with the same unicode error. So something weird is happening when running annotate_data? I am using a Colab pro account with a standard GPU + high RAM backend.
I understand if this is still an external package (obonet) issue.
I guess you started in a fresh runtime(?). Can you print the output of adata.uns["_cl_obo_file"] before and after running annotate_data? Can you check the md5sum of the cl_obo file before and after running the script? The script shouldn't change the downloaded ontology files, so I am indeed a bit puzzled. Can you replace the cl_obo file with a backup version after it fails in annotate_data? The other thing that could cause the failure is that the encoding switches while the script is running; I need to figure out how to query the current encoding.
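One way to compare the file from within Python (a minimal sketch using only the standard library, so it works even if the shell's locale is already broken):

```python
import hashlib

def md5(path):
    """Return the MD5 hex digest of a file, read in binary mode."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

obo_path = adata.uns["_cl_obo_file"]
print(obo_path, md5(obo_path))   # before annotate_data
# ... run annotate_data ...
print(obo_path, md5(obo_path))   # after annotate_data
```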
Hi @cane11, yes, I started with a fresh runtime.
I have narrowed down the issue to the scVI (and, as a consequence, scANVI) method ("knn_on_scvi"). I was able to run annotate_data successfully when I chose knn_on_scanorama, celltypist, and onclass (I did not try bbknn, rf, and svm). adata.uns["_cl_obo_file"] is the same before and after running annotate_data, and the md5 hashes are identical as well. Only scVI (and scANVI) produces the error raised in this issue.
In one instance, after annotate_data failed with scVI (on a GPU instance), I immediately ran md5sum on the obo file and that failed with an error:

NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968
After the annotate_data failure with the scVI method, I tried setting the locale to en_US.UTF-8 and reading the obo file directly, but the file loading issue persists once scVI training has run in a Colab session. So my guess is that training the scVI model (on the GPU) somehow messes with the locale encoding? Any chance it has something to do with the pretrained model?
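A quick diagnostic along these lines (a generic Python sketch, not part of PopV) is to print what Python reports as the preferred encoding before and after training:

```python
import locale

# Reports 'UTF-8' in a healthy runtime; the failing state in this thread
# reports 'ANSI_X3.4-1968' instead.
print(locale.getpreferredencoding())
```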
I wondered if this was a GPU issue and tried to set adata.uns['_use_gpu'] = 1 to train using the CPU, but annotate_data fails to run with the following error message:

MisconfigurationException: No supported gpu backend found!
Hope this helps with the diagnosis. Thanks again for the quick support!
My initial thought was that the encoding switches (from UTF-8 to ANSI) after some import in scVI or during model loading. Can you verify that importing scVI alone doesn't make it fail? If possible, set mode='retrain' in Process_Query to train all classifiers from scratch. If that works, I will have a look next week at the pretrained models; for now the best option is to train from scratch (it should take about 40 minutes with the GPU enabled). The argument to enable the GPU is use_gpu='0', 'True', or, best case, 'cuda:0'; '1' would require two local GPUs, which you don't have in Colab.
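A hedged sketch of those two settings added to the tutorial's Process_Query call (the keyword names are taken from the comment above and the other arguments are placeholders; verify them against your installed PopV version):

```python
from popv.preprocessing import Process_Query  # import path may differ between PopV versions

adata = Process_Query(
    query_adata,       # placeholder: your query AnnData
    ref_adata,         # placeholder: the Tabula Sapiens reference
    mode="retrain",    # train all classifiers from scratch instead of loading pretrained ones
    use_gpu="cuda:0",  # first (and only) Colab GPU; '1' would address a second GPU
    # ... remaining tutorial arguments unchanged
).adata
```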
I tried debugging this. However, when I run PopV ten times, it happens maybe once, so in this fashion it is not possible to debug. I guess it is a Colab problem and I'm not sure what is causing it. Is it more reproducible in your hands, @nagendraKU?
@cane11 Yes, it happens every time when I choose scVI or scANVI. I also think this is a Colab issue, as I have encountered the A UTF-8 locale is required. Got ANSI_X3.4-1968 error in a Colab notebook when working with scVI (nothing related to PopV), although there I haven't been able to reproduce it.
When I get time, I will try PopV in a local environment. Thanks for all your efforts!
It should be fixed in the newest version of PopV (as it is not fully reproducible for me, rerunning it on your side would be great). Obonet released a new version that allows setting the text encoding, and the current master uses it. Of note, the encoding is still changed to ANSI, which is a well-described Colab issue - https://github.com/deepmind/alphafold/issues/483. As reproducibility is low when rerunning the same script multiple times in the same folder (most likely it is the data download and unzipping before actually running PopV that triggers it), I would recommend using local data and unzipped models if the change to ANSI breaks other things you want to run in the notebook. With local models and data, this behavior did not happen for me.
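If a runtime has already fallen back to ANSI, two generic workarounds are possible (a hedged sketch; neither is an official PopV API):

```python
import locale
import obonet

# 1) Force Python's preferred encoding back to UTF-8 (the monkey-patch discussed
#    in the linked alphafold issue).
locale.getpreferredencoding = lambda do_setlocale=True: "UTF-8"

# 2) Open the ontology with an explicit encoding and pass the file handle to
#    obonet, so reading no longer depends on the locale default.
with open(adata.uns["_cl_obo_file"], encoding="utf-8") as handle:
    graph = obonet.read_obo(handle)
```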
Hi!
Thanks for the recent updates to the repo! I was able to get through the preprocessing steps, but I am running into an issue at the annotate_data step. I am running the updated Tabula Sapiens tutorial on Colab with a high-RAM backend.
I downloaded the ontology files from https://figshare.com/articles/dataset/OnClass_data_minimal/14776281.
Here's the preprocessing setup:
With the anndata from the above step, I run the following code:
Error:
Any help in fixing this error is appreciated!