broadinstitute / gnomad_methods

Hail helper functions for the gnomAD project and Translational Genomics Group
https://gnomad.broadinstitute.org
MIT License

Error of "TypeError: object of type 'NoneType' has no len()" when doing ancestry classification #465

Closed · Jingning-Zhang closed this issue 1 year ago

Jingning-Zhang commented 2 years ago

What you did:

We want to use the gnomAD ancestry classification tool to predict the genetic ancestry of UK Biobank (UKBB) samples (https://gnomad.broadinstitute.org/news/2021-09-using-the-gnomad-ancestry-principal-components-analysis-loadings-and-random-forest-classifier-on-your-dataset/).

What happened:

I get the error "TypeError: object of type 'NoneType' has no len()" at the final prediction step.

Below I attach my full console session (code and output).


Python 3.9.10 (main, Feb 22 2022, 16:34:24) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import os
>>> import sys
>>> os.environ["PYSPARK_SUBMIT_ARGS"]="--total-executor-cores 6 --executor-memory 5g --driver-memory 10g pyspark-shell"
>>> 
>>> import hail as hl

>>> hl.init()
2022-07-01 11:01:00 WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-07-01 11:01:01 WARN  Hail:43 - This Hail JAR was compiled for Spark 3.1.2, running with Spark 3.1.3.
  Compatibility is not guaranteed.
Running on Apache Spark version 3.1.3
SparkUI available at http://compute-106.cm.cluster:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.96-39909e0a396f
LOGGING: writing to /users/jzhang2/hail-20220701-1100-0.2.96-39909e0a396f.log
>>> 
>>> ## Path to input files
>>> bgen_path = "./chrall.bgen"
>>> sample_path = "./chrall.sample"
>>> bgenidx_path = "./chrall.idx2"
>>> path_to_gnomad_loadings = "./gnomad/gnomad.v3.1.pca_loadings.ht"
>>> path_to_gnomad_rf = "./gnomad/gnomad.v3.1.RF_fit.pkl"
>>> 
>>> 
>>> ## Import imputed genotype data
>>> mt_to_project = hl.import_bgen(bgen_path,
...                     entry_fields=['GT'],
...                     sample_file=sample_path,
...                     index_file_map={bgen_path: bgenidx_path})
2022-07-01 11:01:08 Hail: INFO: Number of BGEN files parsed: 1
2022-07-01 11:01:08 Hail: INFO: Number of samples in BGEN files: 78575
2022-07-01 11:01:08 Hail: INFO: Number of variants across all BGEN files: 75866
>>> 
>>> import pickle
>>> from gnomad.sample_qc.ancestry import assign_population_pcs
>>> 
>>> # Load MatrixTable for projection and gnomAD loadings Hail Table
>>> loadings_ht = hl.read_table(path_to_gnomad_loadings)
>>> 
>>> # Project new genotypes onto loadings
>>> ht = hl.experimental.pc_project(
...     mt_to_project.GT,
...     loadings_ht.loadings,
...     loadings_ht.pca_af,
... )
2022-07-01 11:01:10 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'
>>> 
>>> 
>>> # Assign global ancestry using the gnomAD RF model and PC project scores
>>> # Loading of the v2 RF model requires an older version of scikit-learn, this can be installed using pip install -U scikit-learn==0.21.3
>>> with hl.hadoop_open(path_to_gnomad_rf, "rb") as f:
...     fit = pickle.load(f)
... 
/jhpce/shared/jhpce/core/python/3.9.10/lib/python3.9/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.23.2 when using version 1.0.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations
  warnings.warn(
/jhpce/shared/jhpce/core/python/3.9.10/lib/python3.9/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator RandomForestClassifier from version 0.23.2 when using version 1.0.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations
  warnings.warn(
>>> # Reduce the scores to only those used in the RF model, this was 6 for v2 and 16 for v3.1
>>> num_pcs = fit.n_features_
/jhpce/shared/jhpce/core/python/3.9.10/lib/python3.9/site-packages/sklearn/utils/deprecation.py:103: FutureWarning: Attribute `n_features_` was deprecated in version 1.0 and will be removed in 1.2. Use `n_features_in_` instead.
  warnings.warn(msg, category=FutureWarning)
>>> ht = ht.annotate(scores=ht.scores[:num_pcs])
>>> htt, rf_model = assign_population_pcs(
...     ht,
...     pc_cols=ht.scores,
...     fit=fit,
... )
2022-07-01 11:04:02 Hail: INFO: Coerced sorted dataset
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/users/jzhang2/.local/lib/python3.9/site-packages/gnomad/sample_qc/ancestry.py", line 174, in assign_population_pcs
    num_out_cols = min([len(x) for x in pop_pc_pd["pca_scores"].values.tolist()])
  File "/users/jzhang2/.local/lib/python3.9/site-packages/gnomad/sample_qc/ancestry.py", line 174, in <listcomp>
    num_out_cols = min([len(x) for x in pop_pc_pd["pca_scores"].values.tolist()])
TypeError: object of type 'NoneType' has no len()
>>> 
>>> 
jkgoodrich commented 1 year ago

Hi @Jingning-Zhang, thank you for reaching out, and apologies for the long delay in responding. Unfortunately, I wasn't able to reproduce your error with our test dataset.

If you are still having trouble with this and would like assistance, I have some follow-up questions.
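
One thing that might be worth checking in the meantime (this is a guess based on the traceback, not a confirmed cause): the TypeError is raised when a None shows up in the exported pca_scores values, which would happen if pc_project returned a missing scores array for some samples. A minimal sketch of that check, using the ht table from your session:

# Hypothetical diagnostic: count samples whose projected `scores` array is
# missing after pc_project; any such sample would appear as None inside
# assign_population_pcs and trigger the len(None) TypeError.
n_missing = ht.aggregate(hl.agg.count_where(hl.is_missing(ht.scores)))
print(f"Samples with missing projected scores: {n_missing}")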

Additionally, we have updated that function since the blog post went out, and this issue helped us notice that the current blog post code will not work with those updates. That doesn't look like the cause of your error, though: the line indicated in your traceback, num_out_cols = min([len(x) for x in pop_pc_pd["pca_scores"].values.tolist()]), no longer exists in the new code, so your error message would be different if you were running the updated version. I have put in a PR to modify the new code so that the blog post example works again.

If you try this again with the newest version of gnomad_methods (v0.6.3), you will need to modify the code in the following way. This:

htt, rf_model = assign_population_pcs(
    ht,
    pc_cols=ht.scores,
    fit=fit,
)

should change to:

htt, rf_model = assign_population_pcs(
    ht,
    pc_cols=range(num_pcs),
    fit=fit,
)

This should be fixed in the next release so the original blog post code can work as expected.
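
If it helps, once the call succeeds you can peek at the assignments like this (a minimal sketch; assign_population_pcs writes the predicted label to its output column, which defaults to "pop"):

# Hypothetical usage check: show the predicted ancestry labels
# (the output column defaults to "pop").
htt.select("pop").show(5)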

Thanks, Julia

jkgoodrich commented 1 year ago

The newest release, v0.6.4, fixes the assign_population_pcs function so that it works with the code in the blog post again. Closing the issue since I think this is now fixed, but please reach out if it's still a problem.
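
For anyone landing here later, a quick way to confirm which release is installed before deciding which pc_cols form to pass (a minimal sketch; it assumes the package was installed from PyPI under the distribution name gnomad):

# Check the installed gnomad_methods release.
from importlib.metadata import version
print(version("gnomad"))  # e.g. 0.6.4, which includes this fix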