broadinstitute / gnomad-browser

Explore gnomAD datasets on the web
https://gnomad.broadinstitute.org
MIT License
81 stars 44 forks source link

Error of "TypeError: object of type 'NoneType' has no len()" when doing ancestry classification #926

Closed Jingning-Zhang closed 2 years ago

Jingning-Zhang commented 2 years ago

What you did:

We want to use the ancestry classification tool to predict genetic ancestry of UKBB samples. (https://gnomad.broadinstitute.org/news/2021-09-using-the-gnomad-ancestry-principal-components-analysis-loadings-and-random-forest-classifier-on-your-dataset/)

What happened:

There is an error of "TypeError: object of type 'NoneType' has no len()" when I finally did the prediction.

Below I attach all my codes and their outputs.


Python 3.9.10 (main, Feb 22 2022, 16:34:24) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
>>> # Two measurements at the initial visit (*_0_0 and *_0_1) are averaged
>>> import pandas as pd

## Import imputed genotype data
mt_to_project = hl.import_bgen(bgen_path,
                    entry_fields=['GT'],
                    sample_file=sample_path,
                    index_file_map={bgen_path: bgenidx_path})

import pickle
from gnomad.sample_qc.ancestry import assign_population_pcs

# Load MatrixTable for projection and gnomAD loadings Hail Table
loadings_ht = hl.read_table(path_to_gnomad_loadings)

# Project new genotypes onto loadings
ht = hl.experimental.pc_project(
    mt_to_project.GT,
    loadings_ht.loadings,
    loadings_ht.pca_af,
)

>>> import os
>>> import sys
>>> os.environ["PYSPARK_SUBMIT_ARGS"]="--total-executor-cores 6 --executor-memory 5g --driver-memory 10g pyspark-shell"
>>> 
>>> import hail as hl

# Assign global ancestry using the gnomAD RF model and PC project scores
# Loading of the v2 RF model requires an older version of scikit-learn, this can be installed using pip install -U scikit-learn==0.21.3
with hl.hadoop_open(path_to_gnomad_rf, "rb") as f:
    fit = pickle.load(f)

# Reduce the scores to only those used in the RF model, this was 6 for v2 and 16 for v3.1
num_pcs = fit.n_features_
ht = ht.annotate(scores=ht.scores[:num_pcs])
htt, rf_model = assign_population_pcs(
    ht,
    pc_cols=ht.scores,
    fit=fit,
)

>>> hl.init()
2022-07-01 11:01:00 WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-07-01 11:01:01 WARN  Hail:43 - This Hail JAR was compiled for Spark 3.1.2, running with Spark 3.1.3.
  Compatibility is not guaranteed.
Running on Apache Spark version 3.1.3
SparkUI available at http://compute-106.cm.cluster:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.96-39909e0a396f
LOGGING: writing to /users/jzhang2/hail-20220701-1100-0.2.96-39909e0a396f.log
>>> 
>>> ## Path to input files
>>> bgen_path = "./chrall.bgen"
>>> sample_path = "./chrall.sample"
>>> bgenidx_path = "./chrall.idx2"
>>> path_to_gnomad_loadings = "./gnomad/gnomad.v3.1.pca_loadings.ht"
>>> path_to_gnomad_rf = "./gnomad/gnomad.v3.1.RF_fit.pkl"
>>> 
>>> 
>>> ## Import imputed genotype data
>>> mt_to_project = hl.import_bgen(bgen_path,
...                     entry_fields=['GT'],
...                     sample_file=sample_path,
...                     index_file_map={bgen_path: bgenidx_path})
2022-07-01 11:01:08 Hail: INFO: Number of BGEN files parsed: 1
2022-07-01 11:01:08 Hail: INFO: Number of samples in BGEN files: 78575
2022-07-01 11:01:08 Hail: INFO: Number of variants across all BGEN files: 75866
>>> 
>>> import pickle
>>> from gnomad.sample_qc.ancestry import assign_population_pcs
>>> 
>>> # Load MatrixTable for projection and gnomAD loadings Hail Table
>>> loadings_ht = hl.read_table(path_to_gnomad_loadings)
>>> 
>>> # Project new genotypes onto loadings
>>> ht = hl.experimental.pc_project(
...     mt_to_project.GT,
...     loadings_ht.loadings,
...     loadings_ht.pca_af,
... )
2022-07-01 11:01:10 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'
>>> 
>>> 
>>> # Assign global ancestry using the gnomAD RF model and PC project scores
>>> # Loading of the v2 RF model requires an older version of scikit-learn, this can be installed using pip install -U scikit-learn==0.21.3
>>> with hl.hadoop_open(path_to_gnomad_rf, "rb") as f:
...     fit = pickle.load(f)
... 
/jhpce/shared/jhpce/core/python/3.9.10/lib/python3.9/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.23.2 when using version 1.0.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations
  warnings.warn(
/jhpce/shared/jhpce/core/python/3.9.10/lib/python3.9/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator RandomForestClassifier from version 0.23.2 when using version 1.0.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations
  warnings.warn(
>>> # Reduce the scores to only those used in the RF model, this was 6 for v2 and 16 for v3.1
>>> num_pcs = fit.n_features_
/jhpce/shared/jhpce/core/python/3.9.10/lib/python3.9/site-packages/sklearn/utils/deprecation.py:103: FutureWarning: Attribute `n_features_` was deprecated in version 1.0 and will be removed in 1.2. Use `n_features_in_` instead.
  warnings.warn(msg, category=FutureWarning)
>>> ht = ht.annotate(scores=ht.scores[:num_pcs])
>>> htt, rf_model = assign_population_pcs(
...     ht,
...     pc_cols=ht.scores,
...     fit=fit,
... )
2022-07-01 11:04:02 Hail: INFO: Coerced sorted dataset            (12 + 2) / 16]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/users/jzhang2/.local/lib/python3.9/site-packages/gnomad/sample_qc/ancestry.py", line 174, in assign_population_pcs
    num_out_cols = min([len(x) for x in pop_pc_pd["pca_scores"].values.tolist()])
  File "/users/jzhang2/.local/lib/python3.9/site-packages/gnomad/sample_qc/ancestry.py", line 174, in <listcomp>
    num_out_cols = min([len(x) for x in pop_pc_pd["pca_scores"].values.tolist()])
TypeError: object of type 'NoneType' has no len()
>>> 
>>> 
mattsolo1 commented 2 years ago

Hi Jingning, can you please post your question at https://github.com/broadinstitute/gnomad_methods?

Jingning-Zhang commented 2 years ago

Hi Jingning, can you please post your question at https://github.com/broadinstitute/gnomad_methods?

Sure. I have posted it at https://github.com/broadinstitute/gnomad_methods. Thank you!