aertslab / pycisTopic

pycisTopic is a Python module to simultaneously identify cell states and cis-regulatory topics from single cell epigenomics data.
Other
58 stars 12 forks source link

ValueError: Cannot convert float NaN to integer for compute_qc_stats from pycisTopic.qc #81

Closed orian116 closed 1 year ago

orian116 commented 1 year ago

Following the tutorial on https://scenicplus.readthedocs.io/en/latest/pbmc_multiome_tutorial.html, I seem to be getting data type conversion error at the qc step. I tried to remove NaNs in the annot dataset and the consensus_bed and fragments files dont appear to have any NaNs

The error is coming from:

metadata_bc, profile_data_dict = compute_qc_stats(
                fragments_dict = fragments_dict,
                tss_annotation = annot,
                stats=['barcode_rank_plot', 'duplicate_rate', 'insert_size_distribution', 'profile_tss', 'frip'],
                label_list = None,
                path_to_regions = path_to_regions,
                n_cpu = 1,
                valid_bc = None,
                n_frag = 100,
                n_bc = None,
                tss_flank_window = 1000,
                tss_window = 50,
                tss_minimum_signal_window = 100,
                tss_rolling_window = 10,
                remove_duplicates = True,
                _temp_dir = os.path.join(tmp_dir + 'ray_spill'))

The error seems to be coming from the creating coverage matrix step. This is the error output:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[99], line 3
      1 path_to_regions = {'10x_pbmc':os.path.join(work_dir, 'scATAC/consensus_peak_calling/consensus_regions.bed')}
----> 3 metadata_bc, profile_data_dict = compute_qc_stats(
      4                 fragments_dict = fragments_dict,
      5                 tss_annotation = annot,
      6                 stats=['barcode_rank_plot', 'duplicate_rate', 'insert_size_distribution', 'profile_tss', 'frip'],
      7                 label_list = None,
      8                 path_to_regions = path_to_regions,
      9                 n_cpu = 1,
     10                 valid_bc = None,
     11                 n_frag = 100,
     12                 n_bc = None,
     13                 tss_flank_window = 1000,
     14                 tss_window = 50,
     15                 tss_minimum_signal_window = 100,
     16                 tss_rolling_window = 10,
     17                 remove_duplicates = True,
     18                 _temp_dir = os.path.join(tmp_dir + 'ray_spill'))

File ~/SCENIC/conda_version/scenic/lib/python3.9/site-packages/pycisTopic/qc.py:1033, in compute_qc_stats(fragments_dict, tss_annotation, stats, label_list, path_to_regions, n_cpu, partition, valid_bc, n_frag, n_bc, tss_flank_window, tss_window, tss_minimum_signal_window, tss_rolling_window, min_norm, check_for_duplicates, remove_duplicates, use_polars, **kwargs)
   1031     ray.shutdown()
   1032 else:
-> 1033     qc_stats = [
   1034             compute_qc_stats_single(
   1035                 fragments_list[i],
   1036                 tss_annotation=tss_annotation,
   1037                 stats=stats,
   1038                 label=label_list[i],
   1039                 path_to_regions=path_to_regions[i],
   1040                 valid_bc=valid_bc,
   1041                 n_frag=n_frag,
   1042                 n_bc=n_bc,
   1043                 tss_flank_window=tss_flank_window,
   1044                 tss_window=tss_window,
   1045                 tss_minimum_signal_window=tss_minimum_signal_window,
   1046                 tss_rolling_window=tss_rolling_window,
   1047                 min_norm=min_norm,
   1048                 partition=partition,
   1049                 check_for_duplicates=check_for_duplicates,
   1050                 remove_duplicates=remove_duplicates,
   1051                 use_polars=use_polars,
   1052             )
   1053             for i in range(len(fragments_list))
   1054         ]
   1055 metadata_dict = {
   1056     key: x[key] for x in list(list(zip(*qc_stats))[0]) for key in x.keys()
   1057 }
   1058 profile_data_dict = {
   1059     key: x[key] for x in list(list(zip(*qc_stats))[1]) for key in x.keys()
   1060 }

File ~/SCENIC/conda_version/scenic/lib/python3.9/site-packages/pycisTopic/qc.py:1034, in <listcomp>(.0)
   1031     ray.shutdown()
   1032 else:
   1033     qc_stats = [
-> 1034             compute_qc_stats_single(
   1035                 fragments_list[i],
   1036                 tss_annotation=tss_annotation,
   1037                 stats=stats,
   1038                 label=label_list[i],
   1039                 path_to_regions=path_to_regions[i],
   1040                 valid_bc=valid_bc,
   1041                 n_frag=n_frag,
   1042                 n_bc=n_bc,
   1043                 tss_flank_window=tss_flank_window,
   1044                 tss_window=tss_window,
   1045                 tss_minimum_signal_window=tss_minimum_signal_window,
   1046                 tss_rolling_window=tss_rolling_window,
   1047                 min_norm=min_norm,
   1048                 partition=partition,
   1049                 check_for_duplicates=check_for_duplicates,
   1050                 remove_duplicates=remove_duplicates,
   1051                 use_polars=use_polars,
   1052             )
   1053             for i in range(len(fragments_list))
   1054         ]
   1055 metadata_dict = {
   1056     key: x[key] for x in list(list(zip(*qc_stats))[0]) for key in x.keys()
   1057 }
   1058 profile_data_dict = {
   1059     key: x[key] for x in list(list(zip(*qc_stats))[1]) for key in x.keys()
   1060 }

File ~/SCENIC/conda_version/scenic/lib/python3.9/site-packages/pycisTopic/qc.py:1217, in compute_qc_stats_single(fragments, tss_annotation, stats, label, path_to_regions, valid_bc, n_frag, n_bc, tss_flank_window, tss_window, tss_minimum_signal_window, tss_rolling_window, min_norm, partition, check_for_duplicates, remove_duplicates, use_polars)
   1214 if "profile_tss" in stats:
   1215     # TSS
   1216     log.info("Computing TSS profile for " + label)
-> 1217     profile_tss_metrics = profile_tss(
   1218         fragments=fragments_df,
   1219         annotation=tss_annotation,
   1220         valid_bc=valid_bc,
   1221         plot=False,
   1222         n_cpu=1,
   1223         partition=partition,
   1224         flank_window=tss_flank_window,
   1225         tss_window=tss_window,
   1226         minimum_signal_window=tss_minimum_signal_window,
   1227         rolling_window=tss_rolling_window,
   1228         min_norm=min_norm,
   1229         return_TSS_enrichment_per_barcode=True,
   1230         return_TSS_coverage_matrix_per_barcode=True,
   1231         return_plot_data=True,
   1232     )
   1233     if profile_tss_metrics is not None:
   1234         metrics["profile_tss"] = profile_tss_metrics

File ~/SCENIC/conda_version/scenic/lib/python3.9/site-packages/pycisTopic/qc.py:555, in profile_tss(fragments, annotation, valid_bc, plot, plot_data, n_cpu, partition, flank_window, tss_window, minimum_signal_window, rolling_window, min_norm, color, save, return_TSS_enrichment_per_barcode, return_TSS_coverage_matrix_per_barcode, return_plot_data, use_polars)
    544     TSS_matrix = pd.concat(
    545         [
    546             get_tss_matrix(
   (...)
    552         ]
    553     )
    554 else:
--> 555     TSS_matrix = get_tss_matrix(fragments, flank_window, tss_space_annotation)
    556 log.info("Coverage matrix done")
    557 if not TSS_matrix.columns.tolist() == list(range(2 * flank_window + 1)):

File ~/SCENIC/conda_version/scenic/lib/python3.9/site-packages/pycisTopic/utils.py:280, in get_tss_matrix(fragments, flank_window, tss_space_annotation)
    277 if len(overlap_with_TSS) == 0:
    278     return
--> 280 overlap_with_TSS["Strand"] = overlap_with_TSS["Strand"].astype(np.int32)
    281 overlap_with_TSS["start_pos"] = -(
    282     np.int32(overlap_with_TSS["Start_b"].values)
    283     + np.int32(flank_window)
    284     - np.int32(overlap_with_TSS["Start"].values)
    285 ) * np.int32(overlap_with_TSS["Strand"].values)
    286 overlap_with_TSS["end_pos"] = -(
    287     np.int32(overlap_with_TSS["Start_b"].values)
    288     + np.int32(flank_window)
    289     - np.int32(overlap_with_TSS["End"].values)
    290 ) * np.int32(overlap_with_TSS["Strand"].values)

File ~/SCENIC/conda_version/scenic/lib/python3.9/site-packages/pandas/core/generic.py:6245, in NDFrame.astype(self, dtype, copy, errors)
   6238     results = [
   6239         self.iloc[:, i].astype(dtype, copy=copy)
   6240         for i in range(len(self.columns))
   6241     ]
   6243 else:
   6244     # else, only a single dtype is given
-> 6245     new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   6246     return self._constructor(new_data).__finalize__(self, method="astype")
   6248 # GH 33113: handle empty frame or series

File ~/SCENIC/conda_version/scenic/lib/python3.9/site-packages/pandas/core/internals/managers.py:446, in BaseBlockManager.astype(self, dtype, copy, errors)
    445 def astype(self: T, dtype, copy: bool = False, errors: str = "raise") -> T:
--> 446     return self.apply("astype", dtype=dtype, copy=copy, errors=errors)

File ~/SCENIC/conda_version/scenic/lib/python3.9/site-packages/pandas/core/internals/managers.py:348, in BaseBlockManager.apply(self, f, align_keys, ignore_failures, **kwargs)
    346         applied = b.apply(f, **kwargs)
    347     else:
--> 348         applied = getattr(b, f)(**kwargs)
    349 except (TypeError, NotImplementedError):
    350     if not ignore_failures:

File ~/SCENIC/conda_version/scenic/lib/python3.9/site-packages/pandas/core/internals/blocks.py:527, in Block.astype(self, dtype, copy, errors)
    509 """
    510 Coerce to the new dtype.
    511 
   (...)
    523 Block
    524 """
    525 values = self.values
--> 527 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
    529 new_values = maybe_coerce_values(new_values)
    530 newb = self.make_block(new_values)

File ~/SCENIC/conda_version/scenic/lib/python3.9/site-packages/pandas/core/dtypes/astype.py:299, in astype_array_safe(values, dtype, copy, errors)
    296     return values.copy()
    298 try:
--> 299     new_values = astype_array(values, dtype, copy=copy)
    300 except (ValueError, TypeError):
    301     # e.g. astype_nansafe can fail on object-dtype of strings
    302     #  trying to convert to float
    303     if errors == "ignore":

File ~/SCENIC/conda_version/scenic/lib/python3.9/site-packages/pandas/core/dtypes/astype.py:227, in astype_array(values, dtype, copy)
    223     return values
    225 if not isinstance(values, np.ndarray):
    226     # i.e. ExtensionArray
--> 227     values = values.astype(dtype, copy=copy)
    229 else:
    230     values = astype_nansafe(values, dtype, copy=copy)

File ~/SCENIC/conda_version/scenic/lib/python3.9/site-packages/pandas/core/arrays/categorical.py:538, in Categorical.astype(self, dtype, copy)
    535     return super().astype(dtype, copy=copy)
    537 elif is_integer_dtype(dtype) and self.isna().any():
--> 538     raise ValueError("Cannot convert float NaN to integer")
    540 elif len(self.codes) == 0 or len(self.categories) == 0:
    541     result = np.array(
    542         self,
    543         dtype=dtype,
    544         copy=copy,
    545     )

ValueError: Cannot convert float NaN to integer

I'm running this on a jupyter notebook after setting up my environment as follows:

source scenic/bin/activate
conda create --name scenicplus python=3.8
conda activate scenicplus
git clone https://github.com/aertslab/scenicplus
cd scenicplus
pip install -e .
conda install ipykernel
python -m ipykernel install --user --name scenicplus
pip install "pydantic<2"
pip install pyrle
jupyter-lab

Any help would be greatly appreciated

ghuls commented 1 year ago

Can you show the first 10 lines of consensus_bed?

Citugulia40 commented 1 year ago

Hi, I am getting the same error and first 10 lines of consensus_bedfile are:

GL000194.1 98334 98834 Astrocyte_peak_1 32.77411693033765 .
GL000194.1 114754 115254 Oligodendrocyte_peak_3 18.571321502075012 .
GL000194.1 104372 104872 Astrocyte_peak_2,Oligodendrocyte_peak_1,Oligodendrocyte_peak_2 47.233286164310144 .
GL000195.1 32228 32728 Astrocyte_peak_5 149.41141541771574 .
GL000195.1 32880 33380 Astrocyte_peak_6 47.233286164310144 .
GL000195.1 30600 31100 Oligodendrocyte_peak_4,Oligodendrocyte_peak_5,Astrocyte_peak_3a,Astrocyte_peak_3b,Astrocyte_peak_4,Microglia_PVM_peak_1,OPC_peak_1 315.2098893006003 .
GL000195.1 31144 31644 OPC_peak_1,Oligodendrocyte_peak_6 117.30246975018129 .
GL000205.2 1206 1706 Oligodendrocyte_peak_7 8.622399268820542 .
GL000205.2 9098 9598 Astrocyte_peak_7 32.77411693033765 .
GL000205.2 39110 39610 Astrocyte_peak_8 20.24283692756149 .
Jinkyustar commented 1 year ago

is this issue solved? I'm hving same issues while running tutorial

Citugulia40 commented 1 year ago

No, I am still waiting for the reply

Let me know when you solve this.

SeppeDeWinter commented 1 year ago

Hi @Jinkyustar and @Citugulia40

I was not able to reproduce the error you are getting so you will have to help me a bit with trouble shooting.

Can you run the following commands and show the output?


annot

set(annot["Strand"])

from pycisTopic.utils import read_fragments_from_file
fragments = read_fragments_from_file(fragments_dict["10x_pbmc"]) #you might have to replace the key in your case
fragments

annotation = annot
flank_window = 1000

tss_space_annotation = annotation[["Chromosome", "Start", "Strand"]]
tss_space_annotation["End"] = tss_space_annotation["Start"] + flank_window
tss_space_annotation["Start"] = tss_space_annotation["Start"] - flank_window
tss_space_annotation = tss_space_annotation[
    ["Chromosome", "Start", "End", "Strand"]
]
tss_space_annotation = pr.PyRanges(tss_space_annotation)
overlap_with_TSS = fragments.join(tss_space_annotation, nb_cpu=1).df
overlap_with_TSS

set(overlap_with_TSS["Strand"])

Hope to solve this soon.

Best,

Seppe

Citugulia40 commented 1 year ago

Hi, Thanks for your reply.

Please find the output of these commands.

scenic_plus.pdf

Jinkyustar commented 1 year ago
dataset = pbm.Dataset(name='hsapiens_gene_ensembl',  host='http://www.ensembl.org')
annot = dataset.query(attributes=['chromosome_name', 'transcription_start_site', 'strand', 'external_gene_name', 'transcript_biotype'])
annot['Chromosome/scaffold name'] = annot['Chromosome/scaffold name'].to_numpy(dtype = str)
filter = annot['Chromosome/scaffold name'].str.contains('CHR|GL|JH|MT')
annot = annot[~filter]
annot['Chromosome/scaffold name'] = annot['Chromosome/scaffold name'].str.replace(r'(\b\S)', r'chr\1')
annot.columns=['Chromosome', 'Start', 'Strand', 'Gene', 'Transcript_type']
annot = annot[annot.Transcript_type == 'protein_coding']
#annot['Strand'] = annot['Strand'].replace({1: '+', -1: '-'})
annot["Strand"] = annot["Strand"].astype(np.int64)

This gives us Strand column as -1 and 1

annotation = annot
tss_space_annotation = annotation[["Chromosome", "Start", "Strand"]]
tss_space_annotation["End"] = tss_space_annotation["Start"] + 1000
tss_space_annotation["Start"] = tss_space_annotation["Start"] - 1000
tss_space_annotation = tss_space_annotation[
    ["Chromosome", "Start", "End", "Strand"]
]
tss_space_annotation
pr.pyranges_main.PyRanges(tss_space_annotation)

This then gives you NaN values in the Strand column

annot['Strand'] = annot['Strand'].replace({1: '+', -1: '-'})
#or
tss_space_annotation['Strand'] = tss_space_annotation['Strand'].replace({1: "+", -1: "-"})

replacing 1 and -1 to +/- helped getting "overlap_with_TSS" .

tss_space_annotation_gr = pr.PyRanges(tss_space_annotation)
overlap_with_TSS = fragments.join(tss_space_annotation_gr, nb_cpu=1).df

However then it cause problem with "get_tss_matrix" function because the strand column is categorical with +/-

def get_tss_matrix(fragments, flank_window, tss_space_annotation):
    """
    Get TSS matrix
    """
    overlap_with_TSS = fragments.join(tss_space_annotation, nb_cpu=1).df
    if len(overlap_with_TSS) == 0:
        return

    overlap_with_TSS["Strand"] = overlap_with_TSS["Strand"].astype(np.int32)
MariaRosariaNucera commented 1 year ago

I have the same issue, annot['Strand'] = annot['Strand'].replace({1: '+', -1: '-'}) solved the NaN error but then I get this one:

ValueError: Cannot cast object dtype to int32

SeppeDeWinter commented 1 year ago

Thank you @Citugulia40 , @Jinkyustar and @MariaRosariaNucera

I was able to reproduce the issue now. It is related to the pyranges version.

For versions above or equal to 0.0.128:


import pyranges as pr
pr.__version__
>>> '0.0.128'

import pybiomart as pbm
dataset = pbm.Dataset(name='hsapiens_gene_ensembl',  host='http://www.ensembl.org')
annot = dataset.query(attributes=['chromosome_name', 'transcription_start_site', 'strand', 'external_gene_name', 'transcript_biotype'])
annot['Chromosome/scaffold name'] = annot['Chromosome/scaffold name'].to_numpy(dtype = str)
#filter = annot['Chromosome/scaffold name'].str.contains('CHR|GL|JH|MT')
#annot = annot[~filter]
annot['Chromosome/scaffold name'] = annot['Chromosome/scaffold name'].str.replace(r'(\b\S)', r'chr\1')
annot.columns=['Chromosome', 'Start', 'Strand', 'Gene', 'Transcript_type']
annot = annot[annot.Transcript_type == 'protein_coding']
flank_window = 1000
annotation = annot
tss_space_annotation = annotation[["Chromosome", "Start", "Strand"]]
tss_space_annotation["End"] = tss_space_annotation["Start"] + flank_window
tss_space_annotation["Start"] = tss_space_annotation["Start"] - flank_window
tss_space_annotation = tss_space_annotation[
    ["Chromosome", "Start", "End", "Strand"]
]
tss_space_annotation = pr.PyRanges(tss_space_annotation)
tss_space_annotation

+--------------+-----------+-----------+--------------+
| Chromosome   | Start     | End       | Strand       |
| (category)   | (int64)   | (int64)   | (category)   |
|--------------+-----------+-----------+--------------|
| chr1         | 3068168   | 3070168   | nan          |
| chr1         | 3068197   | 3070197   | nan          |
| chr1         | 3068211   | 3070211   | nan          |
| chr1         | 3068203   | 3070203   | nan          |
| ...          | ...       | ...       | ...          |
| chrY         | 57066898  | 57068898  | nan          |
| chrY         | 57206481  | 57208481  | nan          |
| chrY         | 57183216  | 57185216  | nan          |
| chrY         | 57183226  | 57185226  | nan          |
+--------------+-----------+-----------+--------------+

For versions below 0.0.128


import pyranges as pr
pr.__version__

>>> '0.0.127'

import pybiomart as pbm
dataset = pbm.Dataset(name='hsapiens_gene_ensembl',  host='http://www.ensembl.org')
annot = dataset.query(attributes=['chromosome_name', 'transcription_start_site', 'strand', 'external_gene_name', 'transcript_biotype'])
annot['Chromosome/scaffold name'] = annot['Chromosome/scaffold name'].to_numpy(dtype = str)
#filter = annot['Chromosome/scaffold name'].str.contains('CHR|GL|JH|MT')
#annot = annot[~filter]
annot['Chromosome/scaffold name'] = annot['Chromosome/scaffold name'].str.replace(r'(\b\S)', r'chr\1')
annot.columns=['Chromosome', 'Start', 'Strand', 'Gene', 'Transcript_type']
annot = annot[annot.Transcript_type == 'protein_coding']
flank_window = 1000
annotation = annot
tss_space_annotation = annotation[["Chromosome", "Start", "Strand"]]
tss_space_annotation["End"] = tss_space_annotation["Start"] + flank_window
tss_space_annotation["Start"] = tss_space_annotation["Start"] - flank_window
tss_space_annotation = tss_space_annotation[
    ["Chromosome", "Start", "End", "Strand"]
]
tss_space_annotation = pr.PyRanges(tss_space_annotation)

tss_space_annotation

+--------------+-----------+-----------+--------------+
| Chromosome   | Start     | End       | Strand       |
| (category)   | (int64)   | (int64)   | (category)   |
|--------------+-----------+-----------+--------------|
| chr1         | 3068168   | 3070168   | 1            |
| chr1         | 3068197   | 3070197   | 1            |
| chr1         | 3068211   | 3070211   | 1            |
| chr1         | 3068203   | 3070203   | 1            |
| ...          | ...       | ...       | ...          |
| chrY         | 57066898  | 57068898  | 1            |
| chrY         | 57206481  | 57208481  | 1            |
| chrY         | 57183216  | 57185216  | 1            |
| chrY         | 57183226  | 57185226  | 1            |
+--------------+-----------+-----------+--------------+

For now I fixed the version to below 0.0.128, in the future we will have to update this code. https://github.com/aertslab/pycisTopic/commit/e563fb6647380e1d510bed38779adc2034ec6292

Did this solve your issue?

Best,

Seppe

MariaRosariaNucera commented 1 year ago

Thank you @SeppeDeWinter , I confirm I also get 1 and -1 instead of the NaNs using pyranges version = '0.0.127'.

Citugulia40 commented 1 year ago

I am also able to solve this with pyranges version = '0.0.127'. Thank you so much

orian116 commented 1 year ago

Thank you so much. I no longer get the error

veroniquevoisin commented 1 year ago

Hi, I seem to have the same error and when I check my pyranges version, it is '0.0.127'. Screen Shot 2023-10-05 at 9 06 08 AM