PMBio / deeprvat

Conversion error (df to pq) in merge_annotations #59

Closed Jonas-B-Frank closed 6 months ago

Jonas-B-Frank commented 6 months ago

I am running the most recent version of deepRVAT on a Slurm-based HPC system. The Snakefiles have been adapted to include resources and partition. The data are from a WGS cohort, split by chromosome to increase speed. I ran the preprocessing pipeline and it worked just fine (I excluded the HWE QC), and I am currently running the annotation pipeline. In rule merge_annotations I get the following warning, followed by an error (one error per chromosome, but always the same issue):

In line 1052 of annotations.py the warning:

Path/deepRVAT_new/deeprvat/annotations/annotations.py:1052: DtypeWarning: Columns (3,4,5,7,8,9,10,11,16,18,19,20,21,22,23,26,27,28,29,30,31,32,33,35,36,37,38,39,40,41,42,71,72,73,74,76,78,80,83) have mixed types. Specify dtype option on import or set low_memory=False.
  vep_df = pd.read_csv(

In line 1098 of annotations.py the error:

Traceback (most recent call last):
  File "Path/deepRVAT_new/deeprvat/annotations/annotations.py", line 1279, in <module>
    cli()
  File "Path/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "Path/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "Path/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "Path/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "Path/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "Path/deepRVAT_new/deeprvat/annotations/annotations.py", line 1098, in merge_annotations
    ca.to_parquet(out_file)
  File "Path/envs/deeprvat_annotations/lib/python3.9/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "Path/envs/deeprvat_annotations/lib/python3.9/site-packages/pandas/core/frame.py", line 2976, in to_parquet
    return to_parquet(
  File "Path/envs/deeprvat_annotations/lib/python3.9/site-packages/pandas/io/parquet.py", line 430, in to_parquet
    impl.write(
  File "Path/envs/deeprvat_annotations/lib/python3.9/site-packages/pandas/io/parquet.py", line 174, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
  File "pyarrow/table.pxi", line 3557, in pyarrow.lib.Table.from_pandas
  File "Path/envs/deeprvat_annotations/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 624, in dataframe_to_arrays
    arrays[i] = maybe_fut.result()
  File "Path/envs/deeprvat_annotations/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "Path/envs/deeprvat_annotations/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "Path/envs/deeprvat_annotations/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "Path/envs/deeprvat_annotations/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 598, in convert_column
    raise e
  File "Path/envs/deeprvat_annotations/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 592, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 316, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 123, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: ("Expected bytes, got a 'float' object", 'Conversion failed for column PHENO with type object')

I guess that writing the df out to a parquet file makes it necessary to convert it to a PyArrow table beforehand, which fails because of the PHENO column. Looking forward to your insights!
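
An ArrowTypeError of the form "Expected bytes, got a 'float' object" is what pyarrow typically raises when an object-dtype column holds both strings and floats, which would fit the mixed-type DtypeWarning from pd.read_csv. A minimal sketch for checking this with pandas alone follows; the helper names mixed_type_columns and stringify_columns and the toy DataFrame are hypothetical and not part of deeprvat.

```python
import pandas as pd


def mixed_type_columns(df: pd.DataFrame) -> dict:
    """Map each object-dtype column to the Python type names it holds.

    Only columns with more than one type (nulls ignored) are reported.
    """
    report = {}
    for name, dtype in df.dtypes.items():
        if dtype != object:
            continue
        kinds = {type(value).__name__ for value in df[name].dropna()}
        if len(kinds) > 1:
            report[name] = kinds
    return report


def stringify_columns(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Return a copy in which the given columns use the nullable string dtype."""
    out = df.copy()
    for name in columns:
        out[name] = out[name].astype("string")
    return out


# Toy data (assumption): PHENO mixes text and float, as chunked parsing can produce.
toy = pd.DataFrame({"PHENO": ["1", 1.0, None], "Gene": ["A", "B", "C"]})

mixed = mixed_type_columns(toy)        # reports PHENO with both str and float
clean = stringify_columns(toy, mixed)  # PHENO is now uniformly text or missing
```

After such a cast, writing clean with to_parquet should no longer fail on that column, although whether storing it as text is appropriate depends on how the column is used downstream.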

Marcel-Mueck commented 6 months ago

Hey Jonas, thank you for submitting the issue. It is possible that the PHENO column contains only NA values, so its type cannot be inferred when converting it to a pyarrow table. The PR https://github.com/PMBio/deeprvat/pull/54, which refers to the branch annotations-new-features, contains updates to the annotation pipeline that avoid this issue by dropping some VEP columns that we do not need for deeprvat, including the PHENO column. We will soon merge this PR into the main branch. However, if you want to fix this issue now, I would recommend using the annotations-new-features branch for the annotation step.
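
For readers who cannot switch branches, the two remedies mentioned in this thread (giving pd.read_csv an explicit dtype, or not carrying PHENO forward at all) can be sketched as below. This is an illustration only and not the code from the PR: the tab separator, the column names other than PHENO, and the values are assumptions made up for the example.

```python
import io

import pandas as pd

# Hypothetical stand-in for a VEP output table; real files have many more columns.
vep_text = (
    "Uploaded_variation\tGene\tPHENO\n"
    "var1\tG1\t1\n"
    "var2\tG2\t1,0\n"
    "var3\tG3\t-\n"
)

# Option A: read PHENO as text so that every row has the same type.
as_text = pd.read_csv(
    io.StringIO(vep_text),
    sep="\t",
    dtype={"PHENO": "string"},
    low_memory=False,
)

# Option B: skip the column while reading, if it is not needed downstream.
without_pheno = pd.read_csv(
    io.StringIO(vep_text),
    sep="\t",
    usecols=lambda name: name != "PHENO",
)
```

Option A keeps the information but fixes its type; Option B mirrors the approach described above of dropping columns that are not needed.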

Marcel-Mueck commented 5 months ago

Hey Jonas, just letting you know that the issue has been addressed, and the changes are part of the main branch of deeprvat now. Regards, Marcel Mück