Schema/Metadata error when reading load_data_with_illum.parquet using polars

afermg commented 8 months ago

Taken from the slack thread here

Alán: Both a tool and an open question (for @ank but posted here because it is of general interest). To access cpg from polars (in a lazy and non-lazy manner) without credentials you can use

if lazy:
    # TODO Raise issue and find the problem with the image location datasets.
    # It seems to be related related to UTF8 encoding.
    # Example path failing:
    # 's3a://cellpainting-gallery/cpg0016-jump/source_10/workspace/load_data_csv/2021_08_17_U2OS_48_hr_run16/Dest210809-134534/load_data_with_illum.parquet'
    fs = S3FileSystem(anon=True)
    ds = dataset(path, filesystem=fs)
    result = pl.scan_pyarrow_dataset(ds)
else:
    # Read whole dataframe
    result = pl.read_parquet(
        path,
        use_pyarrow=True,
        storage_options={"anon": True},
    )

You can see the specific imports and documentation on https://github.com/broadinstitute/monorepo/blob/7b9f9d53db69dbfd56ca12bbcec39e882f878906/libs/jump_portrait/src/jump_portrait/s3.py#L140-L157Does anyone know where is the code that generated the load_data_with_illum.parquet? There's something off with the column types and I can't lazy-load it (but loading it whole works fine). I can use the above code to load profile parquets but not those. (edited) s3.py

if lazy:
    # TODO Raise issue and find the problem with the image location datasets.
    # It seems to be related related to UTF8 encoding.
    # Example path failing:
    # 's3a://cellpainting-gallery/cpg0016-jump/source_10/workspace/load_data_csv/2021_08_17_U2OS_48_hr_run16/Dest210809-134534/load_data_with_illum.parquet'

https://github.com/[broadinstitute/monorepo](https://github.com/broadinstitute/monorepo)|broadinstitute/monorepobroadinstitute/monorepo | Added by GitHub 14 replies

Shantanu 16 hours ago Here is the codehttps://github.com/jump-cellpainting/datasets-private/blob/ee6eede8add80a40a8d2fed6e5173944a60b17ed/load_data/curated_load_data.Rmd 16 hours ago I ran it on source by source basis 16 hours ago Here are PRs https://github.com/jump-cellpainting/datasets-private/pulls?q=load_data 16 hours ago Seems like a type issue

ComputeError: parquet error: Error { sourcelocation: ErrorLocation { type: "KeyValue", method: "value", byte_offset: 9300 }, error_kind: InvalidUtf8 { source: Utf8Error { valid_up_to: 92, error_len: Some(1) } } }

16 hours ago df = pd.read_parquet("~/Downloads/load_data_with_illum.parquet") works df = pl.read_parquet("~/Downloads/load_data_with_illum.parquet") does not 16 hours ago This worked:

df = pd.read_parquet("~/Downloads/load_data_with_illum.parquet") df.to_parquet("~/Downloads/load_data_with_illum_fixed.parquet") df = pl.read_parquet("~/Downloads/load_data_with_illum_fixed.parquet")

16 hours ago So something funky happening with typesHope that helps 13 hours ago Thanks! I can usually read it by setting use_pyarrow=True, I just need to find out what that does and how to use it when lazy-loading. 4 hours ago what you’ve found points to some issue with the data types, and fixing those might be easier (see the fix above, assuming it’s actually a fix and not some unwanted coercion) 37 minutes ago But that would require updating all the files on aws, which seems more laborious from my perspective. If this is a fringe case on how pandas saves parquets that is incompatible in certain cases with polars, it may have happened elsewhere in our datasets and we would have to also undertake the task of finding and replacing all those. 36 minutes ago Whereas if we use the pyarrow reader that seems to work for them with no inconvenience, the problem is automatically solved everywhere. 21 minutes ago We can do it en masse – it's not awful because we can index into all load datas using plate.csv.gz. The files are pretty light so it should be a few hours of wall time. Keep us posted. 21 minutes ago :taco: @Alán for working through this! < 1 minute ago

Alán Ok, I think I have a vague idea of what is going on. After seeing how they were generated from r, the schema that these files come with makes more sense. The problem is likely this metadata: ARROW:extension:name:'X UTF-8 ' + 49 and ARROW:extension:name: 'arrow.r.vctrs'.The schema is shown below for future reference. I will open an issue and paste all this so we can reference this in the future.

ds.schema, the problem is likely to be some columns  having thje
Out[20]: 
Metadata_Source: string
Metadata_Batch: string
Metadata_Plate: string
Metadata_Well: string
Metadata_Site: string
FileName_IllumAGP: string
FileName_IllumDNA: string
FileName_IllumER: string
FileName_IllumMito: string
FileName_IllumRNA: string
FileName_OrigAGP: string
FileName_OrigDNA: string
FileName_OrigER: string
FileName_OrigMito: string
FileName_OrigRNA: string
PathName_IllumAGP: string
  -- field metadata --
  ARROW:extension:metadata: 'X
UTF-8   ' + 49
  ARROW:extension:name: 'arrow.r.vctrs'
PathName_IllumDNA: string
  -- field metadata --
  ARROW:extension:metadata: 'X
UTF-8   ' + 49
  ARROW:extension:name: 'arrow.r.vctrs'
PathName_IllumER: string
  -- field metadata --
  ARROW:extension:metadata: 'X
UTF-8   ' + 49
  ARROW:extension:name: 'arrow.r.vctrs'
PathName_IllumMito: string
  -- field metadata --
  ARROW:extension:metadata: 'X
UTF-8   ' + 49
  ARROW:extension:name: 'arrow.r.vctrs'
PathName_IllumRNA: string
  -- field metadata --
  ARROW:extension:metadata: 'X
UTF-8   ' + 49
  ARROW:extension:name: 'arrow.r.vctrs'
PathName_OrigAGP: string
  -- field metadata --
  ARROW:extension:metadata: 'X
UTF-8   ' + 49
  ARROW:extension:name: 'arrow.r.vctrs'
PathName_OrigDNA: string
  -- field metadata --
  ARROW:extension:metadata: 'X
UTF-8   ' + 49
  ARROW:extension:name: 'arrow.r.vctrs'
PathName_OrigER: string
  -- field metadata --
  ARROW:extension:metadata: 'X
UTF-8   ' + 49
  ARROW:extension:name: 'arrow.r.vctrs'
PathName_OrigMito: string
  -- field metadata --
  ARROW:extension:metadata: 'X
UTF-8   ' + 49
  ARROW:extension:name: 'arrow.r.vctrs'
PathName_OrigRNA: string
  -- field metadata --
  ARROW:extension:metadata: 'X
UTF-8   ' + 49
  ARROW:extension:name: 'arrow.r.vctrs'
-- schema metadata --
r: 'A
3
262402
197888
5
UTF-8
531
2
531
1
787
3
531
53
787
0
1026
1
26215' + 6439

afermg commented 8 months ago

The way it works when loading it using pyarrow is because pyarrow uses ffspec to load the whole table into memory at once [1]. I need to see if we can force the schema of all columns to be pl.Utf8 strings.

[1] https://github.com/apache/arrow/blob/0402e306a9d9f57ff22c87bf8689b8e7203483e5/python/pyarrow/parquet/core.py#L1454C1-L1458C1

afermg commented 8 months ago

I am pretty convinced that this is the R pyarrow library changing the schema metadata when facing string columns of longer lengths. The metadata being whacky matches long columns

Whacky field metadata

ds.schema
Out[224]: 
Metadata_Source: string
Metadata_Batch: string
Metadata_Plate: string
Metadata_Well: string
Metadata_Site: string
FileName_IllumAGP: string
FileName_IllumDNA: string
FileName_IllumER: string
FileName_IllumMito: string
FileName_IllumRNA: string
FileName_OrigAGP: string
FileName_OrigDNA: string
FileName_OrigER: string
FileName_OrigMito: string
FileName_OrigRNA: string
PathName_IllumAGP: string
  -- field metadata --
  ARROW:extension:metadata: 'X
���������UTF-8�������������    �' + 49
  ARROW:extension:name: 'arrow.r.vctrs'
PathName_IllumDNA: string
  -- field metadata --
  ARROW:extension:metadata: 'X
���������UTF-8�������������    �' + 49
  ARROW:extension:name: 'arrow.r.vctrs'
PathName_IllumER: string
  -- field metadata --
  ARROW:extension:metadata: 'X
���������UTF-8�������������    �' + 49
  ARROW:extension:name: 'arrow.r.vctrs'
PathName_IllumMito: string
  -- field metadata --
  ARROW:extension:metadata: 'X
���������UTF-8�������������    �' + 49
  ARROW:extension:name: 'arrow.r.vctrs'
PathName_IllumRNA: string
  -- field metadata --
  ARROW:extension:metadata: 'X
���������UTF-8�������������    �' + 49
  ARROW:extension:name: 'arrow.r.vctrs'
PathName_OrigAGP: string
  -- field metadata --
  ARROW:extension:metadata: 'X
���������UTF-8�������������    �' + 49
  ARROW:extension:name: 'arrow.r.vctrs'
PathName_OrigDNA: string
  -- field metadata --
  ARROW:extension:metadata: 'X
���������UTF-8�������������    �' + 49
  ARROW:extension:name: 'arrow.r.vctrs'
PathName_OrigER: string
  -- field metadata --
  ARROW:extension:metadata: 'X
���������UTF-8�������������    �' + 49
  ARROW:extension:name: 'arrow.r.vctrs'
PathName_OrigMito: string
  -- field metadata --
  ARROW:extension:metadata: 'X
���������UTF-8�������������    �' + 49
  ARROW:extension:name: 'arrow.r.vctrs'
PathName_OrigRNA: string
  -- field metadata --
  ARROW:extension:metadata: 'X
���������UTF-8�������������    �' + 49
  ARROW:extension:name: 'arrow.r.vctrs'
-- schema metadata --
r: 'A
3
262402
197888
5
UTF-8
531
2
531
1
787
3
531
53
787
0
1026
1
26215' + 6439

Length of columns:


result.with_columns(pl.all().str.len_bytes()).max().to_dicts()[0]

In [222]: Out[222]: 
{'Metadata_Source': 9,
 'Metadata_Batch': 27,
 'Metadata_Plate': 17,
 'Metadata_Well': 3,
 'Metadata_Site': 1,
 'FileName_IllumAGP': 30,
 'FileName_IllumDNA': 30,
 'FileName_IllumER': 29,
 'FileName_IllumMito': 31,
 'FileName_IllumRNA': 30,
 'FileName_OrigAGP': 47,
 'FileName_OrigDNA': 47,
 'FileName_OrigER': 47,
 'FileName_OrigMito': 47,
 'FileName_OrigRNA': 47,
 'PathName_IllumAGP': 107,
 'PathName_IllumDNA': 107,
 'PathName_IllumER': 107,
 'PathName_IllumMito': 107,
 'PathName_IllumRNA': 107,
 'PathName_OrigAGP': 109,
 'PathName_OrigDNA': 109,
 'PathName_OrigER': 109,
 'PathName_OrigMito': 109,
 'PathName_OrigRNA': 109}

afermg commented 8 months ago

I sorted it out by ignoring the metadata introduced by R. See commit fc3107d for details. Hopefully it won't add much overhead. This solution does not mean that we can't also fix it from data-producing side @shntnu. The solution is to avoid providing metadata when saving the files, specifically for columns with long filenames. Let me know if this suffices as a solution and I should close the issue, or if you want to wait until the datasets are updated with the metadata-less verisons.

shntnu commented 8 months ago

Let me know if this suffices as a solution and I should close the issue, or if you want to wait until the datasets are updated with the metadata-less verisons.

Certainly please keep going, keeping this fix in mind

Also, please create an issue in https://github.com/jump-cellpainting/datasets-private/ about fixing the parquets

broadinstitute / monorepo

Schema/Metadata error when reading load_data_with_illum.parquet using polars #21