polars.exceptions.ComputeError: found more fields than defined in 'Schema'

mariamcguinness commented 4 months ago

We are running into an error while trying to extract data from the UK Biobank. We ran this command:

$ python UKBB-tabular-processing/melted_UKBB_extract.py --config-file config.brainMRI.yaml \
    --data-file ukb671074.melt.arrow \
    --dictionary-file Metadata/Data_Dictionary_Showcase.tsv \
    --coding-file Metadata/Codings.tsv \
    --category-tree-file catbrowse.txt \
    --data-field-prop-file field.txt \
    --output-prefix df_whole_brain_mri/df_whole_brain_mri_ \
    -v

This is what the config.brainMRI.yaml file looks like:

## Filtering section

# List of FieldIDs to extract, null to extract all
# See Data_Dictionary_Showcase.tsv for a concise list
FieldIDs:
# Ex: Sex
 - 25009
 - 25007
 - 25005

# Instance aka timepoint, 1-4, null for all
InstanceIDs:
  - null

# Specific subjects to extract, null for all
SubjectIDs:
  - null
# Alternatively, one or more filenames which contain flat lists of SubjectIDs
SubjectIDFiles:
  - null
# For fields with array components, null for all
ArrayIDs:
  - null
# Use pre-defined Categories of FieldIDs, added to list above, null for none
Categories:
 - null

## Output control section

# Replicate non-instanced data (aka Sex, other single-point measurements)
# across all instances
replicate_non_instanced: true

# Use data dictionary to recode FieldIDs as <Name>_<FieldID>
recode_field_names: true

# Use data dictionary and coding file to replace FieldValues with decoded entries
recode_data_values: true

# Some FieldValues were saved as empty strings instead of NA, drop these
drop_empty_strings: true

# Strings to drop as null
drop_null_strings:
  - "Do not know"
  - "Prefer not to answer"
  - "Time uncertain/unknown"
  - "Test not completed"
  - "Location could not be mapped"
  - "Abandoned"
  - "Next button not pressed"
  - "Trail not completed"
  - "Do not remember"
  - "Preferred not to answer"

# Numeric values to map to null
drop_null_numerics:
  - 99999
  - -9999999
  - -999999.000
  - -99999.000

## Wide output control

# Produce a wide aka pivoted DataFrame in addition to the filtered narrow frame
wide: true

# Use data dictionary to assign proper datatypes to columns in wide output
# Only applies to binary arrow format
recode_wide_column_valuetypes: true

# Attempt to split compound type FieldValues into a list in wide output
convert_compound_to_list: false

# When recode_wide_column_valuetypes=true, some values produced by recode_data_values=true
# will break setting column datatypes
# Substitute these strings with the values set below
convert_less_than_value_integer: null
convert_less_than_value_continuous: null

This is the content of our directory:

(UKBB_tab) [mariamcg@nia1269 tabular]$ ls *
catbrowse.txt         field.txt         slurm-12369191.out
config.brainMRI.yaml  ids_t1w_ses2.txt  ukb671074.melt.arrow

df_whole_brain_mri:
df_whole_brain_mri_conversion.log

Metadata:
Codings.tsv  Data_Dictionary_Showcase.tsv

UKBB-tabular-processing:
config.py             melted_UKBB_extract.py  README.md         ukb_awk
config.template.yaml  __pycache__             requirements.txt

And this is the error we get:

(UKBB_tab) [mariamcg@nia1269 tabular]$ python UKBB-tabular-processing/melted_UKBB_extract.py --config-file config.brainMRI.yaml     --data-file ukb671074.melt.arrow     --dictionary-file Metadata/Data_Dictionary_Showcase.tsv     --coding-file Metadata/Codings.tsv     --category-tree-file catbrowse.txt     --data-field-prop-file field.txt     --output-prefix df_whole_brain_mri/df_whole_brain_mri_     -v
2024-03-15T13:39:00 Input configuration
2024-03-15T13:39:00 {'ArrayIDs': [],
 'Categories': [],
 'FieldIDs': [25009, 25007, 25005],
 'InstanceIDs': [],
 'SubjectIDFiles': [],
 'SubjectIDs': [],
 'convert_compound_to_list': False,
 'convert_less_than_value_continuous': None,
 'convert_less_than_value_integer': None,
 'drop_empty_strings': True,
 'drop_null_numerics': [99999, -9999999, -999999.0, -99999.0],
 'drop_null_strings': ['Do not know', 'Prefer not to answer',
                       'Time uncertain/unknown', 'Test not completed',
                       'Location could not be mapped', 'Abandoned',
                       'Next button not pressed', 'Trail not completed',
                       'Do not remember', 'Preferred not to answer'],
 'recode_data_values': True,
 'recode_field_names': True,
 'recode_wide_column_valuetypes': True,
 'replicate_non_instanced': True,
 'wide': True}
/gpfs/fs0/scratch/m/mchakrav/mariamcg/gender_mri/tabular/UKBB-tabular-processing/melted_UKBB_extract.py:142: DeprecationWarning: `lengths` is deprecated. It has been renamed to `len`.
  pl.when(pl.col("InstanceID").list.lengths() > 1)
/gpfs/fs0/scratch/m/mchakrav/mariamcg/gender_mri/tabular/UKBB-tabular-processing/melted_UKBB_extract.py:161: DeprecationWarning: `lengths` is deprecated. It has been renamed to `len_bytes`.
  data = data.filter(~(pl.col("FieldValue").str.lengths() == 0))
2024-03-15T13:39:00 Loading data from ukb671074.melt.arrow
join parallel: true
join parallel: true
join parallel: true
memory map ipc file
avg line length: 43.83496
std. dev. line length: 11.843322
initial row estimate: 503249
Could not mmap compressed IPC file, defaulting to normal read. Toggle off 'memory_map' to silence this warning.
avg line length: 514.73535
std. dev. line length: 161.61758
initial row estimate: 7123
no. of chunks: 1 processed by: 80 threads.
no. of chunks: 80 processed by: 80 threads.
avg line length: 437.4131
std. dev. line length: 179.19662
initial row estimate: 8063
no. of chunks: 80 processed by: 80 threads.
dataframe filtered
LEFT join dataframes finished
dataframe filtered
Traceback (most recent call last):
  File "/gpfs/fs0/scratch/m/mchakrav/mariamcg/gender_mri/tabular/UKBB-tabular-processing/melted_UKBB_extract.py", line 382, in <module>
    data, data_wide, dictionary, codings = extract_UKBB_tabular_data(
  File "/gpfs/fs0/scratch/m/mchakrav/mariamcg/gender_mri/tabular/UKBB-tabular-processing/melted_UKBB_extract.py", line 236, in extract_UKBB_tabular_data
    data = data.collect(streaming=True, no_optimization=True)
  File "/gpfs/fs1/home/m/mchakrav/mariamcg/.virtualenvs/UKBB_tab/lib/python3.9/site-packages/polars/lazyframe/frame.py", line 1934, in collect
    return wrap_df(ldf.collect())
polars.exceptions.ComputeError: found more fields than defined in 'Schema'

Consider setting 'truncate_ragged_lines=True'.
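
The traceback does not say which input file triggers the error. One way to narrow it down, assuming the problem is in one of the tab-separated text inputs (Data_Dictionary_Showcase.tsv, Codings.tsv, catbrowse.txt or field.txt) rather than in the arrow file, is to count fields per line and compare against the header. The sketch below uses only the Python standard library; `field_count_report` is a hypothetical helper name, not part of melted_UKBB_extract.py. It splits on tabs with quoting disabled, which may not match exactly how polars tokenizes the same file.

```python
import csv


def field_count_report(path, delimiter="\t"):
    """Return (expected_fields, ragged_rows) for a delimited text file.

    expected_fields is the number of fields in the header line.
    ragged_rows lists (line_number, n_fields) for every row that has
    MORE fields than the header, which is the situation the polars
    error message describes.
    """
    ragged = []
    with open(path, newline="", encoding="utf-8", errors="replace") as handle:
        # QUOTE_NONE: treat quote characters as ordinary text, so each
        # physical line is one row and line_num is the real line number.
        reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        header = next(reader, None)
        if header is None:
            # Empty file: nothing to compare against.
            return 0, ragged
        expected = len(header)
        for row in reader:
            if len(row) > expected:
                ragged.append((reader.line_num, len(row)))
    return expected, ragged
```

Calling `field_count_report("field.txt")` (and likewise for the other three text files) returns the header width and the line numbers of any rows that are wider than the header. A file with a non-empty second element is a likely candidate for the ragged-line error.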