CoBrALab / UKBB-tabular-processing

Scripts to handle the tabular data associated with the UK BioBank

polars.exceptions.ComputeError: found more fields than defined in 'Schema' #21

Closed mariamcguinness closed 4 months ago

mariamcguinness commented 4 months ago

We are running into an error while trying to extract data from the UK Biobank. We ran this command:

$ python UKBB-tabular-processing/melted_UKBB_extract.py --config-file config.brainMRI.yaml \
    --data-file ukb671074.melt.arrow \
    --dictionary-file Metadata/Data_Dictionary_Showcase.tsv \
    --coding-file Metadata/Codings.tsv \
    --category-tree-file catbrowse.txt \
    --data-field-prop-file field.txt \
    --output-prefix df_whole_brain_mri/df_whole_brain_mri_ \
    -v

This is what the config.brainMRI.yaml file looks like:

## Filtering section

# List of FieldIDs to extract, null to extract all
# See Data_Dictionary_Showcase.tsv for a concise list
FieldIDs:
# Ex: Sex
 - 25009
 - 25007
 - 25005

# Instance aka timepoint, 1-4, null for all
InstanceIDs:
  - null

# Specific subjects to extract, null for all
SubjectIDs:
  - null
# Alternatively, one or more filenames which contain flat lists of SubjectIDs
SubjectIDFiles:
  - null
# For fields with array components, null for all
ArrayIDs:
  - null
# Use pre-defined Categories of FieldIDs, added to list above, null for none
Categories:
 - null

## Output control section

# Replicate non-instanced data (aka Sex, other single-point measurements)
# across all instances
replicate_non_instanced: true

# Use data dictionary to recode FieldIDs as <Name>_<FieldID>
recode_field_names: true

# Use data dictionary and coding file to replace FieldValues with decoded entries
recode_data_values: true

# Some FieldValues were saved as empty strings instead of NA, drop these
drop_empty_strings: true

# Strings to drop as null
drop_null_strings:
  - "Do not know"
  - "Prefer not to answer"
  - "Time uncertain/unknown"
  - "Test not completed"
  - "Location could not be mapped"
  - "Abandoned"
  - "Next button not pressed"
  - "Trail not completed"
  - "Do not remember"
  - "Preferred not to answer"

# Numeric values to map to null
drop_null_numerics:
  - 99999
  - -9999999
  - -999999.000
  - -99999.000

## Wide output control

# Produce a wide aka pivoted DataFrame in addition to the filtered narrow frame
wide: true

# Use data dictionary to assign proper datatypes to columns in wide output
# Only applies to binary arrow format
recode_wide_column_valuetypes: true

# Attempt to split compound type FieldValues into a list in wide output
convert_compound_to_list: false

# When recode_wide_column_valuetypes=true, some values from recode_data_values=true
# will break setting column datatypes
# Substitute strings to values set below
convert_less_than_value_integer: null
convert_less_than_value_continuous: null

This is the content of our directory:

(UKBB_tab) [mariamcg@nia1269 tabular]$ ls *
catbrowse.txt         field.txt         slurm-12369191.out
config.brainMRI.yaml  ids_t1w_ses2.txt  ukb671074.melt.arrow

df_whole_brain_mri:
df_whole_brain_mri_conversion.log

Metadata:
Codings.tsv  Data_Dictionary_Showcase.tsv

UKBB-tabular-processing:
config.py             melted_UKBB_extract.py  README.md         ukb_awk
config.template.yaml  __pycache__             requirements.txt

And this is the error we get:

(UKBB_tab) [mariamcg@nia1269 tabular]$ python UKBB-tabular-processing/melted_UKBB_extract.py --config-file config.brainMRI.yaml     --data-file ukb671074.melt.arrow     --dictionary-file Metadata/Data_Dictionary_Showcase.tsv     --coding-file Metadata/Codings.tsv     --category-tree-file catbrowse.txt     --data-field-prop-file field.txt     --output-prefix df_whole_brain_mri/df_whole_brain_mri_     -v
2024-03-15T13:39:00 Input configuration
2024-03-15T13:39:00 {'ArrayIDs': [],
 'Categories': [],
 'FieldIDs': [25009, 25007, 25005],
 'InstanceIDs': [],
 'SubjectIDFiles': [],
 'SubjectIDs': [],
 'convert_compound_to_list': False,
 'convert_less_than_value_continuous': None,
 'convert_less_than_value_integer': None,
 'drop_empty_strings': True,
 'drop_null_numerics': [99999, -9999999, -999999.0, -99999.0],
 'drop_null_strings': ['Do not know', 'Prefer not to answer',
                       'Time uncertain/unknown', 'Test not completed',
                       'Location could not be mapped', 'Abandoned',
                       'Next button not pressed', 'Trail not completed',
                       'Do not remember', 'Preferred not to answer'],
 'recode_data_values': True,
 'recode_field_names': True,
 'recode_wide_column_valuetypes': True,
 'replicate_non_instanced': True,
 'wide': True}
/gpfs/fs0/scratch/m/mchakrav/mariamcg/gender_mri/tabular/UKBB-tabular-processing/melted_UKBB_extract.py:142: DeprecationWarning: `lengths` is deprecated. It has been renamed to `len`.
  pl.when(pl.col("InstanceID").list.lengths() > 1)
/gpfs/fs0/scratch/m/mchakrav/mariamcg/gender_mri/tabular/UKBB-tabular-processing/melted_UKBB_extract.py:161: DeprecationWarning: `lengths` is deprecated. It has been renamed to `len_bytes`.
  data = data.filter(~(pl.col("FieldValue").str.lengths() == 0))
2024-03-15T13:39:00 Loading data from ukb671074.melt.arrow
join parallel: true
join parallel: true
join parallel: true
memory map ipc file
avg line length: 43.83496
std. dev. line length: 11.843322
initial row estimate: 503249
Could not mmap compressed IPC file, defaulting to normal read. Toggle off 'memory_map' to silence this warning.
avg line length: 514.73535
std. dev. line length: 161.61758
initial row estimate: 7123
no. of chunks: 1 processed by: 80 threads.
no. of chunks: 80 processed by: 80 threads.
avg line length: 437.4131
std. dev. line length: 179.19662
initial row estimate: 8063
no. of chunks: 80 processed by: 80 threads.
dataframe filtered
LEFT join dataframes finished
dataframe filtered
Traceback (most recent call last):
  File "/gpfs/fs0/scratch/m/mchakrav/mariamcg/gender_mri/tabular/UKBB-tabular-processing/melted_UKBB_extract.py", line 382, in <module>
    data, data_wide, dictionary, codings = extract_UKBB_tabular_data(
  File "/gpfs/fs0/scratch/m/mchakrav/mariamcg/gender_mri/tabular/UKBB-tabular-processing/melted_UKBB_extract.py", line 236, in extract_UKBB_tabular_data
    data = data.collect(streaming=True, no_optimization=True)
  File "/gpfs/fs1/home/m/mchakrav/mariamcg/.virtualenvs/UKBB_tab/lib/python3.9/site-packages/polars/lazyframe/frame.py", line 1934, in collect
    return wrap_df(ldf.collect())
polars.exceptions.ComputeError: found more fields than defined in 'Schema'

Consider setting 'truncate_ragged_lines=True'.

gdevenyi commented 4 months ago

Thanks for the very detailed report.

One thing is missing, however: what are you trying to do? What is your intent with this configuration, and what are you going to do with the data you're extracting? Thanks.

gdevenyi commented 4 months ago

Please also provide the version of polars you are using (I presume conda list will show it?).

mariamcguinness commented 4 months ago

My intent is to extract the following FieldIDs: 25009, 25007, 25005, which are brain MRI measures for the second time point. We do not use conda; we are working in a Python virtual environment, and I am using polars 0.20.15.
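Without conda, the installed polars version can be read from inside the virtual environment itself. The following is a minimal sketch using only the Python standard library; `installed_version` is a hypothetical helper name chosen for illustration and is not part of UKBB-tabular-processing.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Run with the virtual environment's interpreter so the right site-packages is inspected.
    print("polars", installed_version("polars"))
```

From the shell, `pip show polars` run inside the activated environment reports the same information.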

gdevenyi commented 4 months ago

The only InstanceIDs with MRI data are 2 and 3. Can you adjust your YAML file and try again?

mariamcguinness commented 4 months ago

I tried instances 2 and 3, and still get the same error.
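The error text refers to rows that carry more fields than the header defines. One way to check whether any of the tab-separated metadata files actually contains such rows is to scan them with the standard library. This is only a diagnostic sketch: `find_ragged_rows` is a hypothetical helper, the thread does not say which input file triggered the error, and Python's `csv` module may treat quotes differently from the polars parser.

```python
import csv
import sys
from typing import Iterator, Tuple


def find_ragged_rows(path: str, delimiter: str = "\t") -> Iterator[Tuple[int, int, int]]:
    """Yield (line_number, n_fields, n_expected) for rows wider than the header."""
    # errors="replace" keeps the scan going if the file is not valid UTF-8.
    with open(path, newline="", encoding="utf-8", errors="replace") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, None)
        if header is None:
            return  # empty file, nothing to compare against
        expected = len(header)
        for row in reader:
            if len(row) > expected:
                # reader.line_num is the source line on which this row ended
                yield reader.line_num, len(row), expected


if __name__ == "__main__":
    for path in sys.argv[1:]:
        for line_no, found, expected in find_ragged_rows(path):
            print(f"{path}:{line_no}: {found} fields, header has {expected}")
```

It could be run against Metadata/Data_Dictionary_Showcase.tsv, Metadata/Codings.tsv, catbrowse.txt, and field.txt; no output means no row is wider than its header under these parsing assumptions.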

gdevenyi commented 4 months ago

Please try polars=0.18.0, the last version I worked with, to see if it's a change in polars.

mariamcguinness commented 4 months ago

Changing the version of polars fixed the issue, thank you!

gdevenyi commented 4 months ago

I will limit the polars version in requirements.txt so this doesn't happen again, and look into which change caused this.
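Besides a constraint in requirements.txt, a script can also warn at startup when it runs against a polars release newer than the last one it was tested with. The sketch below is an illustration only: the helper names are hypothetical, the default of 0.18.0 is simply the version reported as working in this thread, and the exact constraint added to requirements.txt is not shown here.

```python
import warnings
from importlib.metadata import PackageNotFoundError, version
from typing import Tuple


def release_tuple(version_string: str) -> Tuple[int, ...]:
    """Turn '0.20.15' into (0, 20, 15), stopping at the first non-numeric suffix."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_tested_version(installed: str, last_tested: str = "0.18.0") -> bool:
    """True if `installed` is not newer than `last_tested`."""
    return release_tuple(installed) <= release_tuple(last_tested)


def warn_if_untested(package: str = "polars", last_tested: str = "0.18.0") -> None:
    """Emit a warning when the installed package is newer than the tested release."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return  # nothing installed, nothing to compare
    if not is_tested_version(installed, last_tested):
        warnings.warn(
            f"{package} {installed} is newer than the last tested version "
            f"({last_tested}); behaviour may differ."
        )
```

Calling `warn_if_untested()` near the top of a script would surface a version mismatch before any data is loaded, rather than as a parsing error later on.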