DataBiosphere / data-explorer-indexers

BSD 3-Clause "New" or "Revised" License
4 stars 5 forks source link

participant_id column should be case insensitive #133

Closed wnojopra closed 5 years ago

wnojopra commented 5 years ago

Earlier today the amp_pd tables updated, and some of the tables have PARTICIPANT_ID while others have participant_id. Indexing fails with:

indexer_1_bf5ee5f6c002 | File "indexer.py", in _add_participant_table_to_participant_docs indexer_1_bf5ee5f6c002 | participant_id = row[participant_id_column] indexer_1_bf5ee5f6c002 | KeyError: u'PARTICIPANT_ID'

According to https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical#case-sensitivity , column names are case insensitive, so it'd be nice if in this case PARTICIPANT_ID was treated the same as participant_id.

melissachang commented 5 years ago

What if you read the 2 tables into a notebook pandas dataframe? I suspect the columns will be different for pandas.

wnojopra commented 5 years ago

Correct. I guess the issue is with BigQuery treating column names as case insensitive, while most downstream consumers (like data-explorer-indexer and most Python libs) are case sensitive. So if I did a query in BigQuery something like "select * from foo where PARTICIPANT_ID='abc'" , I'd guess the same query would give different results if done in pandas.

I think for us we'd have to decide which side to be consistent with. I could see benefits for both sides. 1) We expect the input tables to have consistent capitalization. 2) Our indexer behaves the same way as BigQuery does.

If we go with 1) , at least indexing will error out, so that's actually kind of nice.

In any case, this probably isn't too large of a priority, but I did want to raise awareness for it.

melissachang commented 5 years ago

In general I try to avoid fixes at the Data Explorer level, because Data Explorer is just one of many tools. It would be confusing for researchers if something worked with DE but not other tools (eg notebooks). I prefer to fix underlying data. So closing this issue for now. We can reopen if we encounter this with other datasets, or if it comes up again for AMP PD.