deepmagd / shenaniGAN

Work in progress
MIT License
1 stars 0 forks source link

One-hot embedding of tabular x-ray diagnosis data #7

Closed alecokas closed 4 years ago

alecokas commented 4 years ago

Comments required on two points please:

  1. I wasn't really sure where these functions should sit, can you think of anything better than utils/data_helpers.py? at the moment I run then like this:

    > from utils.data_helpers import encode_tabular_data, build_encoding_map, load_tabular_data, concat_columns_into_vector
    > tab_xray_df = load_tabular_data('data/CheXpert-v1.0-small/train.csv')
    > encoded_df = encode_tabular_data(tab_xray_df)
    >  embedding_dict = concat_columns_into_vector(encoded_df)
  2. Can you see any problem with the way that I have gone about assigning different categories to numerical integers? I'm worried that by doing this, I've implicitly made certain categories more related to each other than others. For instance, if I have three categories ['X positive, X negative, no record], and I assign these to the categoricals [0, 1, 2]. Is building in that X positive is less similar to no record than X positive is to X negative an issue? Obviously we will train some embedding on top of this, but maybe the initialization is dangerous?

Probably don't merge yet @Devin-Taylor

Devin-Taylor commented 4 years ago
  1. they all seem quite specific to the xray dataset, maybe we should split dataloaders into dataloaders/birds, dataloaders/flowers, dataloaders/xrays and then you can have loaders.py and helper_fxns.py inside each? But might be overkill. Alternatively, make them a bit more generic if possible and then leave them in this file. But in all honesty, how they currently are is also fine!
Devin-Taylor commented 4 years ago
  1. can you not one-hot encode them instead?
alecokas commented 4 years ago
  1. Yeah I like the idea of splitting them up, but let's maybe see how things progress a bit before committing to that much SE
  2. I was thinking that, and then just concat all the one-hot encodings together?
Devin-Taylor commented 4 years ago
  1. haha yeah lots of effort, maybe move that to a separate issue and backlog it for now?
  2. yeah, I think that is pretty reasonable
alecokas commented 4 years ago

This can now be merged if you are happy with it @Devin-Taylor, using one-hot encoding for categorical variables. Keep the branch open as I'll actually work on the embedding network here