btheodorou99 / HALO_Inpatient


total_vocab_size is much bigger than default #6

Closed BreezeHavana closed 11 months ago

BreezeHavana commented 11 months ago

Thanks for the detailed answer about preprocessing, it really helped a lot. I have a few problems with training. I followed the steps in continuous_variables/readme.md and set code_vocab_size=44529 and label_vocab_size=10205, as printed by genDatasetContinuous.py. The lab_vocab_size and continuous_vocab_size are the same as the defaults. After that I set total_vocab_size=54989 as the sum of all the vocab sizes, which is much bigger than the default of 14487, and unsurprisingly my 4090 cannot handle it. So I am wondering: is there anything wrong with code_vocab_size and label_vocab_size? Thanks in advance.
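For reference, a quick arithmetic check of where 54989 comes from (the 255 below is simply the remainder once the two quoted sizes are subtracted, i.e. the default lab and continuous sizes combined):

```python
# Sanity check of the quoted totals; 255 is derived from the numbers in
# this thread, not taken from the repo's config.
total_vocab_size = 44529 + 10205 + 255  # = 54989, vs. the default 14487
```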

btheodorou99 commented 11 months ago

Yes, both of those sizes are much larger than expected and far too large for any standard GPU. code_vocab_size should be around 10k, and label_vocab_size should be less than 100 (specifically 25). Can you inspect the indexToCode and idToLabel .pkl files to see whether there are any obvious errors there that would help clarify things?
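Something like the following could be used for a quick inspection (a minimal sketch; the .pkl file names here are assumptions based on the variable names above):

```python
# Spot-check the sizes and a few entries of the two mapping files.
import pickle

with open('indexToCode.pkl', 'rb') as f:
    index_to_code = pickle.load(f)
with open('idToLabel.pkl', 'rb') as f:
    id_to_label = pickle.load(f)

print(len(index_to_code))  # expected: around 10k
print(len(id_to_label))    # expected: 25
```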

BreezeHavana commented 11 months ago

Hi, I debugged genDatasetContinuous.py. Take patient 82574 as an example: there seems to be a combination of diagnoses, procedures, and medications in lines 168-186. Before that, data['82574']['visits'] is a list with 4 lists and a float, like

[(['28529', '3051', 'V5869', '2449', '53081', '311', '6826', '19889', '1991', '2720', '49390', '4589', '28411', '7804', '4400', 'V4589', '1985'], ['9229'], ['51079-0524-0', '00409-6729-4', '51079-0456-0', '68094-0503-1', '00456-0662-0', '00074-4341-3', '00406-0552-2', '00173-0682-4', '00121-0431-0', '10939-0337-3', '00245-0041-1', '00143-9897-1', '00093-4356-3', '51079-0386-0', '00074-7068-1', '63323-0262-1', '00536-4077-1', '00054-3270-9', '00406-8330-2', '00008-0841-9', '63739-0354-0', '00074-9296-3', '51079-0542-0', '00904-2725-1', '00904-2244-1', '00173-0719-0', '00172-4382-0', '00000-0000-0', '00904-5165-1'], 56.15890410958904, [])]

The number of string values is 47 in total. After that combination, data['82574']['visits'] is still a list with 4 lists, but all the string values are gathered into the 1st list, like

['4589', '3051', '2449', '1991', '2720', 'V5869', '4400', '49390', '53081', '19889', '6826', '7804', 'V4589', '311', '1985', '28529', '28411', '9229', 'Senna', 'Sulfameth/Trimethoprim DS', 'Levothyroxine Sodium', 'Ondansetron', 'Milk of Magnesia', 'Simvastatin', 'Fluticasone Propionate NASAL', 'Docusate Sodium', 'Gabapentin', 'Prochlorperazine', 'Magnesium Oxide', 'Fluticasone Propionate 110mcg', 'Potassium Chloride (Powder)', 'Niacin SR', 'Heparin', 'Potassium Chloride', 'OxycoDONE (Immediate Release) ', 'Levothyroxine Sodium', 'Levothyroxine Sodium', '5% Dextrose', 'Ibuprofen Suspension', 'Magnesium Sulfate', 'Fluoxetine', 'Fish Oil (Omega 3)', 'Pantoprazole', 'Lorazepam', 'Morphine SR (MS Contin)', 'Albuterol Inhaler', 'Cephalexin']

And the number of string values is still 47. The combination gathers all string values into the 1st list, and indexToCode counts the length of that 1st list, so indexToCode became huge. I guess that might be the problem; if not, please let me know. Thanks!
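To illustrate the structure being described, a hypothetical sketch (not the repo's exact code) of how a vocabulary keyed off the merged first lists would come out:

```python
# Build a code vocabulary from the merged first list of each visit.
code_to_index = {}
for patient in data.values():
    for visit in patient['visits']:
        for code in visit[0]:  # after the combination, diagnoses, procedures,
            # and medications all sit in this first list
            if code not in code_to_index:
                code_to_index[code] = len(code_to_index)

print(len(code_to_index))  # 10205 here: one entry per unique code string
```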

BreezeHavana commented 11 months ago

The variable code_to_index is like

{'7018': 0, 'Sodium Chloride 3% Inhalation Soln': 1, '42654': 2, '7994': 3, 'Pioglitazone': 4, '8675': 5, '30012': 6, 'Acyclovir': 7, '43822': 8, 'Melphalan': 9, '80176': 10, 'Latuda': 11, 'E9479': 12, '95892': 13, '71907': 14, '5798': 15, '7921': 16, '7140': 17, '3249': 18, '45181': 19, '1734': 20, '9547': 21, '3539': 22, 'Oseltamivir': 23, '37182': 24, '4472': 25, '8794': 26,

and it does have 10205 elements in total.

BreezeHavana commented 11 months ago

Also, there are 44529 patients in the variable 'data', which is why code_vocab_size was printed as 44529.

btheodorou99 commented 11 months ago

My apologies, the code_vocab_size should be the size of the code_to_index variable (so 10205), and the label_vocab_size should be the size of idToLabel or the label vector (I believe 25). I will improve the print statements at the bottom of the genContinuousData file; please let me know if the printed values don't align with these amounts.
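For example, the corrected prints might look something like this (a sketch only; variable names follow this thread):

```python
# Report vocabulary sizes, not the patient count.
print(f"code_vocab_size:  {len(code_to_index)}")  # 10205 here
print(f"label_vocab_size: {len(idToLabel)}")      # expected 25
print(f"num_patients:     {len(data)}")           # 44529 -- not a vocab size
```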

BreezeHavana commented 11 months ago

Thanks! These 2 variables are alright now. One more thing: when I run train_model.py, it throws an out-of-bounds error at line 59. config.n_ctx is set to 150, while the max length of visits computed at line 58 is much larger, so when j+2 reaches 150 it raises the error. I assume that n_ctx should be larger than max_length+2? And are there any other constants in config.py I should change? Thanks!

And the max length of visits in train_ehr_dataset is 3907.

btheodorou99 commented 11 months ago

Yes, my suggestion would be to increase n_ctx and n_positions (they should be the same here, though in some settings they can differ, which is why we have both) to the largest value you can fit on your GPU. However, you almost certainly don't want 3907, as that is an outlier and most records are much shorter. So you probably want to add a line to truncate each generated patient record to that max value, either in genContinuousData or in the train/test files (depending on whether you want the flexibility to adjust later without rebuilding the dataset), as sketched below.
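A minimal truncation sketch, assuming data maps patient IDs to dicts with a 'visits' list and that two positions are reserved for special tokens (hence the j+2 bound in train_model.py); MAX_VISITS is a hypothetical name:

```python
# Cap each patient's record so that it fits within the model context.
MAX_VISITS = 148  # so MAX_VISITS + 2 <= n_ctx when n_ctx = 150

for patient in data.values():
    patient['visits'] = patient['visits'][:MAX_VISITS]
```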

Beyond that, no other constants should need to be changed, though you are welcome to play around with the model hyperparameters, especially n_embd and n_head. You can also adjust batch_size, which affects GPU memory usage, if needed.
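For reference, an illustrative set of such adjustments (attribute names follow this discussion; the values are example choices, not repo defaults):

```python
# Example config tweaks; tune to your GPU.
config.n_ctx = 300        # raise together with n_positions as memory allows
config.n_positions = 300  # kept equal to n_ctx in this setting
config.n_embd = 768       # model width; lowering it saves memory
config.n_head = 12        # must divide n_embd evenly
config.batch_size = 16    # the first knob to turn down on an OOM error
```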

BreezeHavana commented 11 months ago

Thanks! It worked.