kirilklein opened this issue 1 week ago (status: Open)
Tokenization now takes around 3 h (1614ba1136ab2ebc002d8308176bd05b4cb7784f) on Azure, which is still slow compared to the ~10 min the old version took (features in memory, looping over patients). The data has grown since then, but not by that much. The current data (med and diag only) takes 44 GiB of RAM, so we need to handle OOM; that means optimizing the Dask code rather than going back to in-memory processing.
Biggest issues:
- Vocabulary creation and writing are very slow, although some operations are probably being pushed into the write step.
- See whether we can avoid recomputation where possible (see the sketch below).
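A minimal sketch of avoiding recomputation, not the actual pipeline code: the `assign` step below is just a stand-in for the tokenization work, but the pattern of persisting the intermediate frame before both the vocabulary pass and the write pass should carry over.

```python
import pandas as pd
import dask.dataframe as dd

# Toy stand-in for the features frame
pdf = pd.DataFrame({"PID": [1, 1, 2, 2], "concept": ["D1", "M1", "D2", "M1"]})
features = dd.from_pandas(pdf, npartitions=2)

# Stand-in for the expensive tokenization step
tokenized = features.assign(token=features["concept"].astype("category"))

# Persist once so the graph is not recomputed by every consumer
tokenized = tokenized.persist()

vocab = tokenized["concept"].unique().compute()   # first consumer: vocabulary
tokenized.to_parquet("tokenized.parquet")         # second consumer: writing, no recompute
```

Without the `persist()`, both the vocabulary computation and `to_parquet` would re-execute the full tokenization graph, which may be part of why writing looks so slow.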
Most of the computation time in `create_data` is now spent on tokenization. There are obvious inefficiencies that need to be addressed.
For example, in the `_add_token` function, which is applied repeatedly, we sort by PID and abspos twice. We should sort once outside the function if possible. Also, doing `set_index("PID")` beforehand, so that PIDs are correctly partitioned, will probably help, especially in functions like `_get_first_event` and `get_segment_change`.
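A minimal sketch of that idea, assuming columns named `PID` and `abspos` (the toy data and the first-event computation are illustrative, not the real helpers): index by PID once, sort once, and let the per-patient logic run per partition.

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    "PID": [2, 1, 1, 2],
    "abspos": [5.0, 3.0, 1.0, 4.0],
    "concept": ["M1", "D2", "D1", "D3"],
})
df = dd.from_pandas(pdf, npartitions=2)

# Shuffle once so all rows of a PID end up in the same partition ...
df = df.set_index("PID")
# ... and sort once within partitions, instead of re-sorting inside each helper.
df = df.map_partitions(lambda p: p.sort_values("abspos"))

# Per-patient operations (e.g. first event per PID) can now run per partition
# without another shuffle or sort.
first_event = df.map_partitions(
    lambda p: p.groupby(level="PID")["abspos"].min()
).compute()
```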
Furthermore, the tokenization itself can be improved. For instance, we can get rid of the lambda function in `tokenize_frozen`. Similarly, in the dynamic case we can precompute the mapping and then do a single `.map` (see the sketch below).
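A minimal sketch of the vectorized lookup, with an assumed toy vocabulary and column names: a precomputed dict applied with one `.map` replaces the per-element lambda, with unknown codes falling back to `[UNK]`. In the dynamic case the dict would first be built from the observed unique codes and then applied the same way.

```python
import pandas as pd
import dask.dataframe as dd

# Assumed frozen vocabulary for illustration
vocabulary = {"[UNK]": 0, "D1": 1, "D2": 2, "M1": 3}

pdf = pd.DataFrame({"PID": [1, 1, 2], "concept": ["D1", "M1", "D9"]})
df = dd.from_pandas(pdf, npartitions=2)

# Slow pattern: per-element Python lambda
# tokens = df["concept"].apply(lambda c: vocabulary.get(c, vocabulary["[UNK]"]))

# Faster: one vectorized .map with the precomputed mapping,
# unknown codes become NaN and are filled with the [UNK] id.
tokens = (
    df["concept"]
    .map(vocabulary, meta=("concept", "float64"))
    .fillna(vocabulary["[UNK]"])
    .astype(int)
)
print(tokens.compute())
```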