dalgu90 / icd-coding-benchmark

Automatic ICD coding benchmark based on the MIMIC dataset
MIT License

Add Logging Statements #18

Closed abheesht17 closed 2 years ago

abheesht17 commented 2 years ago

Resolves #14. I have added logging statements to all preprocessing-related files. This can serve as an example for adding logging statements to other scripts.

Details:

A function, get_logger(), has been defined in src/utils/text_logger.py. To use the logger in a file, paste the following snippet at the top:

from src.utils.text_logger import get_logger
logger = get_logger(__name__)

# to log a statement (the method name is lowercase: logger.info, not logger.INFO)
logger.info("Hello, World!")

There are two handlers: a stream handler that prints to the console (level: INFO) and a file handler that writes to a log file (level: DEBUG). The log file is saved in the logs directory and is named with the current timestamp, for example, logs/1646447226.938508.log. For inner modules, use logger.debug(...), and for outer modules, use logger.info(...).
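
The two-handler setup described above could look roughly like the sketch below. It is not the code in src/utils/text_logger.py: the `log_dir` parameter, the `_log_path` helper and the handler variable names are assumptions made for illustration, and the format string is only inferred from the output shown in this PR.

```python
import logging
import os
import sys
import time

# "\u2014" is the long dash that separates the fields in the output shown in this PR.
LOG_FORMAT = "%(asctime)s \u2014 %(name)s \u2014 %(levelname)s \u2014 %(message)s"

_log_paths = {}  # log_dir -> file path, so every logger in one run shares one file


def _log_path(log_dir):
    """Pick a timestamped file name once per directory (hypothetical helper)."""
    if log_dir not in _log_paths:
        os.makedirs(log_dir, exist_ok=True)
        _log_paths[log_dir] = os.path.join(log_dir, f"{time.time()}.log")
    return _log_paths[log_dir]


def get_logger(name, log_dir="logs"):
    """Return a logger with a console handler (INFO) and a file handler (DEBUG)."""
    logger = logging.getLogger(name)
    if logger.handlers:  # already configured: do not attach duplicate handlers
        return logger

    logger.setLevel(logging.DEBUG)  # the handlers decide what is kept
    logger.propagate = False  # do not emit a second time through the root logger
    formatter = logging.Formatter(LOG_FORMAT)

    console_handler = logging.StreamHandler(sys.stdout)
    console_handler.setLevel(logging.INFO)
    console_handler.setFormatter(formatter)

    file_handler = logging.FileHandler(_log_path(log_dir), encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(formatter)

    logger.addHandler(console_handler)
    logger.addHandler(file_handler)
    return logger
```

With this arrangement a `logger.debug(...)` call reaches only the file, while a `logger.info(...)` call reaches both the console and the file, which matches the two outputs shown below.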

Terminal output:

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package rslp to /root/nltk_data...
[nltk_data]   Package rslp is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
2022-03-05 02:27:34,389 — src.modules.preprocessing_pipelines — INFO — Loading code CSV files: datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz, datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz
2022-03-05 02:27:35,063 — src.modules.preprocessing_pipelines — INFO — Preprocessing code CSV files: datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz, datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz
2022-03-05 02:27:38,967 — src.modules.preprocessing_pipelines — INFO — Loading noteevents CSV file: datasets/mimic_iii/1.4/NOTEEVENTS.csv.gz
tcmalloc: large alloc 1073741824 bytes == 0x562716a50000 @  0x7fab1b9f22a4 0x7fab09c169a5 0x7fab09c17cc1 0x7fab09c1969e 0x7fab09bea50c 0x7fab09bf7399 0x7fab09bdf97a 0x5626ccf501cd 0x5626cd042b3d 0x5626ccfc4458 0x5626ccfbf66e 0x5626ccf51aba 0x5626ccfc0108 0x5626ccfbf02f 0x5626ccf51aba 0x5626ccfc0108 0x5626ccf519da 0x5626ccfbfeae 0x5626ccfbf02f 0x5626cce90e2b 0x5626ccfc1633 0x5626ccfbf66e 0x5626ccf51aba 0x5626ccfc0cd4 0x5626ccfbf02f 0x5626ccf51aba 0x5626ccfc0cd4 0x5626ccf519da 0x5626ccfc0108 0x5626ccf519da 0x5626ccfc0108
2022-03-05 02:28:42,847 — src.modules.preprocessing_pipelines — INFO — Removing rows from code dataframe whose ICD-9 codes are not present in clinical notes
2022-03-05 02:28:43,017 — src.modules.preprocessing_pipelines — INFO — Combining code and notes dataframes
2022-03-05 02:28:45,510 — src.modules.preprocessing_pipelines — INFO — Preprocessing clinical notes
100% 52726/52726 [04:30<00:00, 195.07it/s]
2022-03-05 02:33:26,729 — src.modules.preprocessing_pipelines — INFO — Splitting data into train-test-val
2022-03-05 02:33:26,771 — src.modules.preprocessing_pipelines — INFO — Tokenizing text data
2022-03-05 02:34:01,134 — src.modules.preprocessing_pipelines — INFO — Training embedding model

logs/<timestamp>.log output:

2022-03-05 02:27:34,380 — src.modules.preprocessors — DEBUG — Initialising Clinical Note Processor with the following config: {'to_lower': {'perform': True}, 'remove_punctuation': {'perform': True}, 'remove_numeric': {'perform': True}, 'remove_stopwords': {'perform': True, 'params': {'stopwords_file_path': None, 'remove_common_medical_terms': True}}, 'stem_or_lemmatize': {'perform': True, 'params': {'stemmer_name': 'nltk.WordNetLemmatizer'}}, 'truncate': {'perform': True, 'params': {'max_length': 2000}}}
2022-03-05 02:27:34,382 — src.modules.preprocessors — DEBUG — Initialising Code Processor with the following config: {'top_k': 50, 'code_type': 'both', 'add_period_in_correct_pos': {'perform': True}}
2022-03-05 02:27:34,382 — src.utils.code_based_filtering — DEBUG — Finding top-k codes with the following args: k = 50, label_save_path = datasets/mimic_iii/labels.json
2022-03-05 02:27:34,382 — src.modules.dataset_splitters — DEBUG — Using CAML official split to split data into train-test-val with the following config: {'train_hadm_ids_path': 'datasets/mimic_iii/train_split.json', 'val_hadm_ids_path': 'datasets/mimic_iii/val_split.json', 'test_hadm_ids_path': 'datasets/mimic_iii/test_split.json'}
2022-03-05 02:27:34,382 — src.utils.file_loaders — DEBUG — Loading JSON file datasets/mimic_iii/train_split.json as dictionary
2022-03-05 02:27:34,388 — src.utils.file_loaders — DEBUG — Loading JSON file datasets/mimic_iii/val_split.json as dictionary
2022-03-05 02:27:34,388 — src.utils.file_loaders — DEBUG — Loading JSON file datasets/mimic_iii/test_split.json as dictionary
2022-03-05 02:27:34,389 — src.modules.tokenizers — DEBUG — Using Space Tokenizer to tokenize the data with the following config: None
2022-03-05 02:27:34,389 — src.modules.embeddings — DEBUG — Using Word2Vec to train embeddings on clinical notes with the following config: {'embedding_dir': 'datasets/mimic_iii/word2vec/', 'unk_token': '<unk>', 'pad_token': '<pad>', 'word2vec_params': {'vector_size': 100, 'min_count': 3, 'epochs': 5}}
2022-03-05 02:27:34,389 — src.modules.preprocessing_pipelines — INFO — Loading code CSV files: datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz, datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz
2022-03-05 02:27:34,389 — src.utils.file_loaders — DEBUG — Loading file datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz as dataframe
2022-03-05 02:27:34,867 — src.utils.file_loaders — DEBUG — Loading file datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz as dataframe
2022-03-05 02:27:35,063 — src.modules.preprocessing_pipelines — INFO — Preprocessing code CSV files: datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz, datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz
2022-03-05 02:27:38,967 — src.modules.preprocessing_pipelines — INFO — Loading noteevents CSV file: datasets/mimic_iii/1.4/NOTEEVENTS.csv.gz
2022-03-05 02:27:38,967 — src.utils.file_loaders — DEBUG — Loading file datasets/mimic_iii/1.4/NOTEEVENTS.csv.gz as dataframe
2022-03-05 02:28:42,847 — src.modules.preprocessing_pipelines — INFO — Removing rows from code dataframe whose ICD-9 codes are not present in clinical notes
2022-03-05 02:28:43,017 — src.modules.preprocessing_pipelines — INFO — Combining code and notes dataframes
2022-03-05 02:28:45,510 — src.modules.preprocessing_pipelines — INFO — Preprocessing clinical notes
2022-03-05 02:33:21,765 — src.utils.code_based_filtering — DEBUG — top-k codes: ['401.9', '38.93', '428.0', '427.31', '414.01', '96.04', '96.6', '584.9', '250.00', '272.4', '96.71', '518.81', '99.04', '39.61', '599.0', '530.81', '96.72', '272.0', '285.9', '88.56', '244.9', '486', '285.1', '38.91', '36.15', '276.2', '496', '99.15', '995.92', 'V58.61', '507.0', '038.9', '585.9', '403.90', '311', '88.72', '305.1', '412', '37.22', '39.95', '287.5', '410.71', '276.1', 'V45.81', '424.0', 'V15.82', '511.9', '93.90', 'V45.82', '37.23']
2022-03-05 02:33:21,766 — src.utils.file_loaders — DEBUG — Saving dictionary as JSON file datasets/mimic_iii/labels.json
2022-03-05 02:33:26,729 — src.modules.preprocessing_pipelines — INFO — Splitting data into train-test-val
2022-03-05 02:33:26,771 — src.modules.preprocessing_pipelines — INFO — Tokenizing text data
2022-03-05 02:33:32,023 — src.utils.file_loaders — DEBUG — Saving dictionary as JSON file train.json
2022-03-05 02:33:57,644 — src.utils.file_loaders — DEBUG — Saving dictionary as JSON file val.json
2022-03-05 02:33:58,740 — src.utils.file_loaders — DEBUG — Saving dictionary as JSON file test.json
2022-03-05 02:34:01,134 — src.modules.preprocessing_pipelines — INFO — Training embedding model
2022-03-05 02:34:01,134 — src.modules.embeddings — DEBUG — Training Word2Vec on clinical notes
dalgu90 commented 2 years ago

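Sorry for the late review. I've done some research on Python logging, and I think we can improve the current logging.
We don't have to create a logger object in every module. Instead, we can create one logger instance in a helper module:

  • We need to import the logger instance itself, not a helper function that wraps the logger (for example, I called logger.info() in main.py instead of using log_info()).
  • The handler's format should use %(module)s so that the name of the module where the logger is called is logged correctly.

Also, I think we should rename the StreamHandler variable from file_hander to file_handler and remove the \n at the beginning of the StreamHandler's format.
This is a quick suggestion, so I might be wrong somewhere; if so, please correct me.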
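
A minimal sketch of the single shared logger suggested in this comment is given below. The logger name `icd_benchmark_sketch` and the `build_shared_logger` function are hypothetical; the format string is inferred from the colon-separated output pasted later in this thread, and the import path in the closing comment assumes the helper lives in src/utils/text_logger.py.

```python
import io
import logging

# Format inferred from lines such as "...:INFO:preprocessors: Initialising ..."
SHARED_FORMAT = "%(asctime)s:%(levelname)s:%(module)s: %(message)s"


def build_shared_logger(stream=None):
    """Configure and return the one logger instance shared by every module."""
    logger = logging.getLogger("icd_benchmark_sketch")  # hypothetical name
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.handlers.clear()  # calling this twice must not duplicate handlers

    handler = logging.StreamHandler(stream)  # stream=None means sys.stderr
    handler.setFormatter(logging.Formatter(SHARED_FORMAT))
    logger.addHandler(handler)
    return logger


# Created once, at import time of the helper module.
logger = build_shared_logger()

# Other modules would then import the instance instead of building their own:
#     from src.utils.text_logger import logger
#     logger.info("Loading code CSV files")

# Small demonstration that writes to an in-memory buffer instead of the console.
buffer = io.StringIO()
demo_logger = build_shared_logger(buffer)
demo_logger.info("hello")
print(buffer.getvalue(), end="")
```

Because `%(module)s` is taken from the file in which the logging call is made, one shared instance can still report which module produced each line.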

image
abheesht17 commented 2 years ago

Sorry for the late review. I've done some research on Python logging, and I think we can improve the current logging. We don't have to create a logger object in every module. Instead, we can create one logger instance in a helper module:

  • We need to import the logger instance itself, not a helper function that wraps the logger (for example, I called logger.info() in main.py instead of using log_info()).
  • The handler's format should use %(module)s so that the name of the module where the logger is called is logged correctly.

Also, I think we should rename the StreamHandler variable from file_hander to file_handler and remove the \n at the beginning of the StreamHandler's format. This is a quick suggestion, so I might be wrong somewhere; if so, please correct me.

image

Thank you, @dalgu90, for a thorough analysis!

This is what I'd initially done and it hadn't worked for me:

In log_module.py:

def define_logger(args):
    ...
    return logger

In preprocessing_pipelines.py:

logger = log_module.define_logger(args)

This didn't seem to work. Seems like I should return a global logger, instead of instantiating a logger for every file. Thank you! I'll make the changes.

abheesht17 commented 2 years ago

@dalgu90, I tried what you'd suggested above in commit https://github.com/dalgu90/icd-coding-benchmark/pull/18/commits/153e7cb87e45d59e05057f258aefccef3bfe064f. It didn't work: every line is attributed to src.utils.text_logger instead of the calling module. Have I done something wrong?

Terminal output:

2022-02-23 01:50:52,019:INFO:src.utils.text_logger: Initialising Clinical Note Processor with the following config: {'to_lower': {'perform': True}, 'remove_punctuation': {'perform': True}, 'remove_numeric': {'perform': True}, 'remove_stopwords': {'perform': True, 'params': {'stopwords_file_path': None, 'remove_common_medical_terms': True}}, 'stem_or_lemmatize': {'perform': True, 'params': {'stemmer_name': 'nltk.WordNetLemmatizer'}}, 'truncate': {'perform': True, 'params': {'max_length': 2000}}}

2022-02-23 01:50:52,020:INFO:src.utils.text_logger: Initialising Code Processor with the following config: {'top_k': 50, 'code_type': 'both', 'add_period_in_correct_pos': {'perform': True}}

2022-02-23 01:50:52,021:INFO:src.utils.text_logger: Finding top-k codes with the following args: k = 50, label_save_path = datasets/mimic_iii/labels.json

2022-02-23 01:50:52,021:INFO:src.utils.text_logger: Using CAML official split to split data into train-test-val with the following config: {'train_hadm_ids_path': 'datasets/mimic_iii/train_split.json', 'val_hadm_ids_path': 'datasets/mimic_iii/val_split.json', 'test_hadm_ids_path': 'datasets/mimic_iii/test_split.json'}

2022-02-23 01:50:52,021:INFO:src.utils.text_logger: Loading JSON file datasets/mimic_iii/train_split.json as dictionary

2022-02-23 01:50:52,036:INFO:src.utils.text_logger: Loading JSON file datasets/mimic_iii/val_split.json as dictionary

2022-02-23 01:50:52,039:INFO:src.utils.text_logger: Loading JSON file datasets/mimic_iii/test_split.json as dictionary

2022-02-23 01:50:52,042:INFO:src.utils.text_logger: Using Space Tokenizer to tokenize the data with the following config: None

2022-02-23 01:50:52,042:INFO:src.utils.text_logger: Using Word2Vec to train embeddings on clinical notes with the following config: {'embedding_dir': 'datasets/mimic_iii/word2vec/', 'unk_token': '<unk>', 'pad_token': '<pad>', 'word2vec_params': {'vector_size': 100, 'min_count': 3, 'epochs': 5}}

2022-02-23 01:50:52,042:INFO:src.utils.text_logger: Loading code CSV files: datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz, datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz

2022-02-23 01:50:52,042:INFO:src.utils.text_logger: Loading file datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz as dataframe

2022-02-23 01:50:52,594:INFO:src.utils.text_logger: Loading file datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz as dataframe

2022-02-23 01:50:52,801:INFO:src.utils.text_logger: Preprocessing code CSV files: datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz, datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz

2022-02-23 01:50:56,328:INFO:src.utils.text_logger: Loading noteevents CSV file: datasets/mimic_iii/1.4/NOTEEVENTS.csv.gz

2022-02-23 01:50:56,330:INFO:src.utils.text_logger: Loading file datasets/mimic_iii/1.4/NOTEEVENTS.csv.gz as dataframe

logs/datasets.log:

2022-02-23 01:50:52,019:INFO:src.utils.text_logger: Initialising Clinical Note Processor with the following config: {'to_lower': {'perform': True}, 'remove_punctuation': {'perform': True}, 'remove_numeric': {'perform': True}, 'remove_stopwords': {'perform': True, 'params': {'stopwords_file_path': None, 'remove_common_medical_terms': True}}, 'stem_or_lemmatize': {'perform': True, 'params': {'stemmer_name': 'nltk.WordNetLemmatizer'}}, 'truncate': {'perform': True, 'params': {'max_length': 2000}}}
2022-02-23 01:50:52,020:INFO:src.utils.text_logger: Initialising Code Processor with the following config: {'top_k': 50, 'code_type': 'both', 'add_period_in_correct_pos': {'perform': True}}
2022-02-23 01:50:52,021:INFO:src.utils.text_logger: Finding top-k codes with the following args: k = 50, label_save_path = datasets/mimic_iii/labels.json
2022-02-23 01:50:52,021:INFO:src.utils.text_logger: Using CAML official split to split data into train-test-val with the following config: {'train_hadm_ids_path': 'datasets/mimic_iii/train_split.json', 'val_hadm_ids_path': 'datasets/mimic_iii/val_split.json', 'test_hadm_ids_path': 'datasets/mimic_iii/test_split.json'}
2022-02-23 01:50:52,021:INFO:src.utils.text_logger: Loading JSON file datasets/mimic_iii/train_split.json as dictionary
2022-02-23 01:50:52,036:INFO:src.utils.text_logger: Loading JSON file datasets/mimic_iii/val_split.json as dictionary
2022-02-23 01:50:52,039:INFO:src.utils.text_logger: Loading JSON file datasets/mimic_iii/test_split.json as dictionary
2022-02-23 01:50:52,042:INFO:src.utils.text_logger: Using Space Tokenizer to tokenize the data with the following config: None
2022-02-23 01:50:52,042:INFO:src.utils.text_logger: Using Word2Vec to train embeddings on clinical notes with the following config: {'embedding_dir': 'datasets/mimic_iii/word2vec/', 'unk_token': '<unk>', 'pad_token': '<pad>', 'word2vec_params': {'vector_size': 100, 'min_count': 3, 'epochs': 5}}
2022-02-23 01:50:52,042:INFO:src.utils.text_logger: Loading code CSV files: datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz, datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz
2022-02-23 01:50:52,042:INFO:src.utils.text_logger: Loading file datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz as dataframe
2022-02-23 01:50:52,594:INFO:src.utils.text_logger: Loading file datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz as dataframe
2022-02-23 01:50:52,801:INFO:src.utils.text_logger: Preprocessing code CSV files: datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz, datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz
2022-02-23 01:50:56,328:INFO:src.utils.text_logger: Loading noteevents CSV file: datasets/mimic_iii/1.4/NOTEEVENTS.csv.gz
2022-02-23 01:50:56,330:INFO:src.utils.text_logger: Loading file datasets/mimic_iii/1.4/NOTEEVENTS.csv.gz as dataframe
dalgu90 commented 2 years ago

Thanks for updating. I think we can see the name of the caller if we use %(module)s instead of %(name)s.
The format attributes we can use are listed here (link).
We could also write a custom formatter if we want the full module path (like src.modules.preprocessors instead of just preprocessors), in which case I'd be happy to contribute to your branch.

Terminal output:

2022-02-22 22:33:50,646:INFO:preprocessors: Initialising Clinical Note Processor with the following config: {'to_lower': {'perform': True}, 'remove_punctuation': {'perform': True}, 'remove_numeric': {'perform': True}, 'remove_stopwords': {'perform': True, 'params': {'stopwords_file_path': None, 'remove_common_medical_terms': True}}, 'stem_or_lemmatize': {'perform': True, 'params': {'stemmer_name': 'nltk.WordNetLemmatizer'}}, 'truncate': {'perform': True, 'params': {'max_length': 2000}}}

2022-02-22 22:33:50,647:INFO:preprocessors: Initialising Code Processor with the following config: {'top_k': 50, 'code_type': 'both', 'add_period_in_correct_pos': {'perform': True}}

2022-02-22 22:33:50,647:INFO:code_based_filtering: Finding top-k codes with the following args: k = 50, label_save_path = datasets/mimic_iii/labels.json

2022-02-22 22:33:50,647:INFO:dataset_splitters: Using CAML official split to split data into train-test-val with the following config: {'train_hadm_ids_path': 'datasets/mimic_iii/train_split.json', 'val_hadm_ids_path': 'datasets/mimic_iii/val_split.json', 'test_hadm_ids_path': 'datasets/mimic_iii/test_split.json'}

2022-02-22 22:33:50,647:INFO:file_loaders: Loading JSON file datasets/mimic_iii/train_split.json as dictionary

2022-02-22 22:33:50,652:INFO:file_loaders: Loading JSON file datasets/mimic_iii/val_split.json as dictionary

2022-02-22 22:33:50,652:INFO:file_loaders: Loading JSON file datasets/mimic_iii/test_split.json as dictionary

2022-02-22 22:33:50,652:INFO:tokenizers: Using Space Tokenizer to tokenize the data with the following config: None

2022-02-22 22:33:50,652:INFO:embeddings: Using Word2Vec to train embeddings on clinical notes with the following config: {'embedding_dir': 'datasets/mimic_iii/word2vec/', 'unk_token': '<unk>', 'pad_token': '<pad>', 'word2vec_params': {'vector_size': 100, 'min_count': 3, 'epochs': 5}}

2022-02-22 22:33:50,652:INFO:preprocessing_pipelines: Loading code CSV files: datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz, datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz

2022-02-22 22:33:50,652:INFO:file_loaders: Loading file datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz as dataframe

2022-02-22 22:33:51,560:INFO:file_loaders: Loading file datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz as dataframe

2022-02-22 22:33:51,914:INFO:preprocessing_pipelines: Preprocessing code CSV files: datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz, datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz

logs/dataset.log:

2022-02-22 22:33:50,646:INFO:preprocessors: Initialising Clinical Note Processor with the following config: {'to_lower': {'perform': True}, 'remove_punctuation': {'perform': True}, 'remove_numeric': {'perform': True}, 'remove_stopwords': {'perform': True, 'params': {'stopwords_file_path': None, 'remove_common_medical_terms': True}}, 'stem_or_lemmatize': {'perform': True, 'params': {'stemmer_name': 'nltk.WordNetLemmatizer'}}, 'truncate': {'perform': True, 'params': {'max_length': 2000}}}
2022-02-22 22:33:50,647:INFO:preprocessors: Initialising Code Processor with the following config: {'top_k': 50, 'code_type': 'both', 'add_period_in_correct_pos': {'perform': True}}
2022-02-22 22:33:50,647:INFO:code_based_filtering: Finding top-k codes with the following args: k = 50, label_save_path = datasets/mimic_iii/labels.json
2022-02-22 22:33:50,647:INFO:dataset_splitters: Using CAML official split to split data into train-test-val with the following config: {'train_hadm_ids_path': 'datasets/mimic_iii/train_split.json', 'val_hadm_ids_path': 'datasets/mimic_iii/val_split.json', 'test_hadm_ids_path': 'datasets/mimic_iii/test_split.json'}
2022-02-22 22:33:50,647:INFO:file_loaders: Loading JSON file datasets/mimic_iii/train_split.json as dictionary
2022-02-22 22:33:50,652:INFO:file_loaders: Loading JSON file datasets/mimic_iii/val_split.json as dictionary
2022-02-22 22:33:50,652:INFO:file_loaders: Loading JSON file datasets/mimic_iii/test_split.json as dictionary
2022-02-22 22:33:50,652:INFO:tokenizers: Using Space Tokenizer to tokenize the data with the following config: None
2022-02-22 22:33:50,652:INFO:embeddings: Using Word2Vec to train embeddings on clinical notes with the following config: {'embedding_dir': 'datasets/mimic_iii/word2vec/', 'unk_token': '<unk>', 'pad_token': '<pad>', 'word2vec_params': {'vector_size': 100, 'min_count': 3, 'epochs': 5}}
2022-02-22 22:33:50,652:INFO:preprocessing_pipelines: Loading code CSV files: datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz, datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz
2022-02-22 22:33:50,652:INFO:file_loaders: Loading file datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz as dataframe
2022-02-22 22:33:51,560:INFO:file_loaders: Loading file datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz as dataframe
2022-02-22 22:33:51,914:INFO:preprocessing_pipelines: Preprocessing code CSV files: datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz, datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz
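
The difference between %(name)s and %(module)s, and the custom formatter for a full module path mentioned in this comment, can be sketched as follows. `FullModulePathFormatter`, its `root` argument and the `full_module` attribute are hypothetical names, and deriving the dotted path from `record.pathname` is just one possible approach. The record is built by hand so the example runs without the repository layout.

```python
import logging
import os


class FullModulePathFormatter(logging.Formatter):
    """Adds a `full_module` field: the caller's file path relative to `root`, dotted."""

    def __init__(self, fmt, root):
        super().__init__(fmt)
        self.root = os.path.abspath(root)

    def format(self, record):
        relative = os.path.relpath(os.path.abspath(record.pathname), self.root)
        record.full_module = os.path.splitext(relative)[0].replace(os.sep, ".")
        return super().format(record)


def format_with(fmt):
    """Format one hand-built record, as if logged from src/modules/preprocessors.py."""
    record = logging.LogRecord(
        name="src.utils.text_logger",  # name of the shared logger
        level=logging.INFO,
        pathname=os.path.join("repo", "src", "modules", "preprocessors.py"),
        lineno=1,
        msg="hello",
        args=None,
        exc_info=None,
    )
    return FullModulePathFormatter(fmt + ": %(message)s", root="repo").format(record)


print(format_with("%(name)s"))         # src.utils.text_logger: hello
print(format_with("%(module)s"))       # preprocessors: hello
print(format_with("%(full_module)s"))  # src.modules.preprocessors: hello
```

`%(name)s` is the name passed to `logging.getLogger`, so a shared logger always reports the helper module, whereas `%(module)s` comes from the file that made the call.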
abheesht17 commented 2 years ago

Ohh, I see. But there is another problem: logging statements meant for stdout are written to both stdout and the log file. The same goes for logging statements meant for the log file.
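
One possible reading of this problem is that every handler attached to a logger receives every record, and a handler level is only a minimum threshold. If strictly separate destinations are wanted, a filter per handler is one option, sketched below. `ExactLevelFilter` and `build_split_logger` are hypothetical names, and in-memory streams stand in for the console and the log file.

```python
import io
import logging


class ExactLevelFilter(logging.Filter):
    """Pass only records whose level equals `level` (not that level and above)."""

    def __init__(self, level):
        super().__init__()
        self.level = level

    def filter(self, record):
        return record.levelno == self.level


def build_split_logger(console_stream, file_stream, name="split_sketch"):
    """Route INFO records to one stream and DEBUG records to the other."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.handlers.clear()  # keep the sketch idempotent

    console_handler = logging.StreamHandler(console_stream)
    console_handler.addFilter(ExactLevelFilter(logging.INFO))

    # A StreamHandler stands in for logging.FileHandler to keep the sketch file-free.
    file_handler = logging.StreamHandler(file_stream)
    file_handler.addFilter(ExactLevelFilter(logging.DEBUG))

    logger.addHandler(console_handler)
    logger.addHandler(file_handler)
    return logger


console, log_file = io.StringIO(), io.StringIO()
logger = build_split_logger(console, log_file)
logger.info("for the console only")
logger.debug("for the log file only")
print(repr(console.getvalue()))   # 'for the console only\n'
print(repr(log_file.getvalue()))  # 'for the log file only\n'
```

Whether strict separation is desirable is a design choice: the setup described at the top of this PR keeps INFO records in the log file as well, which only needs handler levels and no filters.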

abheesht17 commented 2 years ago

Updates: