Closed: abheesht17 closed this 2 years ago
Sorry for making a late review. I've done some research on Python logging, and now I think we can improve the current logging. We don't have to create a logger object in every module. Instead, we create one instance of the logger in a helper module:

- We need to import the logger instance, not a helper function that uses the logger (e.g. I called `logger.info()` in `main.py` instead of using `log_info()`).
- Also, the handler should have the `%(module)s` format, so that the name of the module where the logger is called is logged correctly.

Also, I think we can correct the name of the StreamHandler from `file_hander` to `file_handler`, and remove the `\n` at the beginning of the StreamHandler's format. This is my quick suggestion, so I might be wrong at some point, in which case please correct me.
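A minimal sketch of the suggested setup. The file path `src/utils/text_logger.py` follows the repo layout mentioned later in the thread; the exact format string and logger name are assumptions for illustration:

```python
# src/utils/text_logger.py -- hypothetical sketch of the shared-logger helper.
import logging
import os

# One shared logger instance that every module imports directly.
logger = logging.getLogger("icd_benchmark")
logger.setLevel(logging.DEBUG)

# %(module)s records the module where logger.info()/debug() is called,
# so a single shared logger still shows the caller's module name.
fmt = logging.Formatter("%(asctime)s:%(levelname)s:%(module)s: %(message)s")

stream_handler = logging.StreamHandler()  # no leading "\n" in the format
stream_handler.setFormatter(fmt)
logger.addHandler(stream_handler)

os.makedirs("logs", exist_ok=True)
file_handler = logging.FileHandler("logs/datasets.log")  # file_handler, not file_hander
file_handler.setFormatter(fmt)
logger.addHandler(file_handler)
```

Every other module would then do `from src.utils.text_logger import logger` and call `logger.info(...)` directly.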
Thank you, @dalgu90, for a thorough analysis!
This is what I'd initially done, and it hadn't worked for me:

In `log_module.py`:

```python
def define_logger(args):
    ...
    return logger
```

In `preprocessing_pipelines.py`:

```python
logger = log_module.define_logger(args)
```

This didn't seem to work. It seems like I should return a global logger instead of instantiating a logger for every file. Thank you! I'll make the changes.
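For context, one common reason a `define_logger(args)`-per-file approach misbehaves is that `logging.getLogger(name)` returns the same object on every call, so each call stacks another set of handlers onto it and messages get duplicated. A guard like the following (illustrative only, not the repo's actual code; the `args` parameter is kept but unused here) avoids that:

```python
import logging

def define_logger(args=None, name="icd_benchmark"):
    """Return the shared logger, attaching handlers only on the first call."""
    logger = logging.getLogger(name)
    if logger.handlers:  # already configured: reuse, don't re-add handlers
        return logger
    logger.setLevel(logging.DEBUG)
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s:%(levelname)s:%(module)s: %(message)s")
    )
    logger.addHandler(handler)
    return logger
```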
@dalgu90, I tried what you'd suggested above in commit https://github.com/dalgu90/icd-coding-benchmark/pull/18/commits/153e7cb87e45d59e05057f258aefccef3bfe064f. It didn't work. Have I done something wrong?
Terminal output:
2022-02-23 01:50:52,019:INFO:src.utils.text_logger: Initialising Clinical Note Processor with the following config: {'to_lower': {'perform': True}, 'remove_punctuation': {'perform': True}, 'remove_numeric': {'perform': True}, 'remove_stopwords': {'perform': True, 'params': {'stopwords_file_path': None, 'remove_common_medical_terms': True}}, 'stem_or_lemmatize': {'perform': True, 'params': {'stemmer_name': 'nltk.WordNetLemmatizer'}}, 'truncate': {'perform': True, 'params': {'max_length': 2000}}}
2022-02-23 01:50:52,020:INFO:src.utils.text_logger: Initialising Code Processor with the following config: {'top_k': 50, 'code_type': 'both', 'add_period_in_correct_pos': {'perform': True}}
2022-02-23 01:50:52,021:INFO:src.utils.text_logger: Finding top-k codes with the following args: k = 50, label_save_path = datasets/mimic_iii/labels.json
2022-02-23 01:50:52,021:INFO:src.utils.text_logger: Using CAML official split to split data into train-test-val with the following config: {'train_hadm_ids_path': 'datasets/mimic_iii/train_split.json', 'val_hadm_ids_path': 'datasets/mimic_iii/val_split.json', 'test_hadm_ids_path': 'datasets/mimic_iii/test_split.json'}
2022-02-23 01:50:52,021:INFO:src.utils.text_logger: Loading JSON file datasets/mimic_iii/train_split.json as dictionary
2022-02-23 01:50:52,036:INFO:src.utils.text_logger: Loading JSON file datasets/mimic_iii/val_split.json as dictionary
2022-02-23 01:50:52,039:INFO:src.utils.text_logger: Loading JSON file datasets/mimic_iii/test_split.json as dictionary
2022-02-23 01:50:52,042:INFO:src.utils.text_logger: Using Space Tokenizer to tokenize the data with the following config: None
2022-02-23 01:50:52,042:INFO:src.utils.text_logger: Using Word2Vec to train embeddings on clinical notes with the following config: {'embedding_dir': 'datasets/mimic_iii/word2vec/', 'unk_token': '<unk>', 'pad_token': '<pad>', 'word2vec_params': {'vector_size': 100, 'min_count': 3, 'epochs': 5}}
2022-02-23 01:50:52,042:INFO:src.utils.text_logger: Loading code CSV files: datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz, datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz
2022-02-23 01:50:52,042:INFO:src.utils.text_logger: Loading file datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz as dataframe
2022-02-23 01:50:52,594:INFO:src.utils.text_logger: Loading file datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz as dataframe
2022-02-23 01:50:52,801:INFO:src.utils.text_logger: Preprocessing code CSV files: datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz, datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz
2022-02-23 01:50:56,328:INFO:src.utils.text_logger: Loading noteevents CSV file: datasets/mimic_iii/1.4/NOTEEVENTS.csv.gz
2022-02-23 01:50:56,330:INFO:src.utils.text_logger: Loading file datasets/mimic_iii/1.4/NOTEEVENTS.csv.gz as dataframe
`logs/datasets.log`:
2022-02-23 01:50:52,019:INFO:src.utils.text_logger: Initialising Clinical Note Processor with the following config: {'to_lower': {'perform': True}, 'remove_punctuation': {'perform': True}, 'remove_numeric': {'perform': True}, 'remove_stopwords': {'perform': True, 'params': {'stopwords_file_path': None, 'remove_common_medical_terms': True}}, 'stem_or_lemmatize': {'perform': True, 'params': {'stemmer_name': 'nltk.WordNetLemmatizer'}}, 'truncate': {'perform': True, 'params': {'max_length': 2000}}}
2022-02-23 01:50:52,020:INFO:src.utils.text_logger: Initialising Code Processor with the following config: {'top_k': 50, 'code_type': 'both', 'add_period_in_correct_pos': {'perform': True}}
2022-02-23 01:50:52,021:INFO:src.utils.text_logger: Finding top-k codes with the following args: k = 50, label_save_path = datasets/mimic_iii/labels.json
2022-02-23 01:50:52,021:INFO:src.utils.text_logger: Using CAML official split to split data into train-test-val with the following config: {'train_hadm_ids_path': 'datasets/mimic_iii/train_split.json', 'val_hadm_ids_path': 'datasets/mimic_iii/val_split.json', 'test_hadm_ids_path': 'datasets/mimic_iii/test_split.json'}
2022-02-23 01:50:52,021:INFO:src.utils.text_logger: Loading JSON file datasets/mimic_iii/train_split.json as dictionary
2022-02-23 01:50:52,036:INFO:src.utils.text_logger: Loading JSON file datasets/mimic_iii/val_split.json as dictionary
2022-02-23 01:50:52,039:INFO:src.utils.text_logger: Loading JSON file datasets/mimic_iii/test_split.json as dictionary
2022-02-23 01:50:52,042:INFO:src.utils.text_logger: Using Space Tokenizer to tokenize the data with the following config: None
2022-02-23 01:50:52,042:INFO:src.utils.text_logger: Using Word2Vec to train embeddings on clinical notes with the following config: {'embedding_dir': 'datasets/mimic_iii/word2vec/', 'unk_token': '<unk>', 'pad_token': '<pad>', 'word2vec_params': {'vector_size': 100, 'min_count': 3, 'epochs': 5}}
2022-02-23 01:50:52,042:INFO:src.utils.text_logger: Loading code CSV files: datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz, datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz
2022-02-23 01:50:52,042:INFO:src.utils.text_logger: Loading file datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz as dataframe
2022-02-23 01:50:52,594:INFO:src.utils.text_logger: Loading file datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz as dataframe
2022-02-23 01:50:52,801:INFO:src.utils.text_logger: Preprocessing code CSV files: datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz, datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz
2022-02-23 01:50:56,328:INFO:src.utils.text_logger: Loading noteevents CSV file: datasets/mimic_iii/1.4/NOTEEVENTS.csv.gz
2022-02-23 01:50:56,330:INFO:src.utils.text_logger: Loading file datasets/mimic_iii/1.4/NOTEEVENTS.csv.gz as dataframe
Thanks for updating. I think we can see the name of the caller when we use `%(module)s` instead of `%(name)s`. Here are the possible formats that we can use: link.
Also, we can have a custom formatter if we want a full module path (like `src.modules.preprocessors` instead of just `preprocessors`), in which case I can willingly contribute to your branch.
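One way such a custom formatter could recover the full dotted path with a single shared logger is to rebuild it from the record's file path. This is a sketch of the idea, not the contributed code, and it assumes the project is run from the repo root:

```python
import logging
import os

class DottedPathFormatter(logging.Formatter):
    """Replace the bare module name with a dotted path derived from the
    record's file path, e.g. src/modules/preprocessors.py -> src.modules.preprocessors."""

    def format(self, record):
        rel = os.path.relpath(record.pathname, os.getcwd())
        if not rel.startswith(".."):  # only rewrite paths inside the project
            record.module = os.path.splitext(rel)[0].replace(os.sep, ".")
        return super().format(record)
```

It would be attached with, e.g., `handler.setFormatter(DottedPathFormatter("%(asctime)s:%(levelname)s:%(module)s: %(message)s"))`.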
Terminal output:
2022-02-22 22:33:50,646:INFO:preprocessors: Initialising Clinical Note Processor with the following config: {'to_lower': {'perform': True}, 'remove_punctuation': {'perform': True}, 'remove_numeric': {'perform': True}, 'remove_stopwords': {'perform': True, 'params': {'stopwords_file_path': None, 'remove_common_medical_terms': True}}, 'stem_or_lemmatize': {'perform': True, 'params': {'stemmer_name': 'nltk.WordNetLemmatizer'}}, 'truncate': {'perform': True, 'params': {'max_length': 2000}}}
2022-02-22 22:33:50,647:INFO:preprocessors: Initialising Code Processor with the following config: {'top_k': 50, 'code_type': 'both', 'add_period_in_correct_pos': {'perform': True}}
2022-02-22 22:33:50,647:INFO:code_based_filtering: Finding top-k codes with the following args: k = 50, label_save_path = datasets/mimic_iii/labels.json
2022-02-22 22:33:50,647:INFO:dataset_splitters: Using CAML official split to split data into train-test-val with the following config: {'train_hadm_ids_path': 'datasets/mimic_iii/train_split.json', 'val_hadm_ids_path': 'datasets/mimic_iii/val_split.json', 'test_hadm_ids_path': 'datasets/mimic_iii/test_split.json'}
2022-02-22 22:33:50,647:INFO:file_loaders: Loading JSON file datasets/mimic_iii/train_split.json as dictionary
2022-02-22 22:33:50,652:INFO:file_loaders: Loading JSON file datasets/mimic_iii/val_split.json as dictionary
2022-02-22 22:33:50,652:INFO:file_loaders: Loading JSON file datasets/mimic_iii/test_split.json as dictionary
2022-02-22 22:33:50,652:INFO:tokenizers: Using Space Tokenizer to tokenize the data with the following config: None
2022-02-22 22:33:50,652:INFO:embeddings: Using Word2Vec to train embeddings on clinical notes with the following config: {'embedding_dir': 'datasets/mimic_iii/word2vec/', 'unk_token': '<unk>', 'pad_token': '<pad>', 'word2vec_params': {'vector_size': 100, 'min_count': 3, 'epochs': 5}}
2022-02-22 22:33:50,652:INFO:preprocessing_pipelines: Loading code CSV files: datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz, datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz
2022-02-22 22:33:50,652:INFO:file_loaders: Loading file datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz as dataframe
2022-02-22 22:33:51,560:INFO:file_loaders: Loading file datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz as dataframe
2022-02-22 22:33:51,914:INFO:preprocessing_pipelines: Preprocessing code CSV files: datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz, datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz
`logs/dataset.log`:
2022-02-22 22:33:50,646:INFO:preprocessors: Initialising Clinical Note Processor with the following config: {'to_lower': {'perform': True}, 'remove_punctuation': {'perform': True}, 'remove_numeric': {'perform': True}, 'remove_stopwords': {'perform': True, 'params': {'stopwords_file_path': None, 'remove_common_medical_terms': True}}, 'stem_or_lemmatize': {'perform': True, 'params': {'stemmer_name': 'nltk.WordNetLemmatizer'}}, 'truncate': {'perform': True, 'params': {'max_length': 2000}}}
2022-02-22 22:33:50,647:INFO:preprocessors: Initialising Code Processor with the following config: {'top_k': 50, 'code_type': 'both', 'add_period_in_correct_pos': {'perform': True}}
2022-02-22 22:33:50,647:INFO:code_based_filtering: Finding top-k codes with the following args: k = 50, label_save_path = datasets/mimic_iii/labels.json
2022-02-22 22:33:50,647:INFO:dataset_splitters: Using CAML official split to split data into train-test-val with the following config: {'train_hadm_ids_path': 'datasets/mimic_iii/train_split.json', 'val_hadm_ids_path': 'datasets/mimic_iii/val_split.json', 'test_hadm_ids_path': 'datasets/mimic_iii/test_split.json'}
2022-02-22 22:33:50,647:INFO:file_loaders: Loading JSON file datasets/mimic_iii/train_split.json as dictionary
2022-02-22 22:33:50,652:INFO:file_loaders: Loading JSON file datasets/mimic_iii/val_split.json as dictionary
2022-02-22 22:33:50,652:INFO:file_loaders: Loading JSON file datasets/mimic_iii/test_split.json as dictionary
2022-02-22 22:33:50,652:INFO:tokenizers: Using Space Tokenizer to tokenize the data with the following config: None
2022-02-22 22:33:50,652:INFO:embeddings: Using Word2Vec to train embeddings on clinical notes with the following config: {'embedding_dir': 'datasets/mimic_iii/word2vec/', 'unk_token': '<unk>', 'pad_token': '<pad>', 'word2vec_params': {'vector_size': 100, 'min_count': 3, 'epochs': 5}}
2022-02-22 22:33:50,652:INFO:preprocessing_pipelines: Loading code CSV files: datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz, datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz
2022-02-22 22:33:50,652:INFO:file_loaders: Loading file datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz as dataframe
2022-02-22 22:33:51,560:INFO:file_loaders: Loading file datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz as dataframe
2022-02-22 22:33:51,914:INFO:preprocessing_pipelines: Preprocessing code CSV files: datasets/mimic_iii/1.4/DIAGNOSES_ICD.csv.gz, datasets/mimic_iii/1.4/PROCEDURES_ICD.csv.gz
Ohh, I see. But there is another problem: logging statements meant for stdout are logged to both stdout and the log file, and the same goes for logging statements meant only for the log file.
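If messages really need to go exclusively to one destination, per-handler `logging.Filter`s are one standard way to route them. This is only a sketch of the idea, not the repo's code, and the `dest` extra field is an invented convention:

```python
import logging

class DestFilter(logging.Filter):
    """Pass records whose 'dest' extra matches this handler's destination.
    Records without a 'dest' field pass every handler."""

    def __init__(self, dest):
        super().__init__()
        self.dest = dest

    def filter(self, record):
        return getattr(record, "dest", self.dest) == self.dest

logger = logging.getLogger("routed_example")
logger.setLevel(logging.DEBUG)

console = logging.StreamHandler()
console.addFilter(DestFilter("console"))
logger.addHandler(console)

# A record tagged for one destination then reaches only that handler:
#   logger.info("console only", extra={"dest": "console"})
#   logger.info("file only", extra={"dest": "file"})  # needs a handler with DestFilter("file")
```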
Updates:
Resolves #14. I have added logging statements for all preprocessing-related files. This can be considered an example of adding logging statements to other scripts.
Details:
A function, `get_logger()`, has been defined in `src/utils/text_logger.py`. To use the logger in a file, paste the following snippet at the top.

There are two handlers: a stream handler that prints to the console (level: `INFO`) and a file handler that writes to a log file (the log file is saved in the `logs` directory, named with the current timestamp, for example `logs/1646447226.938508.log`; level: `DEBUG`). For inner modules, use `logger.debug(...)`, and for outer modules, use `logger.info(...)`.

Terminal output:
`logs/<timestamp>.log` output:
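The two-handler setup described in the PR details can be sketched as follows. This is an illustration consistent with the description above, not the actual contents of `src/utils/text_logger.py`, and the logger name is an assumption:

```python
import logging
import os
import time

def get_logger(name="icd_benchmark", log_dir="logs"):
    """Console handler at INFO, file handler at DEBUG, log file named by timestamp."""
    logger = logging.getLogger(name)
    if logger.handlers:  # configure only once per logger name
        return logger
    logger.setLevel(logging.DEBUG)

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)  # terminal shows INFO and above
    logger.addHandler(console)

    os.makedirs(log_dir, exist_ok=True)
    log_file = os.path.join(log_dir, f"{time.time()}.log")  # e.g. logs/1646447226.938508.log
    file_handler = logging.FileHandler(log_file)
    file_handler.setLevel(logging.DEBUG)  # log file also keeps debug messages
    logger.addHandler(file_handler)
    return logger
```

With this split, `logger.debug(...)` from inner modules lands only in the log file, while `logger.info(...)` from outer modules reaches both the terminal and the file.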