PathologyDataScience / NuCLS

NuCLS: A scalable crowdsourcing, deep learning approach and dataset for nucleus classification, localization and segmentation
MIT License

could you provide the documentation/instructions on how to use your code? #2

Open quincy-125 opened 3 years ago

quincy-125 commented 3 years ago

Hello, I plan to use your code on our own .svs H&E-stained whole-slide images, but I did not find any documentation in your repo. Could you provide more information on the input data preprocessing and code instructions? Thanks!

kheffah commented 3 years ago

Hello @quincy-125 ,

Thank you for your interest in running our code. You're right, the code is still insufficiently documented; I plan to expand the documentation substantially over the coming 1-2 weeks once I get a few priorities out of the way. For now, here's a sample code snippet to train NuCLS using this code base.

import sys
import os
from os.path import join as opj
import argparse

# Make the repository root importable and define where results live.
# NOTE: this path is a placeholder -- point it at your local clone.
BASEPATH = '/path/to/NuCLS'
sys.path.insert(0, BASEPATH)

parser = argparse.ArgumentParser(description='Train nucleus model.')
parser.add_argument('-f', type=int, default=[1], nargs='+', help='fold(s) to run')
parser.add_argument('-g', type=int, default=[0], nargs='+', help='gpu(s) to use')
parser.add_argument('--qcd', type=int, default=1, help='use QCd data for training?')
parser.add_argument('--train', type=int, default=1, help='train?')
parser.add_argument('--vistest', type=int, default=1, help='visualize results on testing?')
args = parser.parse_args()
args.qcd = bool(args.qcd)
args.train = bool(args.train)
args.vistest = bool(args.vistest)

# GPU allocation MUST happen before importing other modules
from GeneralUtils import save_configs, maybe_mkdir, AllocateGPU
AllocateGPU(GPUs_to_use=args.g)

from nucleus_model.MiscUtils import load_saved_otherwise_default_model_configs
from configs.nucleus_model_configs import CoreSetQC, CoreSetNoQC
from nucleus_model.NucleusWorkflows import run_one_maskrcnn_fold

# %%===========================================================================
# Configs

model_name = '002_MaskRCNN_tmp'
dataset_name = CoreSetQC.dataset_name if args.qcd else CoreSetNoQC.dataset_name
all_models_root = opj(BASEPATH, f'results/tcga-nucleus/models/{dataset_name}/')
model_root = opj(all_models_root, model_name)
maybe_mkdir(model_root)

# load configs
configs_path = opj(model_root, 'nucleus_model_configs.py')
cfg = load_saved_otherwise_default_model_configs(configs_path=configs_path)

# for reproducibility, copy configs & most relevant code file to results
if not os.path.exists(configs_path):
    save_configs(
        configs_path=opj(BASEPATH, 'configs/nucleus_model_configs.py'),
        results_path=model_root)
save_configs(
    configs_path=os.path.abspath(__file__),
    results_path=model_root, warn=False)
save_configs(
    configs_path=opj(BASEPATH, 'nucleus_model/NucleusWorkflows.py'),
    results_path=model_root, warn=False)

# %%===========================================================================
# Now run

for fold in args.f:
    run_one_maskrcnn_fold(
        fold=fold, cfg=cfg, model_root=model_root, model_name=model_name,
        qcd_training=args.qcd, train=args.train, vis_test=args.vistest)

# %%===========================================================================
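For reference, assuming the snippet above is saved as a script (the filename train_nucls.py below is just a placeholder), a typical invocation to train folds 1 and 2 on GPU 0 with the QC'd data would look like:

python train_nucls.py -f 1 2 -g 0 --qcd 1 --train 1 --vistest 1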
quincy-125 commented 3 years ago

Thanks! Quick question here: what is the input data for your model? Is it an entire .svs slide or the .png image patches? Do you need the binary mask image with the annotated cells? Thanks!

kheffah commented 3 years ago

Sure thing. For model training, the input is the patches plus the annotations, loaded however you prefer; in my case, I preferred to use the annotation csv files. For inference, you can use the trained weights on anything you'd like, including stand-alone .png images or tiles fetched from a slide using, say, openslide.
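For illustration, here is a minimal sketch of fetching a tile with openslide (not part of this code base; the slide path, tile location, and tile size are placeholders):

import numpy as np
import openslide

# open the whole-slide image and read a single tile at full resolution
slide = openslide.OpenSlide('/path/to/slide.svs')
tile = slide.read_region(location=(5000, 5000), level=0, size=(300, 300))
rgb = np.array(tile.convert('RGB'))  # read_region returns RGBA; drop alpha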

quincy-125 commented 3 years ago

So that means I need to load image patches and a csv file recording how many cells are in each patch? For example, the csv file ideally should have at least 2 columns: patch names and the corresponding number of cells? Am I interpreting what you just said correctly? Thanks for your immediate response, highly appreciated!

kheffah commented 3 years ago

Yes, you are correct. Each patch RGB image has an associated csv file, which contains the coordinates and classes of the nuclei in that patch. You can read more about the data format on this page. Let me know if anything is unclear, always happy to help :)
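For instance, here is a minimal sketch of reading one patch's annotation csv with pandas; the path is a placeholder, and the bounding-box and classification column names should be verified against the dataset documentation:

import pandas as pd

annots = pd.read_csv('/path/to/one_patch_annotations.csv')
boxes = annots[['xmin', 'ymin', 'xmax', 'ymax']].values  # nucleus bounding boxes
classes = annots['raw_classification'].tolist()          # nucleus class labels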

quincy-125 commented 3 years ago

Thanks a lot! Could you provide a link to your test sample data that I could play with? By the way, I did not see your email on the manuscript. If you don't mind, could you leave your email here? If I run into any problems, I will probably need to reach out to you again.

kheffah commented 3 years ago

Sure thing. When I prototype, I usually use the Corrected single-rater dataset, which you can find the link for on this page. You can either download the full dataset or just a few images to play around with for prototyping. Actually, anything that happens in this repository sends me an email directly (I "watch" it), so the preferred way to ask questions is here through GitHub issues. That being said, if you have any requests or questions that are not suitable for public viewing, feel free to email me at: mtageld@emory.edu .

quincy-125 commented 3 years ago

Sounds good. Thank you

demonhawk007 commented 3 years ago

Hi, I have a query regarding where to locate "configs.nucleus_style_defaults". Could you please help me? I couldn't find it in the dependencies.

kheffah commented 3 years ago

Hi @demonhawk007 , Thank you for the question. That was just a typo: I renamed the folder config to configs within this repository. You should now be able to find it.

demonhawk007 commented 3 years ago

The issue is not the typo. I had already corrected that. I am unable to find "nucleus_style_defaults". I receive "ModuleNotFoundError: No module named 'configs.nucleus_style_defaults'"

kheffah commented 3 years ago

@demonhawk007 Ah, I see what you mean. Well, you are right, this repo is not yet a "package" as there is no setup.py, etc. For now, you can just make sure to add the repository root to the path at the start of your script:

import sys
sys.path.insert(0, '/path/to/this/repository')

Then all of these imports would work. The file you are looking for is here.

demonhawk007 commented 3 years ago

What I am trying to say is that "nucleus_style_defaults" is supposed to be a python file inside the configs folder, which can be found here. Since you are using "from configs import nucleus_style_defaults", it is unable to locate that. On the other hand, we can see that "nucleus_model_configs.py" is present; that is why we are able to import it. I hope I am making my point clear.

kheffah commented 3 years ago

@demonhawk007 Whoops! Thank you for pointing this out. In my head, I was confusing nucleus_model_configs.py with nucleus_style_defaults.py. OK, I added it, and you should see it now here.

demonhawk007 commented 3 years ago

Hi @kheffah , Thank you. I have another follow-up query: the __file__ you mentioned here, what does it refer to? A sample would be extremely helpful.

kheffah commented 3 years ago

@demonhawk007 Sure thing. __file__ here is literally the same script you are running. For example, if you run:

import os
print(os.path.abspath(__file__))

from a script file called myPythonFile.py, it would print something like /path/to/myPythonFile.py. See here for example.

demonhawk007 commented 3 years ago

Thank you @kheffah. I am very close to executing the setup. Currently I am facing issues with the sqlite db setup; I will revert here if any clarification is needed.

demonhawk007 commented 3 years ago

Hi @kheffah , I would like to know if there is a separate utility/dependency file for the database, as I don't see where the tables are created.

zunairaR commented 3 years ago

Salam, I was thinking of using this dataset, but I couldn't understand what the input image size to the model is. The paper states the crop size is 300. Does that mean all the rgb patch images (.png format, with variable sizes) are resized to 300 pixels, or are patches extracted from the patch images using some sliding-window technique? Secondly, why does the number of nuclei differ for each of the detection, segmentation and classification tasks? Thanks

zunairaR commented 3 years ago

I have one more query regarding the available FOVs: have they already undergone stain normalization, or do we need to do it before preparing the data for deep learning? Thanks

kheffah commented 3 years ago

@demonhawk007 Apologies for the delay. Actually, the database is just an sqlite version of the csv files that were publicly shared; if you convert the csv to sqlite, you will have the data you need. The only reason we shared them as csv is that csv files are more platform-agnostic and widely recognized.

kheffah commented 3 years ago

@zunairaR Salam, thank you for your question. I hope my response to the other issue answered your questions about the crop size and stain normalization. As for your second question, the number of nuclei will be different for each image, and that's OK. In fact, Mask R-CNN (and by extension, our modified NuCLS version) already handles this by setting a very large maximum number of detections per image, say 300 nuclei, then using non-maximum suppression to remove detections that are unrealistically close and overlapping.
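To illustrate the suppression step, here is a minimal sketch using torchvision's box-NMS utility (the boxes, scores, and IoU threshold are made up for illustration):

import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 50., 50.],
                      [12., 12., 52., 52.],     # near-duplicate of the first box
                      [100., 100., 140., 140.]])
scores = torch.tensor([0.90, 0.80, 0.95])
keep = nms(boxes, scores, iou_threshold=0.5)  # indices of surviving boxes
print(keep)  # tensor([2, 0]) -- the near-duplicate detection is suppressed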

Amshoreline commented 3 years ago

> @demonhawk007 Apologies for the delay. Actually, the database is just an sqlite version of the csv files that were publicly shared; if you convert the csv to sqlite, you will have the data you need. The only reason we shared them as csv is that csv files are more platform-agnostic and widely recognized.

Could you provide the sqlite version of the csv files? Thanks

player1321 commented 3 years ago

@kheffah I tried to convert the csv to sqlite with the following code, but there are still some missing columns. Can you provide the sqlite file or some script to generate the sqlite from the csv?

import os
import sqlite3

import pandas as pd

def get_fov_meta(csv_file):
    # parse the slide name and fov id out of the fovname column
    df = pd.read_csv(csv_file)
    df['fov_id'] = df['fovname'].apply(lambda x: x.split('_')[1])
    df['slide_name'] = df['fovname'].apply(lambda x: x.split('_')[0])
    return df

def get_annotation_elements(csv_folder):
    # concatenate the per-fov annotation csvs into a single table
    csv_file_list = [i for i in os.listdir(csv_folder) if 'TCGA' in i]
    dfs = []
    for csv_file in csv_file_list:
        df = pd.read_csv(os.path.join(csv_folder, csv_file))
        df['fov_id'] = csv_file[:-4].split('_')[1]
        df['slide_name'] = csv_file[:-4].split('_')[0]
        df['fovname'] = csv_file[:-4]
        dfs.append(df)
    return pd.concat(dfs, ignore_index=True)

con = sqlite3.connect('QC.sqlite')
df = get_fov_meta('ALL_FOV_LOCATIONS.csv')
df_all = get_annotation_elements('QC/csv')
df.to_sql('fov_meta', con, if_exists='replace', index=False)
df_all.to_sql('annotation_elements', con, if_exists='replace', index=False)

kheffah commented 3 years ago

@Amshoreline @player1321 Please use the following links to access the sqlite database:

Important note: this is RAW data! The csv files are better suited for use, but feel free to use the raw data if you have a very strong preference for sqlite.

Amshoreline commented 3 years ago

> @Amshoreline @player1321 Please use the following links to access the sqlite database:
>
> Important note: this is RAW data! The csv files are better suited for use, but feel free to use the raw data if you have a very strong preference for sqlite.

Thanks, but I encountered a new problem:

File "NuCLS/nucls_model/DataLoadingUtils.py", line 539, in __getitem__
    boxes=np.int32(target['boxes']))
File "NuCLS/nucls_model/DataFormattingUtils.py", line 67, in from_dense_to_sparse_object_mask
    obj_ids = dense_mask[ys[:, 0], xs[:, 0]]
IndexError: too many indices for array

chenqz1998 commented 3 years ago

I'm still not sure how to use your code. Could you please explain that?