Add script to run DeepCell with QuPath conventions

dchaley commented 2 months ago

From Brenna:

The QuPath convention is to structure a "dataset" as a folder. The input images (OME-TIFF) are in one folder; the output segmentation mask is a TIFF in another folder.

In between, we need to convert the OME-TIFFs to NPZ. This is the process we followed in workbooks like this one: Extract-Sample_preview-human-breast-20221103-418mb.ipynb

For our purposes, the implication is we need a helper script to:

receive a dataset root folder
receive an NPZ file (already generated) in the NPZ_INTERMEDIATE folder
output a TIFF segmentation mask in the SEGMASK folder

Our current helper script generates a randomly named folder in a base directory to hold the job's intermediate files and output. In QuPath mode we'll still need a place to put the intermediate files, possibly using the dataset as the base with a timestamped folder per run (?)

dchaley commented 2 months ago

This could be a command-line script with the following interface:

deepcell-qupath.py
  --dataset gs://project/data/somedataset
  --prefix ROI01
  --compartment <whole_cell | nucleus>

This would read from: gs://project/data/somedataset/NPZ_INTERMEDIATE/<prefix>.npz

And it would write to: gs://project/data/somedataset/SEGMASK/<prefix>_<compartment>.tiff

And it would put its intermediate files here: gs://project/data/somedataset/jobs/<datetime.now>_<prefix>/
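
A minimal sketch of the interface and folder layout above, in case it helps the discussion. Everything here is hypothetical: the names `QuPathPaths`, `build_parser`, and `qupath_paths` are made up for illustration, the `.tiff` extension follows the TIFF segmentation mask requested at the top of this issue, and the ISO timestamp format for the job folder is an assumption.

```python
import argparse
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class QuPathPaths:
    """Locations implied by the proposed QuPath layout (hypothetical helper)."""
    input_npz: str
    output_mask: str
    job_dir: str


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface as proposed in this comment."""
    parser = argparse.ArgumentParser(description="Run DeepCell on a QuPath-style dataset")
    parser.add_argument("--dataset", required=True, help="Dataset root folder")
    parser.add_argument("--prefix", required=True, help="Name of the image to process, e.g. ROI01")
    parser.add_argument("--compartment", required=True, choices=["whole_cell", "nucleus"])
    return parser


def qupath_paths(dataset: str, prefix: str, compartment: str, now: datetime) -> QuPathPaths:
    """Derive the input, output, and job folder locations from the arguments."""
    root = dataset.rstrip("/")  # tolerate a trailing slash on the dataset root
    return QuPathPaths(
        input_npz=f"{root}/NPZ_INTERMEDIATE/{prefix}.npz",
        # Assumption: the mask is written as a TIFF.
        output_mask=f"{root}/SEGMASK/{prefix}_{compartment}.tiff",
        # Assumption: the timestamp is rendered with datetime.isoformat().
        job_dir=f"{root}/jobs/{now.isoformat()}_{prefix}/",
    )
```

Passing `now` in explicitly (rather than calling `datetime.now()` inside the function) keeps the path logic deterministic and easy to test; the entrypoint would call `qupath_paths(args.dataset, args.prefix, args.compartment, datetime.now())`.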

The script could either wrap run-multistep-job (which would need to parameterize the folders more) or, perhaps better, we could extract the part that just takes files and have two entrypoints depending on the convention: one for QuPath, and one that generates a job ID directory like we've been doing.

Question: is "prefix" the right word here? Are these always ROIs? What should we call the portion being analyzed?

cc @bnovotny what do you think?

bnovotny commented 1 month ago

Thanks @dchaley! I think this would work! Then we can just set up the script to loop through and submit all the images in a dataset as long as they go back to the same bucket.

I think "prefix" is fine, since the images might be ROIs, whole slides, TMA cores, etc. Let me know if you need any more info!
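
The "loop through and submit all the images" idea could look roughly like the sketch below. It assumes the caller already has a listing of object URLs from the dataset's NPZ_INTERMEDIATE folder (for example from `gsutil ls`), and it only builds the command lines; it does not submit anything. The function names and the `script` parameter are hypothetical.

```python
import posixpath
from typing import Iterable, List


def prefixes_from_listing(object_urls: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated prefixes of every '<prefix>.npz' object."""
    prefixes = set()
    for url in object_urls:
        name = posixpath.basename(url)
        # Folder entries have an empty basename and are skipped here.
        if name.endswith(".npz"):
            prefixes.add(name[: -len(".npz")])
    return sorted(prefixes)


def build_commands(script: str, dataset: str, object_urls: Iterable[str],
                   compartment: str) -> List[List[str]]:
    """Build one argument list per image, suitable for subprocess.run()."""
    return [
        [script, "--dataset", dataset, "--prefix", prefix, "--compartment", compartment]
        for prefix in prefixes_from_listing(object_urls)
    ]
```

Each returned list can be handed to `subprocess.run(cmd, check=True)`. Because every command uses the same `--dataset`, all outputs go back to the same bucket.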

dchaley commented 1 month ago

This is complete! @bnovotny (cc @lynnlangit)

Check out batch/run-qupath-job.py

This sets up the job parameters following the QuPath conventions discussed here.

Example invocation:

./run-qupath-job.py --dataset "gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset" --prefix ROI01 --compartment whole-cell

After running it twice, this is what we get:

$ gsutil ls -r gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset
gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset/:

gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset/NPZ_INTERMEDIATE/:
gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset/NPZ_INTERMEDIATE/ROI01.npz

gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset/SEGMASK/:
gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset/SEGMASK/ROI01_whole-cell.tiff

gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset/jobs/:

gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset/jobs/ROI01_2024-07-16T14:27:44.934695/:
gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset/jobs/ROI01_2024-07-16T14:27:44.934695/postprocess_benchmark.json
gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset/jobs/ROI01_2024-07-16T14:27:44.934695/prediction_benchmark.json
gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset/jobs/ROI01_2024-07-16T14:27:44.934695/predictions.npz.gz
gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset/jobs/ROI01_2024-07-16T14:27:44.934695/preprocess_benchmark.json
gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset/jobs/ROI01_2024-07-16T14:27:44.934695/preprocessed.npz.gz
gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset/jobs/ROI01_2024-07-16T14:27:44.934695/raw_predictions.npz.gz
gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset/jobs/ROI01_2024-07-16T14:27:44.934695/visualized_input.png
gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset/jobs/ROI01_2024-07-16T14:27:44.934695/visualized_predictions.png

gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset/jobs/ROI01_2024-07-16T14:39:14.518835/:
gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset/jobs/ROI01_2024-07-16T14:39:14.518835/postprocess_benchmark.json
gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset/jobs/ROI01_2024-07-16T14:39:14.518835/prediction_benchmark.json
gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset/jobs/ROI01_2024-07-16T14:39:14.518835/predictions.npz.gz
gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset/jobs/ROI01_2024-07-16T14:39:14.518835/preprocess_benchmark.json
gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset/jobs/ROI01_2024-07-16T14:39:14.518835/preprocessed.npz.gz
gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset/jobs/ROI01_2024-07-16T14:39:14.518835/raw_predictions.npz.gz
gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset/jobs/ROI01_2024-07-16T14:39:14.518835/visualized_input.png
gs://deepcell-batch-jobs_us-central1/job-runs/tmp-pipeline/qupath_dataset/jobs/ROI01_2024-07-16T14:39:14.518835/visualized_predictions.png
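
Since every run gets its own folder under jobs/, here is a minimal sketch for picking out the most recent run for a given prefix. It assumes folder names always look like `<prefix>_<ISO timestamp>`, which is inferred from the listing above rather than from the script itself. `parse_job_dir` and `latest_job_dir` are hypothetical helpers.

```python
import posixpath
from datetime import datetime
from typing import List, Tuple


def parse_job_dir(url: str) -> Tuple[str, datetime]:
    """Split a job folder URL into (prefix, run timestamp)."""
    # gsutil prints folders with a trailing "/:", so strip both characters.
    name = posixpath.basename(url.rstrip("/:"))
    # Split on the last underscore so prefixes may contain underscores too.
    prefix, _, stamp = name.rpartition("_")
    return prefix, datetime.fromisoformat(stamp)


def latest_job_dir(job_dirs: List[str], prefix: str) -> str:
    """Return the job folder with the newest timestamp for this prefix."""
    matching = [d for d in job_dirs if parse_job_dir(d)[0] == prefix]
    # max() raises ValueError if no folder matches the prefix.
    return max(matching, key=lambda d: parse_job_dir(d)[1])
```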

One thought: you were modifying the JSON, and this change moves the JSON into a helper function (so it can be shared with the non-QuPath version). Is that a problem? Are there more things you need to parameterize that we could just include here? I think you have network/subnet, for example. What about machine type?

Maybe we can pull the JSON template out into a separate file.

Would be curious what your experience is running this, and we can improve accordingly!

bnovotny commented 1 month ago

Hi @dchaley, apologies, just getting back to this. This is the chunk, with some context on either end, that I had to add to the JSON (of course with the service account, network, and subnetwork filled in). I haven't changed the machine type, accelerators, etc., but I'm thinking we may need to change them in the future when we work on larger images. Thanks!

    "location": {{
      "allowedLocations": [
        "regions/{region}"
      ]
    }},
    "serviceAccount": {{
      "email": ""
    }},
    "network": {{
      "networkInterfaces": [
        {{
          "network": "",
          "subnetwork": "",
          "noExternalIpAddress": true
        }}
      ]
    }}
  }},
  "logsPolicy": {{
    "destination": "CLOUD_LOGGING"
  }}
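
For reference, the doubled braces in this chunk are how literal `{` and `}` are written inside a Python `str.format` template, while single-brace names like `{region}` get substituted. A minimal sketch of how the values I filled in by hand could become parameters instead is below. The placeholders `{service_account}`, `{network}`, and `{subnetwork}`, the function `render_job_json`, and the enclosing `allocationPolicy` key are assumptions for illustration, not what the helper currently does.

```python
import json

# Hypothetical, trimmed-down template. Only {region} appears in the chunk
# above; the other placeholders and the "allocationPolicy" wrapper are assumed.
JOB_TEMPLATE = """
{{
  "allocationPolicy": {{
    "location": {{
      "allowedLocations": [
        "regions/{region}"
      ]
    }},
    "serviceAccount": {{
      "email": "{service_account}"
    }},
    "network": {{
      "networkInterfaces": [
        {{
          "network": "{network}",
          "subnetwork": "{subnetwork}",
          "noExternalIpAddress": true
        }}
      ]
    }}
  }},
  "logsPolicy": {{
    "destination": "CLOUD_LOGGING"
  }}
}}
"""


def render_job_json(region: str, service_account: str, network: str, subnetwork: str) -> dict:
    """Fill in the template and parse it, so malformed JSON fails early."""
    rendered = JOB_TEMPLATE.format(
        region=region,
        service_account=service_account,
        network=network,
        subnetwork=subnetwork,
    )
    return json.loads(rendered)
```

One caveat with this approach: values are pasted into the template as-is, so a value containing a double quote or backslash would produce invalid JSON. Parsing the result with `json.loads` at least surfaces that immediately.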

dchaley commented 1 month ago

Great, thank you @bnovotny! We're going to tackle that in #284. It should make life a lot easier.

For now, let's close out this issue – if anything goes wrong running with this QuPath mode please comment here and/or open a new issue!

dchaley / deepcell-imaging

Add script to run DeepCell with QuPath conventions #281