SBU-BMI / wsinfer

🔥 🚀 Blazingly fast pipeline for patch-based classification in whole slide images
https://wsinfer.readthedocs.io
Apache License 2.0

always create geojson outputs with 'wsinfer run' #181

Closed · kaczmarj closed this 10 months ago

kaczmarj commented 11 months ago

let's create geojson files at the end of each wsinfer run call. to do this, a few things will have to be done:

@swaradgat19 - would you like to tackle this?

swaradgat19 commented 11 months ago

Sure! I'll work on this.

swaradgat19 commented 11 months ago

@kaczmarj - I was able to generate the geojson files with the wsinfer run command by using the functions defined in convert_csv_to_geojson.py inside the run function in infer.py.
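For context, each patch row becomes a GeoJSON feature roughly along these lines (an illustrative sketch only; the patch size and property name here are hypothetical, not the exact wsinfer schema):

# one patch as a GeoJSON-style feature; the coordinates and the
# "prob_tumor" property name below are hypothetical examples
feature = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        # bounding box of the patch, closed back to the first corner
        "coordinates": [[[0, 0], [350, 0], [350, 350], [0, 350], [0, 0]]],
    },
    "properties": {"prob_tumor": 0.93},
}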

I wanted to ask whether it would be appropriate to delete the following lines of code in infer.py:

if not results_dir.exists():
    raise click.ClickException(f"results_dir does not exist: {results_dir}")

if (
    not (results_dir / "model-outputs").exists()
    and (results_dir / "patches").exists()
):
    raise click.ClickException(
        "Model outputs have not been generated yet. Please run model inference."
    )

if not (results_dir / "model-outputs").exists():
    raise click.ClickException(
        "Expected results_dir to contain a 'model-outputs' directory but it does"
        " not. Please provide the path to the directory that contains"
        " model-outputs, masks, and patches."
    )

Since we're running this inside the run function in infer.py, it is certain that results_dir will exist, although I couldn't see where results_dir is created (something like results_dir.mkdir()).

Should I remove this piece of code, since results_dir / "model-outputs" and results_dir / "patches" are guaranteed to exist?

kaczmarj commented 11 months ago

good question. i would say it depends on what the function does. if the function can also be called on its own and requires model-outputs and patches to exist, then you should keep those checks. i would say just keep them.

but change the exception classes to FileNotFoundError.
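for example, the first check could become something like this:

if not results_dir.exists():
    raise FileNotFoundError(f"results_dir does not exist: {results_dir}")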

swaradgat19 commented 11 months ago

@kaczmarj I'm able to generate the geojson files along with the csv files when I run the wsinfer run command, and I'm storing them in the Results/ folder. The results folder looks like this after the run command:

masks model-outputs-csv model-outputs-geojson patches run_metadata_20230811T143002.json

I'm getting a new run_metadata file every time I use the command. Just wanted to confirm if that's how it's supposed to work. I'm yet to parallelize the geojson conversion.

kaczmarj commented 11 months ago

looks great! the results folder has the correct contents.

> I'm getting a new run_metadata file every time I use the command. Just wanted to confirm if that's how it's supposed to work. I'm yet to parallelize the geojson conversion.

yes, that's intentional. a user might re-run analysis on some slides for a number of reasons, and instead of rewriting the metadata json file every time, it seemed reasonable to write one for each run.
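the timestamp in the filename presumably comes from something like this (a sketch, not necessarily the exact wsinfer code):

from datetime import datetime

# produces names like run_metadata_20230811T143002.json
timestamp = datetime.now().strftime("%Y%m%dT%H%M%S")
metadata_path = results_dir / f"run_metadata_{timestamp}.json"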

swaradgat19 commented 11 months ago

Great. Could you tell me how I should proceed with the last point, where we utilize multiple cores? I was reading up on that. Should we use the multiprocessing library?

kaczmarj commented 11 months ago

yes, the multiprocessing library would work. i pasted a small example below. but instead, let's use tqdm.contrib.concurrent.process_map (https://tqdm.github.io/docs/contrib.concurrent/#process_map), because that will automatically show a progress bar too.

from multiprocessing import Pool

def make_geojson(results_for_one_slide):
    # write geojson here....
    pass

# num_workers and the result objects are placeholders
with Pool(num_workers) as p:
    # imap_unordered is lazy, so iterate over it to actually run the jobs
    for _ in p.imap_unordered(make_geojson, [result1, result2, result3], chunksize=8):
        pass

there are other options too.

  1. concurrent.futures from the python standard library. if you use the ProcessPoolExecutor, it is very similar to using the multiprocessing library.
  2. the process_map function from tqdm.contrib.concurrent. tqdm is a very nice library for showing progress bars. we will probably want to show a progress bar during this process, so let's use this process_map function! i highly suggest you look at the source to see how it works; it uses concurrent.futures under the hood: https://tqdm.github.io/docs/contrib.concurrent/#process_map (see the sketch below)
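a minimal sketch with process_map, reusing make_geojson and the placeholder names from the example above:

from tqdm.contrib.concurrent import process_map

# same work as the Pool example, but with a progress bar for free
process_map(make_geojson, [result1, result2, result3], max_workers=num_workers)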

swaradgat19 commented 11 months ago

Got it.

Right now, I've simply added the relevant code from wsinfer/write_geojson.py to the run function in infer.py. So wsinfer/write_geojson.py contains the helper functions _box_to_polygon, _row_to_geojson, _dataframe_to_geojson, and convert, and I'm calling them from the run function.

I'll create a separate function in infer.py and then do the conversion using the options you've suggested.

kaczmarj commented 11 months ago

i would suggest creating a function in wsinfer/write_geojson.py that writes a single geojson file for the output of one slide. then implement another function in that file that applies the single-file function in parallel (using tqdm, for example) across all of the files.

then in infer.py, use the function from wsinfer/write_geojson.py that writes all of the geojson files in parallel.
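something along these lines (the names here are just illustrative, not the final api):

from pathlib import Path

from tqdm.contrib.concurrent import process_map

# in wsinfer/write_geojson.py

def make_geojson(csv: Path) -> None:
    # convert the model-output csv for one slide into one geojson file
    ...

def write_geojsons(csvs: list, num_workers: int) -> None:
    # apply make_geojson to every csv in parallel, with a progress bar
    process_map(make_geojson, csvs, max_workers=num_workers)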

swaradgat19 commented 11 months ago

Yes, that makes sense. Working on this.

swaradgat19 commented 11 months ago

@kaczmarj

> i would suggest creating a function in wsinfer/write_geojson.py that writes a single geojson file for the output of one slide

Correct me if I'm wrong, but this is being done by the convert function (previously in convert_csv_to_geojson.py), right?

I'm looping over all CSVs in write_geojson.py and calling the convert function each time. We would need to parallelize this loop itself, if I'm not wrong.

swaradgat19 commented 11 months ago

So I've written this code in write_geojson.py:

import json
from pathlib import Path

import click
import pandas as pd
from tqdm.contrib.concurrent import process_map

# _box_to_polygon, _row_to_geojson, _dataframe_to_geojson, and convert
# are defined above in this module

def make_geojson(csv: Path) -> None:
    df = pd.read_csv(csv)
    prob_cols = [col for col in df.columns.tolist() if col.startswith("prob_")]
    if not prob_cols:
        raise click.ClickException("Did not find any columns with prob_ prefix.")
    geojson = _dataframe_to_geojson(df, prob_cols)
    with open(f"Results/model-outputs-geojson/{csv.name.split('.')[0]}.json", "w") as f:
        json.dump(geojson, f)

def parallelize(csvs: list, results_dir: Path, num_workers: int) -> None:
    output = results_dir / "model-outputs-geojson"

    if not results_dir.exists():
        raise click.ClickException(f"results_dir does not exist: {results_dir}")
    if output.exists():
        raise click.ClickException("Output directory already exists.")
    if (
        not (results_dir / "model-outputs-csv").exists()
        and (results_dir / "patches").exists()
    ):
        raise click.ClickException(
            "Model outputs have not been generated yet. Please run model inference."
        )
    if not (results_dir / "model-outputs-csv").exists():
        raise click.ClickException(
            "Expected results_dir to contain a 'model-outputs-csv' directory but it"
            " does not. Please provide the path to the directory that contains"
            " model-outputs-csv, masks, and patches."
        )

    output.mkdir(parents=True, exist_ok=False)

    click.secho("Converting csv files to geojson files", fg="green")

    # serial version, for comparison:
    # for csv in tqdm.tqdm(csvs):
    #     convert(input=csv, output=f"Results/model-outputs-geojson/{csv.name.split('.')[0]}.json")

    process_map(make_geojson, csvs, max_workers=num_workers)

    click.secho("Finished.", fg="green")

It converts the csvs to geojson files, but I'm not sure whether we're getting any performance benefit. I'm converting 3 csvs to JSON files and I'm getting these results:

Finished.
Converting csv files to geojson files
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:04<00:00,  1.46s/it]

kaczmarj commented 11 months ago

> Correct me if I'm wrong, but this is being done by the convert function (previously in convert_csv_to_geojson.py), right?

oh yes, you are correct!

> I'm looping over all CSVs in write_geojson.py and calling the convert function each time. We would need to parallelize this loop itself, if I'm not wrong.

yes, that's correct.

> So I've written this code in write_geojson.py

looks good overall. the make_geojson function should take the results directory as an input. and replace the click.ClickException errors with FileExistsError or something similar. in output.mkdir, exist_ok should probably be True. and you can remove the click.secho calls.
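applied to the parallelize snippet above, the suggested tweaks might look roughly like this (a sketch, not the final code):

if not results_dir.exists():
    raise FileNotFoundError(f"results_dir does not exist: {results_dir}")
if not (results_dir / "model-outputs-csv").exists():
    raise FileNotFoundError("Model outputs have not been generated yet.")

output.mkdir(parents=True, exist_ok=True)  # don't fail if it already exists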

swaradgat19 commented 11 months ago

I've made the changes. The only issue with make_geojson taking the extra argument is that process_map doesn't seem to support multiple arguments. I tried passing results_dir through it like this:

process_map(make_geojson, csvs, results_dir, max_workers=num_workers)

But it gives me an error saying that results_dir is not iterable. I guess process_map only accepts iterables as positional arguments. I'm trying to figure out a way of passing it through functools.

One way is to specify it explicitly inside the function:

def make_geojson(csv: Path) -> None:
    results_dir = "Results/model-outputs-geojson"
    # rest of the code

This works.

swaradgat19 commented 11 months ago

    func = partial(make_geojson, results_dir)
    process_map(func, csvs, max_workers=num_workers)
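(If I understand it correctly, functools.partial binds results_dir as the first positional argument, so make_geojson now takes results_dir as its first parameter and the csv path as its second.)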

So this seems to be working. Should I open a PR for this?

kaczmarj commented 11 months ago

fantastic, yes please submit a pr for this.

swaradgat19 commented 11 months ago

I pushed the changes to my main branch, but it failed all the tests. I guess we'll have to update the tests as well. I've opened a PR, though.

kaczmarj commented 11 months ago

yes, the tests will definitely have to be updated to use the new code.