Closed kaczmarj closed 10 months ago
Sure! I'll work on this
@kaczmarj - I was able to generate the geojson files using the wsinfer run command by utilizing the functions defined in convert_csv_to_geojson.py in the run function in infer.py. I wanted to ask if it would be appropriate to delete the following lines of code in infer.py:
if not results_dir.exists():
    raise click.ClickException(f"results_dir does not exist: {results_dir}")
if (
    not (results_dir / "model-outputs").exists()
    and (results_dir / "patches").exists()
):
    raise click.ClickException(
        "Model outputs have not been generated yet. Please run model inference."
    )
if not (results_dir / "model-outputs").exists():
    raise click.ClickException(
        "Expected results_dir to contain a 'model-outputs' directory but it does"
        " not. Please provide the path to the directory that contains"
        " model-outputs, masks, and patches."
    )
Since we're running this inside the run function in infer.py, it is certain that results_dir will exist, although I couldn't see where results_dir is being created (something like results_dir.mkdir()). Should I remove this piece of code, since results_dir/"model-outputs" and results_dir/"patches" are guaranteed to exist?
good question. i would say it depends on what the function does. if the function can also be called on its own and requires model-outputs and patches to exist, then you should keep those checks. i would say just keep them, but change the exception classes to FileNotFoundError.
@kaczmarj I'm able to generate the geojson files along with the csv files when I run the wsinfer run command, and I'm storing them in the Results/ folder. The results folder looks like this after the run command:
masks model-outputs-csv model-outputs-geojson patches run_metadata_20230811T143002.json
I'm getting a new run_metadata file every time I use the command. Just wanted to confirm if that's how it's supposed to work. I'm yet to parallelize the geojson conversion.
looks great! the results folder has the correct contents.
I'm getting a new run_metadata file every time I use the command. Just wanted to confirm if that's how it's supposed to work. I'm yet to parallelize the geojson conversion.
yes, that's intentional. a user might re-run analysis on some slides for a number of reasons, and instead of rewriting the metadata json file every time, it seemed reasonable to write one for each run.
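As an aside, a per-run filename like run_metadata_20230811T143002.json can be produced with strftime. This is a sketch of the general pattern, not necessarily wsinfer's exact implementation (metadata_filename is a hypothetical name):

```python
from datetime import datetime
from typing import Optional


def metadata_filename(now: Optional[datetime] = None) -> str:
    # one metadata file per run, tagged with the run's start time,
    # so earlier runs are never overwritten
    now = now or datetime.now()
    return f"run_metadata_{now.strftime('%Y%m%dT%H%M%S')}.json"


print(metadata_filename(datetime(2023, 8, 11, 14, 30, 2)))
# → run_metadata_20230811T143002.json
```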
Great. Could you tell me how I should proceed with the last point where we are utilizing multiple cores? I was reading up on that. Should we use the multiprocessing library?
yes, the multiprocessing library would work. i pasted a small example below. but instead, let's use tqdm.contrib.concurrent.process_map (https://tqdm.github.io/docs/contrib.concurrent/#process_map) because that will automatically show a progress bar too.
from multiprocessing import Pool

def make_geojson(results_for_one_slide):
    # write geojson here....
    pass

with Pool(num_workers) as p:
    p.imap_unordered(make_geojson, [result1, result2, result3], chunksize=8)
there are other options too, like the process_map function from tqdm.contrib.concurrent. tqdm is a very nice library for showing progress bars. we will probably want to show a progress bar during this process, so actually let's use this process_map function! i highly suggest you look at the source to see how it works. it uses concurrent.futures: https://tqdm.github.io/docs/contrib.concurrent/#process_map

Got it.
Right now, I've simply added the relevant code from wsinfer/write_geojson.py to the run function in infer.py. So wsinfer/write_geojson.py contains the helper functions like _box_to_polygon, _row_to_geojson, _dataframe_to_geojson, and convert, and I'm accessing them in the run function. I'll create a different function in infer.py and then do the conversion process using the options you've suggested.
i would suggest creating a function in wsinfer/write_geojson.py that can be used to write a single geojson file for the output of one slide. then implement another function in that file that applies the first function in parallel (using tqdm, for example) across all of the files. then in infer.py, use the function from wsinfer/write_geojson.py that writes all of the geojson files in parallel.
Yes, that makes sense. Working on this
@kaczmarj
i would suggest creating a function in wsinfer/write_geojson.py that can be used to write a single geojson file for the output of one slide
Correct me if I'm wrong, but this is being done by the convert function (previously in convert_csv_to_geojson.py), right? I'm looping over all CSVs in write_geojson.py and calling the convert function every time. We would need to parallelize this function itself, if I'm not wrong.
So I've written this code in write_geojson:
def make_geojson(csv: Path) -> None:
    df = pd.read_csv(csv)
    prob_cols = [col for col in df.columns.tolist() if col.startswith("prob_")]
    if not prob_cols:
        raise click.ClickException("Did not find any columns with prob_ prefix.")
    geojson = _dataframe_to_geojson(df, prob_cols)
    with open(f"""Results/model-outputs-geojson/{csv.name.split(".")[0]}.json""", "w") as f:
        json.dump(geojson, f)


def parallelize(csvs: list, results_dir: Path, num_workers: int) -> None:
    output = results_dir / "model-outputs-geojson"
    if not results_dir.exists():
        raise click.ClickException(f"results_dir does not exist: {results_dir}")
    if output.exists():
        raise click.ClickException("Output directory already exists.")
    if (
        not (results_dir / "model-outputs-csv").exists()
        and (results_dir / "patches").exists()
    ):
        raise click.ClickException(
            "Model outputs have not been generated yet. Please run model inference."
        )
    if not (results_dir / "model-outputs-csv").exists():
        raise click.ClickException(
            "Expected results_dir to contain a 'model-outputs' directory but it does"
            " not. Please provide the path to the directory that contains"
            " model-outputs, masks, and patches."
        )
    output.mkdir(parents=True, exist_ok=False)
    click.secho("Converting csv files to geojson files", fg="green")
    # for csv in tqdm.tqdm(csvs):
    #     convert(input=csv, output=f"""Results/model-outputs-geojson/{csv.name.split(".")[0]}.json""")
    process_map(make_geojson, csvs, max_workers=num_workers)
    click.secho("Finished.", fg="green")
It is able to convert the csvs to geojson files. I'm just not sure whether we're getting any performance optimizations. I'm converting 3 csvs to JSON files and I'm getting these results:

Converting csv files to geojson files
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:04<00:00, 1.46s/it]
Finished.
Correct me if I'm wrong but this is being done by the convert function (previously in convert_csv_to_geojson.py) right?
oh yes, you are correct!
I'm looping over all CSVs in write_geojson.py and calling the convert function every time. We would need to parallelize this function itself if I'm not wrong
yes, that's correct.
So i've written this code in write_geojson
looks good overall. the make_geojson function should take the results directory as an input. and replace the click.ClickException errors with FileExistsError or something similar. in output.mkdir, exist_ok should probably be True. and you can remove the click.secho calls.
Did the changes. The only issue with the make_geojson function having the extra argument is that process_map doesn't seem to support multiple arguments. I tried passing results_dir through it like this:

process_map(make_geojson, csvs, results_dir, max_workers=num_workers)

But it gives me an error saying that results_dir is not iterable. I guess it only takes iterables as arguments. I'm trying to figure out a way of passing them through functools.
One way is explicitly specifying it:

def make_geojson(csv: str) -> None:
    results_dir = "Results/model-outputs-geojson"
    # Rest of the code

This works:

func = partial(make_geojson, results_dir)
process_map(func, csvs, max_workers=num_workers)
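To spell out why partial works here: it freezes results_dir as the first positional argument, so the resulting callable takes a single argument and can be mapped over an iterable. A toy demonstration (make_label and the paths are hypothetical stand-ins for make_geojson and the real files):

```python
from functools import partial


def make_label(results_dir: str, csv_name: str) -> str:
    # hypothetical stand-in for make_geojson(results_dir, csv)
    return f"{results_dir}/{csv_name.split('.')[0]}.json"


# partial binds the first positional argument...
func = partial(make_label, "Results/model-outputs-geojson")

# ...so func now takes only one argument and can be mapped over the CSVs
print([func(c) for c in ["slide1.csv", "slide2.csv"]])
# → ['Results/model-outputs-geojson/slide1.json', 'Results/model-outputs-geojson/slide2.json']
```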
So this seems to be working. Should I open a PR for this?
fantastic, yes please submit a pr for this.
I pushed the changes to my main branch, but it failed all tests. I guess we'll have to update the tests as well. I've opened a PR, though
yes the tests will definitely have to be updated to use the new code
let's create geojson files at the end of each wsinfer run call. to do this, a few things will have to be done:
- add the conversion code in wsinfer/write_geojson.py. we should use write_geojson.py instead of geojson.py so we don't have a name collision with the third-party geojson library.
- modify wsinfer run to create geojson files at the end of the run. use the methods from wsinfer/write_geojson.py.
- write the geojson files to RESULTS_DIR/model-outputs-geojson.
- rename RESULTS_DIR/model-outputs to RESULTS_DIR/model-outputs-csv.
- add a --num-workers argument to wsinfer run. the geojson can take some time and we would speed it up tremendously by using multiple cores.

@swaradgat19 - would you like to tackle this?
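For the --num-workers item, a click option along these lines would work. This is a sketch: only the flag name comes from this issue, while the default value, help text, and the command body are assumptions:

```python
import click


@click.command()
@click.option(
    "--num-workers",
    type=int,
    default=1,
    show_default=True,
    help="Number of worker processes for the geojson conversion.",
)
def run(num_workers: int) -> None:
    # the real wsinfer run command would perform inference here and then
    # pass num_workers through to the parallel geojson writer
    click.echo(f"using {num_workers} workers")
```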