
Reproducing results on zebrafinch data #20

Open mdraw opened 1 year ago

mdraw commented 1 year ago

I am currently trying to reproduce the MTLSD results on the zebrafinch dataset.

The dataset itself can be downloaded successfully using the code in lsd_data_download.ipynb, but I could not find any JSON config file, model checkpoint, or zebrafinch-specific training or prediction script. For the fib25 dataset there is some dataset-specific code included in the GitHub repository, which I have tried to adapt for the zebrafinch data, but without success so far. The large-scale prediction scripts seem to expect a certain directory structure, as indicated here: https://github.com/funkelab/lsd/blob/master/lsd/tutorial/scripts/01_predict_blockwise.py#L47-L59 - which does not seem to be included in the public code and data repositories I have found so far. Would you kindly share these zebrafinch-related files if that is possible?

I would also like to ask whether a PyTorch-based version of the whole training and prediction workflow is available somewhere, and whether there have been any updates on the Singularity image. I am asking because the new tutorials are PyTorch-based, but the public 3D prediction-related code and the Singularity image still rely on TensorFlow (related: #6).

sheridana commented 1 year ago

Hi @mdraw, yes, unfortunately it is a bit tricky to use the old singularity container + tensorflow scripts because of deprecated cuda versions that don't play well with updated drivers. I am working on getting a new singularity container working with the old scripts and putting together a tutorial for all datasets from the paper (including uploading all relevant checkpoints to the aws bucket). I hope to have this done within the next couple of weeks. Thank you for your patience!

mdraw commented 1 year ago

Thank you @sheridana, I appreciate that!

sheridana commented 1 year ago

Hey @mdraw here is a repo showing how to run zfinch nets with pretrained checkpoints and singularity containers. Still a work in progress as I need to add the fibsem nets, but should be a good starting point already. Let me know if you run into any problems!

mdraw commented 1 year ago

Sounds great, but I'm currently getting a 404 on the address you linked, even when logged in. Is it a private repo that requires an invitation?

sheridana commented 1 year ago

Yeah sorry just fixing some stuff, should hopefully be done soon!

sheridana commented 1 year ago

@mdraw should be good now

mdraw commented 1 year ago

Dear @sheridana, thanks again for the additional resources that you have provided in https://github.com/funkelab/lsd_nm_experiments. They have helped me a lot in running my own LSD training experiments with the zebrafinch cubes and doing evaluations on a custom set of small validation cubes.

However, I also wanted to run the full evaluation on the same test and validation data as the paper (the "benchmark region") so that I can meaningfully compare different methods, and I still have some open questions about this:

  1. The zarr store funke/zebrafinch/testing/ground_truth/data.zarr/ in the s3 bucket only contains a neuropil_mask dataset, no raw data. For hemibrain and fib25 the corresponding file in the ground_truth folder also contains raw data. According to the https://github.com/funkelab/lsd/blob/master/lsd/tutorial/notebooks/lsd_data_download.ipynb notebook, the full raw data for testing is available from the google bucket (via CloudVolume, in xyz voxel space), but how exactly do you feed this data into the block prediction pipeline? Do you have a local copy of the complete dataset in zarr format, or do you load chunks on demand from the google bucket using CloudVolume? In the latter case, do you use a special kind of source request for the gunpowder pipeline?
  2. Can you please share the config.json files for the parallel processing pipeline? The readme in the lsd repository hints at what they could look like, but I can't find any configs that are needed for reproducing the zebrafinch results. The mknet.py files in the lsd_nm_experiments repo produce config.json files, but those are only suitable for training, not inference. It looks like each of the scripts in https://github.com/funkelab/lsd/tree/master/lsd/tutorial/scripts should be called with a dataset-specific config file as an argument.

I know that testing additional code and configuration and adapting to code/environment changes can be quite time-consuming and annoying, so I really don't expect you to do this before sharing - you can just share the original / currently untested files from the evaluation experiments (if they are still available) and I will test them and report back here.

mdraw commented 1 year ago

@sheridana, is there any news on this? I am still not sure how to set everything up for a fair comparison to the LSD baselines. If providing the original config files is not possible, could you say whether you remember any differences in the config settings between the example config dicts in the README.md and the zebrafinch configurations? It would already help me a lot to know whether the non-obvious settings such as thresholds, context windows, block sizes, etc. were the same for all datasets and resemble the "example" sections in the README.md.

sheridana commented 1 year ago

Hi @mdraw, sorry, I have been very saturated.

  1. We stored a local copy of the raw data as a zarr. This was created by downloading the data using CloudVolume, but parallelized using Daisy. You could do something similar with Dask, but essentially the block task would read the data in a given roi from the bucket and write it out to the same block in the zarr container. E.g. you would get your cloud volume and create a zarr container:

```python
from cloudvolume import CloudVolume
import zarr

cloud_vol = CloudVolume("bucket_url")
out_container = zarr.open("path/to/zarr", "a")
```

Then your main function would get the total roi of the volume using cloud volume, something like:

```python
size = cloud_vol.info['scales'][0]['size'][::-1]
```

which might be outdated by now, but calling info on the volume should give you relevant metadata. You need the total roi of the volume so daisy/dask knows how to tile over it. If using zarr with daisy, we assume nanometers zyx (hence the flip above, since cloud volume stores xyz). I think cloud volume also stores in voxel space, so you would then need to scale that by the voxel size.
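
Putting that together, roughly (using the 20x9x9 nm zebrafinch voxel size mentioned further down, and assuming the volume starts at offset zero):

```python
import daisy

# zebrafinch voxel size in nm, zyx (see the block size discussion below)
voxel_size = daisy.Coordinate((20, 9, 9))

# CloudVolume reports the size in voxels, xyz, so flip to zyx and scale to nm
size_voxels = daisy.Coordinate(cloud_vol.info['scales'][0]['size'][::-1])

# total roi in world units (nm, zyx); offset assumed to be zero here
total_roi = daisy.Roi((0, 0, 0), size_voxels * voxel_size)
```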

Then the block function would take the volumes along with a block (which would be created by daisy/dask):

```python
def write_to_block(cloud_vol, out_container, block):

    # load data from the cloud volume into a numpy array inside the block roi
    # (cloud volume indexes xyz in voxel space, so the zyx/nm roi still needs to
    # be flipped and scaled by the voxel size, as noted above)
    data = cloud_vol[block.read_roi.to_slices()]

    # write the data out to the zarr container inside the block roi
    out_container["raw"][block.write_roi.to_slices()] = data
```

The main process would then handle the distribution across blocks.
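
Roughly, the main process would then look something like this (pre-1.0 daisy API, newer daisy wraps this in a daisy.Task; the block size and worker count here are just placeholders):

```python
import daisy

# block size in nm, zyx; should be a multiple of the voxel size
block_roi = daisy.Roi((0, 0, 0), (3600, 3600, 3600))

daisy.run_blockwise(
    total_roi=total_roi,
    read_roi=block_roi,
    write_roi=block_roi,
    process_function=lambda block: write_to_block(cloud_vol, out_container, block),
    num_workers=8,
    fit='shrink')
```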

You could also definitely handle this with a custom gunpowder node to perform on-the-fly fetching. I did do this a while back, but you'd need to be extra careful about controlling the number of requests to the cloud volume, as it can rapidly incur a cost.
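
If you wanted to go that route, a rough (untested) sketch of such a node could look like the following; the dtype, the zero-based voxel grid, and the exact xyz/zyx handling are assumptions you would want to check against the volume:

```python
import gunpowder as gp
import numpy as np

class CloudVolumeSource(gp.BatchProvider):
    """Hypothetical source node fetching raw data from a CloudVolume on the fly."""

    def __init__(self, cloud_vol, array_key, total_roi, voxel_size):
        self.cloud_vol = cloud_vol
        self.array_key = array_key
        self.total_roi = total_roi                   # nm, zyx
        self.voxel_size = gp.Coordinate(voxel_size)  # e.g. (20, 9, 9) for zebrafinch

    def setup(self):
        self.provides(
            self.array_key,
            gp.ArraySpec(
                roi=self.total_roi,
                voxel_size=self.voxel_size,
                dtype=np.uint8,  # assumed dtype of the raw data
                interpolatable=True))

    def provide(self, request):
        batch = gp.Batch()

        # requested roi in world units (nm, zyx) -> voxel coordinates
        roi = request[self.array_key].roi
        voxel_roi = roi / self.voxel_size

        # cloud volume indexes xyz, gunpowder zyx: reverse the slices, drop the
        # channel dimension, then transpose the result back to zyx
        data = self.cloud_vol[voxel_roi.to_slices()[::-1]]
        data = np.asarray(data)[..., 0].transpose()

        spec = self.spec[self.array_key].copy()
        spec.roi = roi
        batch[self.array_key] = gp.Array(data, spec)

        return batch
```

As said above, you would want to keep the number of concurrent requests to the bucket small when doing something like this.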

  2. Yes, we controlled post-processing via configs, and these are pretty specific to our setup on the Janelia cluster. For inference, here is an example mtlsd config for the zebrafinch:

```json
{
  "experiment": "zebrafinch",
  "setup": "setup02",
  "iteration": 400000,
  "raw_file": "path/to/container.json",
  "raw_dataset": "volumes/raw",
  "out_file": "path/to/store/data",
  "file_name": "zebrafinch.zarr",
  "num_workers": 60,
  "db_host": "your_mongodb_host",
  "db_name": "your_database_name",
  "queue": "gpu_rtx"
}
```

where container.json specifies the benchmark roi to use for the raw dataset in the zarr container (sizes are in nm, zyx):

```json
{
  "container": "path/to/raw.zarr",
  "offset": [4000, 7200, 4500],
  "size": [106000, 83700, 87300]
}
```

Notice here that there is no explicit block size as it is instead determined by the network input/output shape as created in the mknet.py files. See here. The interface between gunpowder and daisy is handled by a DaisyRequestBlocks node.
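
Roughly, the end of the prediction pipeline then looks something like this (the config key names, voxel size, and array keys here are just illustrative, and `pipeline` stands for the already assembled source -> predict -> write pipeline):

```python
import json
import gunpowder as gp

raw = gp.ArrayKey('RAW')
affs = gp.ArrayKey('AFFS')
lsds = gp.ArrayKey('LSDS')

# network input/output shapes (in voxels) as written out by mknet.py
with open('config.json') as f:
    net_config = json.load(f)

voxel_size = gp.Coordinate((20, 9, 9))  # nm, zyx
input_size = gp.Coordinate(net_config['input_shape']) * voxel_size
output_size = gp.Coordinate(net_config['output_shape']) * voxel_size

# one network pass per daisy block
chunk_request = gp.BatchRequest()
chunk_request.add(raw, input_size)
chunk_request.add(affs, output_size)
chunk_request.add(lsds, output_size)

# ... build the usual source -> predict -> write pipeline here, then:
pipeline += gp.DaisyRequestBlocks(
    chunk_request,
    roi_map={raw: 'read_roi', affs: 'write_roi', lsds: 'write_roi'},
    num_workers=1)
```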

For watershed:

```json
{
  "experiment": "zebrafinch",
  "setup": "setup02",
  "iteration": 400000,
  "affs_file": "path/to/affs.zarr",
  "affs_dataset": "/volumes/affs",
  "fragments_file": "path/to/fragments.zarr",
  "fragments_dataset": "/volumes/fragments",
  "block_size": [3600, 3600, 3600],
  "context": [240, 243, 243],
  "db_host": "your_mongodb_host",
  "db_name": "your_database_name",
  "num_workers": 100,
  "fragments_in_xy": true,
  "epsilon_agglomerate": 0.1,
  "mask_file": "path/to/mask.zarr",
  "mask_dataset": "volumes/mask",
  "queue": "normal",
  "filter_fragments": 0.05
}
```

The same logic applies for agglomeration, extracting the segmentation, etc. The main difference here is that the block size and context are handled explicitly. The block size should be chosen similarly to how you choose the chunk size when creating a zarr dataset: you need to consider things like storage, I/O, the network file system, access frequency, etc. Too small a block size will create many more files to deal with in the zarr container; too large a block size will slow down processing within a block. You probably just want more than 64 and fewer than 512 voxels in each dim. Here we set this in nanometers, and it needs to be divisible by the voxel size (since daisy functions in world space rather than voxel space). So [3600]*3 / [20,9,9] = [180,400,400], which is pretty consistent with the network size in this case. For the context, you just need a few extra voxels for the read roi, somewhere between 10 and 30, and it needs to be a multiple of the voxel size.
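
As a quick sanity check on those numbers (values taken from the watershed config above):

```python
voxel_size = (20, 9, 9)           # zebrafinch voxel size in nm, zyx
block_size = (3600, 3600, 3600)   # nm
context = (240, 243, 243)         # nm

# both must be exact multiples of the voxel size, since daisy works in world units
assert all(b % v == 0 for b, v in zip(block_size, voxel_size))
assert all(c % v == 0 for c, v in zip(context, voxel_size))

print([b // v for b, v in zip(block_size, voxel_size)])  # [180, 400, 400] voxels
print([c // v for c, v in zip(context, voxel_size)])     # [12, 27, 27] voxels
```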

I can still send the config files if that helps? Although they are all pretty consistent, just with different network/data paths and databases. Also, a lot of the logic in these scripts is very specific to the Janelia cluster and file system. Additionally, a lot of the code is outdated, e.g. daisy has undergone a large refactor and the persistence-related code was moved here. Let me know if you want to discuss further at some point, might be easier via zoom.

PG-Gu commented 1 year ago

Dear @sheridana ,

Due to some technical issues on my side, I am currently not able to create affinity graphs myself.

I wonder if you would mind sharing some smaller crops of the affinity graphs you used in the experiments; that would be immensely helpful for testing my method for creating segmentations.

Thanks for your contributions and the information you have kindly provided in this thread.

sheridana commented 1 year ago

@PG-Gu here are affs from the autocontext net inside the ~11 micron zfinch roi. The offset is in the zarr attrs file (nm, zyx).
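
You can read the offset back with something like the following (the path and dataset name are placeholders, and the attribute key is assumed to be `offset`):

```python
import zarr

f = zarr.open("path/to/affs.zarr", "r")
print(f["volumes/affs"].attrs["offset"])  # nm, zyx
```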

ddd9898 commented 6 months ago

Hi @mdraw, glad to find someone with a similar question! Have you reproduced the evaluation results on the zebrafinch data so far? I am now confused about how the manually traced skeletons stored in the s3 bucket funke/zebrafinch/testing/ground_truth/testing/consolidated/zebrafinch_gt_skeletons_new_gt_9_9_20_testing are aligned to the pixels in the RoIs.

Also, the last example config dict in the README.md provides the location of the 11_micron_roi. Did you find the offset info for the other RoIs (e.g., 18, 25, 32, 40...)?

It would be very nice of you to share your progress over the past several months.