constantinpape / cluster_tools

Distributed segmentation for bio-image-analysis
MIT License

Add change from signed int to unsigned int upon copying data #33

Closed martinschorb closed 2 years ago

martinschorb commented 2 years ago

I don't know if you want to test for this in all potential output formats, or if we simply trust numpy to do it right.

martinschorb commented 2 years ago

We should also add a test for this, I think the best place is https://github.com/constantinpape/cluster_tools/blob/master/test/downscaling/test_downscaling.py.

Do you want to add signed test data to the archive, or would you prefer generating it?

My suggestion would be to use the hdf5 example, simply changing its dtype and shifting its values.

constantinpape commented 2 years ago

My suggestion would be to use the hdf5 example, simply changing its dtype and shifting its values.

Yes, I agree that sounds like the best solution.
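A minimal numpy-only sketch of that conversion (the actual hdf5 I/O and file paths are left out; `data` stands in for the uint8 raw volume from the example file):

```python
import numpy as np

# Stand-in for the uint8 raw volume loaded from the hdf5 example.
data = np.arange(0, 256, dtype="uint8")

# Change the dtype and shift the values: uint8 [0, 255] -> int8 [-128, 127].
# Promote to int16 first so the subtraction cannot overflow.
signed = (data.astype("int16") - 128).astype("int8")

print(signed.min(), signed.max())  # -128 127
```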

martinschorb commented 2 years ago

When I try to run an import it also fails for me.

import_data/utils.py:102, in downscale(in_path, in_key, out_path, resolution, scale_factors, chunks, tmp_folder, target, max_jobs, block_shape, library, library_kwargs, metadata_format, out_key, unit, source_name, roi_begin, roi_end, int_to_uint)
    100 ret = luigi.build([t], local_scheduler=True)
    101 if not ret:
--> 102     raise RuntimeError("Downscaling failed")

Is there a way to get a more detailed idea what exactly went wrong?

constantinpape commented 2 years ago

It's in the test log:

 ERROR: [pid 2798] Worker Worker(salt=763965705, workers=1, host=fv-az196-745, username=runner, pid=2798) failed    CopyVolumeLocal(tmp_folder=./tmp, max_jobs=2, config_dir=./tmp/config, input_path=/home/runner/work/cluster_tools/cluster_tools/test_data/sampleA.n5, input_key=volumes/raw/s0, output_path=./tmp/data.n5, output_key=data/s0, prefix=initial_scale, dtype=None, int_to_uint=False, fit_to_roi=False, effective_scale_factor=[], dimension_separator=None, dependency=DummyTask)
Traceback (most recent call last):
  File "/usr/share/miniconda/envs/cluster_env/lib/python3.8/site-packages/luigi/worker.py", line 191, in run
    new_deps = self._run_get_new_deps()
  File "/usr/share/miniconda/envs/cluster_env/lib/python3.8/site-packages/luigi/worker.py", line 133, in _run_get_new_deps
    task_gen = self.task.run()
  File "/home/runner/work/cluster_tools/cluster_tools/cluster_tools/cluster_tasks.py", line 95, in run
    raise e
  File "/home/runner/work/cluster_tools/cluster_tools/cluster_tools/cluster_tasks.py", line 81, in run
    self.run_impl()
  File "/home/runner/work/cluster_tools/cluster_tools/cluster_tools/copy_volume/copy_volume.py", line 119, in run_impl
    dtype = DTYPE_MAPPING.get(dtype, dtype)
UnboundLocalError: local variable 'dtype' referenced before assignment
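For reference, this error class is easy to reproduce with a toy function; this is illustrative only, not the actual copy_volume.py code:

```python
# Toy reproduction of the failure mode: `dtype` is only assigned in one
# branch, so the later use can hit an unbound local name.
def resolve_dtype(ds_dtype, int_to_uint=False):
    if int_to_uint:
        dtype = "u" + ds_dtype  # e.g. int32 -> uint32
    # when int_to_uint is False, `dtype` was never assigned ...
    return dtype  # ... so this raises UnboundLocalError

try:
    resolve_dtype("int32", int_to_uint=False)
except UnboundLocalError:
    print("UnboundLocalError, as in the log above")
```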
martinschorb commented 2 years ago

OK. Runs now. Now to the tests...

  • a check that makes sure that the dtype argument (which is available through self.dtype) is compatible with the resulting unsigned dtype if self.dtype is not None and int_to_uint is True

As it is now, the dtype argument is simply ignored when int_to_uint is requested (dtype gets overwritten by the unsigned data dtype).

Would you like a warning issued if both arguments are given?

martinschorb commented 2 years ago

If we assume that the dtype provided by elf.io.open_file is a numpy.dtype, the check for it being a signed int should be sufficient.
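Assuming a numpy.dtype (or anything np.dtype can normalise), such a check could look like this sketch:

```python
import numpy as np

# Sketch of the proposed signedness check; np.dtype normalises both
# canonical names and byte-order spellings before the test.
def is_signed_int(dtype):
    return np.issubdtype(np.dtype(dtype), np.signedinteger)

print(is_signed_int("int8"))    # True
print(is_signed_int("<i4"))     # True
print(is_signed_int("uint16"))  # False
```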

constantinpape commented 2 years ago

If we assume that the dtype provided by elf.io.open_file is a numpy.dtype, the check for it being a signed int should be sufficient.

No, this is more complex than you think (and also than I initially thought), see line 119. If dtype is passed as an argument, none of this will make sense. Please add a check at the beginning that raises an error if self.int_to_uint is True and self.dtype is not None.
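A sketch of such a guard, using a hypothetical stand-in class rather than the actual task implementation (only self.dtype and self.int_to_uint are assumed):

```python
# Hypothetical guard: raise early so dtype and int_to_uint stay exclusive.
class CopyVolumeSketch:
    def __init__(self, dtype=None, int_to_uint=False):
        self.dtype = dtype
        self.int_to_uint = int_to_uint

    def check_arguments(self):
        if self.int_to_uint and self.dtype is not None:
            raise ValueError(
                "dtype must not be set when int_to_uint is True; "
                "the unsigned output dtype is derived from the input dtype"
            )

CopyVolumeSketch(int_to_uint=True).check_arguments()  # passes
```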

martinschorb commented 2 years ago

Doesn't, in the end, https://github.com/martinschorb/cluster_tools/blob/c05642dba9ec1ebc7b534c7edb03034ba8917eda/cluster_tools/copy_volume/copy_volume.py#L125 override everything before it, so that only ds.dtype, which comes from elf, is relevant for the check?

constantinpape commented 2 years ago

Doesn't, in the end, https://github.com/martinschorb/cluster_tools/blob/c05642dba9ec1ebc7b534c7edb03034ba8917eda/cluster_tools/copy_volume/copy_volume.py#L125 override everything before it, so that only ds.dtype, which comes from elf, is relevant for the check?

I overlooked this, but what you did there was not quite correct: you need to use dtype instead of ds.dtype there (I updated this already). The reason is that there is no guarantee that the dtype will be represented by a string like int32; it could also be <i4. elf doesn't implement any special checks for this, it just returns the .dtype of the underlying storage class, e.g. the h5py dataset, zarr array or z5py dataset. That's the whole reason for the dtype mapping happening here (which could be even more complete, and there is hopefully a better way to do this that I could adopt eventually, but that's out of scope for now).
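The spelling issue is easy to see in a quick numpy check:

```python
import numpy as np

# The same dtype can be spelled differently by different backends:
# "<i4" (byte-order string) and "int32" compare equal as dtypes, but
# not as raw strings -- hence the need for a normalisation / mapping step.
a = np.dtype("<i4")
b = np.dtype("int32")

print(a == b)            # True
print(a.name)            # int32
print("<i4" == "int32")  # False
```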

constantinpape commented 2 years ago

I went ahead and fixed the remaining issues (there were a couple; I can only encourage you again to use a linter: it helps tremendously in writing correct Python code, because you will find many issues without actually having to run the code) and added the test.

Also, I think that the code you used to translate between dtypes was wrong: It should be

(data-np.iinfo(data.dtype).min).astype(dtype)

instead of

(data-np.iinfo(data.dtype).min-1).astype(dtype)

see the screenshots below:

[screenshot: int-to-uint-incorrect]

[screenshot: int-to-uint-correct]
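The difference can also be checked directly with numpy; a quick comparison of both formulas for int8 -> uint8 (promoting to int16 first so the subtraction cannot overflow):

```python
import numpy as np

data = np.array([-128, -1, 0, 127], dtype="int8")
dtype = "uint8"

# correct: the full int8 range [-128, 127] maps exactly onto [0, 255]
correct = (data.astype("int16") - np.iinfo(data.dtype).min).astype(dtype)
print(correct)    # [  0 127 128 255]

# with the extra -1, the minimum wraps around to 255 instead of 0
incorrect = (data.astype("int16") - np.iinfo(data.dtype).min - 1).astype(dtype)
print(incorrect)  # [255 126 127 254]
```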

constantinpape commented 2 years ago

I'll go ahead and merge this.

martinschorb commented 2 years ago

Yes, for some reason I assumed that the minimum of int8 was -127, hence I added the offset.
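For reference, numpy's integer limits confirm this (the signed minimum is -(2**(bits-1))):

```python
import numpy as np

# int8 spans [-128, 127]: the magnitude of the minimum exceeds the
# maximum by one, so no extra offset is needed in the shift.
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)    # -128 127
print(np.iinfo(np.int16).min, np.iinfo(np.int16).max)  # -32768 32767
```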