LSSTISSC / Tidalsaurus

Detecting Tidal Features to Uncover Galaxy Interactions

Create a TensorFlow Datasets dataset with downloaded data #5

Closed by a-desmons 2 years ago

a-desmons commented 2 years ago

@EiffL I'm just a little confused about the first cell of the "Contrastive Learning on HSC images" notebook.

# Retrieving pre-prepared data, it takes 2 minutes.
!gsutil -m -q cp -r gs://ahw2019/hsc_photoz/tensorflow_datasets /root/

Do I need to run it when using my own data? What is it retrieving exactly? I thought the cutout file and catalogue file were retrieved in cell 5 by running the hsc_photoz package cloned from GitHub:

%cd Tutorials/PhotozCNN
import tensorflow_datasets as tfds
import hsc_photoz

hsc_dset = tfds.load('hsc_photoz', split='train')
a-desmons commented 2 years ago

Never mind, I think I figured it out :)

EiffL commented 2 years ago

:-) So yeah, that line is just to retrieve an already "compiled" dataset from Google Cloud Storage to the local Colab instance.
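
For reference, here is a minimal sketch of what loading that pre-compiled copy looks like, assuming the files land under /root/tensorflow_datasets (the default tfds data_dir for the root user on Colab):

import tensorflow_datasets as tfds

# The TFRecords already exist under /root/tensorflow_datasets/hsc_photoz/,
# so tfds.load reads them directly instead of downloading and rebuilding anything.
hsc_dset = tfds.load('hsc_photoz', split='train', data_dir='/root/tensorflow_datasets')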

So, does that mean you successfully compiled a dataset and found a way to upload it to Colab?

a-desmons commented 2 years ago

Yep! I think everything is working fine, I'm just playing around with the cutout sizes at the moment and trying to find a good size.

Also, I managed to get the HSC bulk cutout task working by running the photoz_inference_data_preparation.ipynb notebook from your Tutorials -> PhotozCNN directory through Colab. What I did before was copy the code into a new script and try to run it locally, which didn't work; I'm still not sure why, but at least it works now :)

a-desmons commented 2 years ago

@EiffL Are the object IDs stored anywhere in the TensorFlow Dataset? And if not, is there a reason why? (i.e. can I add them to the attributes?)

EiffL commented 2 years ago

Hi @alidez, it depends on how you created the dataset, but yes, you can store them, and it's a good idea to do so.

Could you push your code to this repo? That way I can tell you exactly where that change would go.

a-desmons commented 2 years ago

I added my hsc_photoz code to the repo. I'm assuming I just add an 'object_id' attribute to the list of attributes in the hsc_photoz.py file?

I also added my copy of the Self-Supervised-Example Colab notebook if you want to see how everything is working. I'm currently working on assembling a sample of interesting galaxies to see what kind of similar galaxies the code spits out :)

EiffL commented 2 years ago

Yes that's exactly right, just a matter of adding object_id to the list of attributes ;-)
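
For illustration, a rough sketch of that edit (hypothetical: only 'specz_redshift' from your snippet is spelled out, the other existing entries of _attrs stay exactly as they are):

# In hsc_photoz.py, append the new column name to the module-level attribute list
_attrs = [
    'specz_redshift',   # existing entries stay unchanged
    # ... other existing attributes ...
    'object_id',        # new entry, stored alongside the other attributes
]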

a-desmons commented 2 years ago

Hi @EiffL, I added object_id to the list of attributes, but it gets saved in float format, e.g. 3.748525e+16 instead of 37485259083763257, which causes a bunch of objects to end up with the same object_id. Is there a way I can save the object_id attribute as an int?

I'm guessing it's somewhere in here in the hsc_photoz.py file, but I'm not sure where.

  def _info(self) -> tfds.core.DatasetInfo:
    """Returns the dataset metadata."""
    # TODO(hsc_photoz): Specifies the tfds.core.DatasetInfo object
    return tfds.core.DatasetInfo(
        builder=self,
        description=_DESCRIPTION,
        features=tfds.features.FeaturesDict({
            'image': tfds.features.Tensor(shape=(128, 128, 5), dtype=tf.float32),
            'attrs': {k: tf.float32 for k in _attrs}
        }),
        # If there's a common (input, target) tuple from the
        # features, specify them here. They'll be used if
        # `as_supervised=True` in `builder.as_dataset`.
        supervised_keys=('image', 'attrs/specz_redshift'),
        homepage='https://dataset-homepage/',
        citation=_CITATION,
    )

a-desmons commented 2 years ago

So I tried to define the object_id outside of the list of attributes like so:

  def _info(self) -> tfds.core.DatasetInfo:
    """Returns the dataset metadata."""
    # TODO(hsc_photoz): Specifies the tfds.core.DatasetInfo object
    return tfds.core.DatasetInfo(
        builder=self,
        description=_DESCRIPTION,
        features=tfds.features.FeaturesDict({
            'image': tfds.features.Tensor(shape=(128, 128, 5), dtype=tf.float32),
            'attrs': {k: tf.float32 for k in _attrs},
            'object_id': tf.int32
        }),

And also editing the _generate_examples definition so it also yields the object_id. However, once I construct the TensorFlow Dataset, the IDs end up in the format shown in the attached image (cutou_example; you have to click on the image to actually see the ID).

It seems that some of those IDs are the last 5 digits of the actual HSC object_id, and I'm not sure why it's only giving me those 5 digits rather than the full 17-digit ID.

Also, the second image in the first row has ID = 24107 (and the second row has ID = 24106), but I don't even have an object in my dataset with that combination of digits in its HSC object_id, so I have no clue where it got that from (I have one ending in 2410, but no 7 or 6)... mysterious.
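
One thing worth double-checking here: a 17-digit ID doesn't fit in either float32 (about 7 significant digits) or int32 (maximum 2,147,483,647), so declaring the feature as tf.int32 could itself be mangling the values. A minimal sketch of the declaration with tf.int64 instead, assuming the IDs fit in 64 bits (this is an illustration, not the code actually in the repo):

import tensorflow as tf
import tensorflow_datasets as tfds

# Same FeaturesDict as before, but with the ID stored as int64 so that a
# 17-digit value is preserved exactly.
features = tfds.features.FeaturesDict({
    'image': tfds.features.Tensor(shape=(128, 128, 5), dtype=tf.float32),
    'attrs': {k: tf.float32 for k in _attrs},  # _attrs as defined in hsc_photoz.py
    'object_id': tf.int64,
})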

a-desmons commented 2 years ago

@EiffL Hi Francois, so I realised that I never actually built a new TensorFlow dataset; I just used your hsc_photoz one but with my own data, which worked fine for what I was doing.

I'm trying to build a new dataset now called hsc_tidal that I can use to load my images with known tidal features. So I run tfds new hsc_tidal and then edit the hsc_tidal.py file to define my dataset metadata and all that.

The problem is that even if I edit my hsc_tidal.py file to be identical to your hsc_photoz.py file (the one in Tutorials/PhotozCNN/hsc_photoz in your repo, so it accesses your catalog.fits and cutouts.hdf files from your ahw2019 Google Cloud bucket), and just change the class name from HscPhotoz to HscTidal and update the __init__.py file accordingly, I still get an error when I try to run tfds build.

This is the output I get:

INFO[build.py]: Loading dataset  from path: /content/Tutorials/PhotozCNN/hsc_tidal/hsc_tidal.py
INFO[build.py]: download_and_prepare for dataset hsc_tidal/1.0.0...
INFO[dataset_builder.py]: Generating dataset hsc_tidal (/root/tensorflow_datasets/hsc_tidal/1.0.0)
Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /root/tensorflow_datasets/hsc_tidal/1.0.0...
Dl Completed...: 0 url [00:00, ? url/s]

INFO[download_manager.py]: Skipping download of https://storage.googleapis.com/ahw2019/hsc_photoz/data/catalog.fits: File cached in /root/tensorflow_datasets/downloads/ahw2019_hsc_photoz_catalogQ_4BXlo_y3-oVvbFFyaVJB7jN_MCDQE-sB-HcDN4jLM.fits
Dl Completed...: 0 url [00:00, ? url/s]
Dl Completed...:   0% 0/1 [00:00<?, ? url/s]
Dl Completed...: 100% 1/1 [00:00<00:00, 370.52 url/s]
Dl Completed...: 100% 1/1 [00:00<00:00, 342.45 url/s]
Dl Completed...: 100% 1/1 [00:00<00:00, 319.32 url/s]
Dl Size...: 100% 71582400/71582400 [00:00<00:00, 20393855906.10 MiB/s]
Dl Completed...: 100% 1/1 [00:00<00:00, 256.36 url/s]
Dl Completed...: 0 url [00:00, ? url/s]

INFO[download_manager.py]: Skipping download of https://storage.googleapis.com/ahw2019/hsc_photoz/data/cutouts.hdf: File cached in /root/tensorflow_datasets/downloads/ahw2019_hsc_photoz_cutoutsJt4BBCrDdB76zu6zzjtJPIFfXK8u7MPOFijnlkTKRvU.hdf
Dl Completed...: 0 url [00:00, ? url/s]
Dl Completed...:   0% 0/1 [00:00<?, ? url/s]
Dl Completed...: 100% 1/1 [00:00<00:00, 652.81 url/s]
Dl Completed...: 100% 1/1 [00:00<00:00, 576.93 url/s]
Dl Completed...: 100% 1/1 [00:00<00:00, 516.54 url/s]
Dl Size...: 100% 22740792996/22740792996 [00:00<00:00, 11234605303450.50 MiB/s]
Dl Completed...: 100% 1/1 [00:00<00:00, 402.18 url/s]
Generating splits...:   0% 0/1 [00:00<?, ? splits/s]
Generating train examples...: 0 examples [00:00, ? examples/s]
Traceback (most recent call last):
  File "/usr/local/bin/tfds", line 8, in <module>
    sys.exit(launch_cli())
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_datasets/scripts/cli/main.py", line 102, in launch_cli
    app.run(main, flags_parser=_parse_flags)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_datasets/scripts/cli/main.py", line 97, in main
    args.subparser_fn(args)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_datasets/scripts/cli/build.py", line 192, in _build_datasets
    _download_and_prepare(args, builder)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_datasets/scripts/cli/build.py", line 345, in _download_and_prepare
    download_config=dl_config,
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 464, in download_and_prepare
    download_config=download_config,
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 1196, in _download_and_prepare
    disable_shuffling=self.info.disable_shuffling,
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_datasets/core/split_builder.py", line 291, in submit_split_generation
    return self._build_from_generator(**build_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_datasets/core/split_builder.py", line 363, in _build_from_generator
    writer.write(key, example)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_datasets/core/tfrecords_writer.py", line 278, in write
    self._shuffler.add(key, serialized_example)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_datasets/core/shuffle.py", line 226, in add
    hkey = self._hasher.hash_key(key)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_datasets/core/hashing.py", line 94, in hash_key
    md5.update(_to_bytes(key))
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_datasets/core/hashing.py", line 69, in _to_bytes
    raise TypeError(f'Invalid key type: {data!r} ({type(data)})')
TypeError: Invalid key type: 36411448540287158 (<class 'numpy.int64'>)

It really makes no sense, because in principle this is the exact same code you used to build the hsc_photoz dataset, right?

I tried looking through the hashing.py file mentioned in the third-to-last line of the output, but I don't see anything wrong there. Am I missing something? Are you able to build a new TensorFlow dataset on your end?

EiffL commented 2 years ago

Hmm, what is suspicious is the last line of this error message, which references numpy.int64.

It would imply that either in the dataset definition or in the generation function you are using numpy int64 instead of tf.int64. It may be possible that this used to work before but no longer does with a newer TF version?

In any case, can you push your code to this repo so that I can have a look?

a-desmons commented 2 years ago

Ahhh, I see. I changed the object_id into a string in the _generate_examples function and it's working now. Thanks!
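
In case it's useful to anyone else hitting this, a rough sketch of the kind of change involved (hypothetical variable names, not the actual code in the repo): tfds hashes the key yielded by _generate_examples and rejects a numpy.int64 there, so converting it to a plain Python string works.

def _generate_examples(self, catalog, cutouts):
    """Hypothetical sketch of yielding (key, example) pairs for tfds."""
    for index, row in enumerate(catalog):
        object_id = int(row['object_id'])   # drop the numpy dtype
        yield str(object_id), {             # a plain str key keeps the tfds hasher happy
            'image': cutouts[index].astype('float32'),
            'attrs': {k: float(row[k]) for k in _attrs},
            'object_id': object_id,
        }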

And yes, I'll add the code to this repo under the name hsc_tidal.