a-desmons closed this 2 years ago
Never mind, I think I figured it out :)
:-) So yeah, that line is just to retrieve an already "compiled" dataset from google cloud storage to the local colab instance.
So, does that mean you successfully compiled a dataset and found a way to upload it to colab?
Yep! I think everything is working fine, I'm just playing around with the cutout sizes at the moment and trying to find a good size.
Also, I managed to get the HSC bulk cutout task working by running the photoz_inference_data_preparation.ipynb document in your Tutorials -> PhotozCNN repo and running it through Colab. What I did before was copy the code into a new script and try to run it locally which didn't work, still not sure why, but at least it works now :)
@EiffL Are the object IDs stored anywhere in the TensorFlow Dataset? And if not, is there a reason why? (i.e. can I add them to the attributes?)
Hi @alidez, it depends on how you created the dataset, but yes, you can store them, and it's a good idea to do so.
Could you push your code to this repo? This way I can tell you exactly where that would happen.
I added my hsc_photoz code to the repo, I'm assuming I just add an 'object_id' attribute to the list of attributes in the hsc_photoz.py file?
I also added my copy of the Self-Supervised-Example Colab notebook if you want to see how everything is working. I'm currently working on assembling a sample of interesting galaxies to see what kind of similar galaxies the code spits out :)
Yes that's exactly right, just a matter of adding `object_id` to the list of attributes ;-)
Hi @EiffL, I added `object_id` to the list of attributes, but it gets saved in float format, i.e. 3.748525e+16 instead of 37485259083763257, which causes a bunch of objects to have the same `object_id`. Is there a way I can save the `object_id` attribute as an int?
I'm guessing it's somewhere in here in the hsc_photoz.py file, but I'm not sure where.
def _info(self) -> tfds.core.DatasetInfo:
    """Returns the dataset metadata."""
    # TODO(hsc_photoz): Specifies the tfds.core.DatasetInfo object
    return tfds.core.DatasetInfo(
        builder=self,
        description=_DESCRIPTION,
        features=tfds.features.FeaturesDict({
            'image': tfds.features.Tensor(shape=(128, 128, 5), dtype=tf.float32),
            'attrs': {k: tf.float32 for k in _attrs}
        }),
        # If there's a common (input, target) tuple from the
        # features, specify them here. They'll be used if
        # `as_supervised=True` in `builder.as_dataset`.
        supervised_keys=('image', 'attrs/specz_redshift'),
        homepage='https://dataset-homepage/',
        citation=_CITATION,
    )
So I tried to define the `object_id` outside of the list of attributes like so:
def _info(self) -> tfds.core.DatasetInfo:
    """Returns the dataset metadata."""
    # TODO(hsc_photoz): Specifies the tfds.core.DatasetInfo object
    return tfds.core.DatasetInfo(
        builder=self,
        description=_DESCRIPTION,
        features=tfds.features.FeaturesDict({
            'image': tfds.features.Tensor(shape=(128, 128, 5), dtype=tf.float32),
            'attrs': {k: tf.float32 for k in _attrs},
            'object_id': tf.int32
        }),
I also edited the `_generate_examples` definition so that it yields the `object_id` too. However, once I construct the TensorFlow Dataset, the IDs end up in this format (click on the image to actually see the ID):
It seems that some of those IDs are the last 5 digits of the actual HSC `object_id`, and I'm not sure why it's only giving me those 5 digits rather than the full 17-digit ID.
Also, the second image in the first row has ID = 24107 (and second row ID = 24106), but I don't even have an object in my dataset with that combination of digits in its HSC `object_id`, so no clue where it got that from (I have one ending in 2410 but no 7 or 6)... mysterious.
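The symptoms described above are consistent with two dtype limits, sketched below with NumPy only. This is an illustration under stated assumptions, not a diagnosis confirmed in the thread: it assumes the float-valued `object_id` passed through a 32-bit float, and that the `tf.int32` feature kept only the low 32 bits of each 64-bit ID. The ID is the one quoted above; the `feature_dtypes` mapping and the contents of `_attrs` are hypothetical. In the real `_info`, the analogous change would presumably be to declare `object_id` as `tf.int64`.

```python
import numpy as np

object_id = 37485259083763257  # example ID quoted in the thread

# float32 has a 24-bit significand, so at this magnitude neighbouring
# representable values are far apart and nearby IDs collapse together.
same_float = bool(np.float32(object_id) == np.float32(object_id + 1000))
gap = float(np.spacing(np.float32(object_id)))
print("IDs 1000 apart share one float32 value:", same_float)
print("gap between adjacent float32 values here:", gap)

# The ID does not fit in int32 but does fit in int64.
fits_int32 = object_id <= np.iinfo(np.int32).max
fits_int64 = object_id <= np.iinfo(np.int64).max
print("fits in int32:", fits_int32, "| fits in int64:", fits_int64)

# Assumption: a narrowing cast keeps only the low 32 bits, which would turn
# a 17-digit ID into a short, unrelated-looking number.
wrapped = int(np.array([object_id], dtype=np.int64).astype(np.int32)[0])
print("low 32 bits as int32:", wrapped)

# Hypothetical per-attribute dtype mapping: a 64-bit integer for the ID,
# float32 for everything else.
_attrs = ["object_id", "specz_redshift"]  # placeholder attribute names
feature_dtypes = {
    k: (np.int64 if k == "object_id" else np.float32) for k in _attrs
}
print(feature_dtypes)
```

If the low-32-bit assumption holds, it would account for IDs that look like short fragments of the real ones, while the float32 gap accounts for distinct objects sharing one saved `object_id`.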
@EiffL Hi Francois, I realised that I never actually built a new TensorFlow dataset; I just used your `hsc_photoz` one with my own data, which worked fine for what I was doing.
I'm now trying to build a new dataset called `hsc_tidal` that I can use to load my images with known tidal features.
So I run `tfds new hsc_tidal` and then edit the `hsc_tidal.py` file to define my dataset metadata and all that.
The problem is that I still get an error when I run `tfds build`, even if I edit my `hsc_tidal.py` file to be identical to your `hsc_photoz.py` file (in your repo, Tutorials/PhotozCNN/hsc_photoz, so it accesses your catalog.fits and cutout.hdf files from your ahw2019 Google Cloud bucket), change only the class name from `HscPhotoz` to `HscTidal`, and change the `__init__.py` file accordingly.
This is the output I get:
INFO[build.py]: Loading dataset from path: /content/Tutorials/PhotozCNN/hsc_tidal/hsc_tidal.py
INFO[build.py]: download_and_prepare for dataset hsc_tidal/1.0.0...
INFO[dataset_builder.py]: Generating dataset hsc_tidal (/root/tensorflow_datasets/hsc_tidal/1.0.0)
Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /root/tensorflow_datasets/hsc_tidal/1.0.0...
Dl Completed...: 0 url [00:00, ? url/s]
INFO[download_manager.py]: Skipping download of https://storage.googleapis.com/ahw2019/hsc_photoz/data/catalog.fits: File cached in /root/tensorflow_datasets/downloads/ahw2019_hsc_photoz_catalogQ_4BXlo_y3-oVvbFFyaVJB7jN_MCDQE-sB-HcDN4jLM.fits
Dl Completed...: 0 url [00:00, ? url/s]
Dl Completed...: 0% 0/1 [00:00<?, ? url/s]
Dl Completed...: 100% 1/1 [00:00<00:00, 370.52 url/s]
Dl Completed...: 100% 1/1 [00:00<00:00, 342.45 url/s]
Dl Completed...: 100% 1/1 [00:00<00:00, 319.32 url/s]
Dl Size...: 100% 71582400/71582400 [00:00<00:00, 20393855906.10 MiB/s]
Dl Completed...: 100% 1/1 [00:00<00:00, 256.36 url/s]
Dl Completed...: 0 url [00:00, ? url/s]
INFO[download_manager.py]: Skipping download of https://storage.googleapis.com/ahw2019/hsc_photoz/data/cutouts.hdf: File cached in /root/tensorflow_datasets/downloads/ahw2019_hsc_photoz_cutoutsJt4BBCrDdB76zu6zzjtJPIFfXK8u7MPOFijnlkTKRvU.hdf
Dl Completed...: 0 url [00:00, ? url/s]
Dl Completed...: 0% 0/1 [00:00<?, ? url/s]
Dl Completed...: 100% 1/1 [00:00<00:00, 652.81 url/s]
Dl Completed...: 100% 1/1 [00:00<00:00, 576.93 url/s]
Dl Completed...: 100% 1/1 [00:00<00:00, 516.54 url/s]
Dl Size...: 100% 22740792996/22740792996 [00:00<00:00, 11234605303450.50 MiB/s]
Dl Completed...: 100% 1/1 [00:00<00:00, 402.18 url/s]
Generating splits...: 0% 0/1 [00:00<?, ? splits/s]
Generating train examples...: 0 examples [00:00, ? examples/s]
Traceback (most recent call last):
  File "/usr/local/bin/tfds", line 8, in <module>
    sys.exit(launch_cli())
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_datasets/scripts/cli/main.py", line 102, in launch_cli
    app.run(main, flags_parser=_parse_flags)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_datasets/scripts/cli/main.py", line 97, in main
    args.subparser_fn(args)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_datasets/scripts/cli/build.py", line 192, in _build_datasets
    _download_and_prepare(args, builder)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_datasets/scripts/cli/build.py", line 345, in _download_and_prepare
    download_config=dl_config,
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 464, in download_and_prepare
    download_config=download_config,
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_datasets/core/dataset_builder.py", line 1196, in _download_and_prepare
    disable_shuffling=self.info.disable_shuffling,
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_datasets/core/split_builder.py", line 291, in submit_split_generation
    return self._build_from_generator(**build_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_datasets/core/split_builder.py", line 363, in _build_from_generator
    writer.write(key, example)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_datasets/core/tfrecords_writer.py", line 278, in write
    self._shuffler.add(key, serialized_example)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_datasets/core/shuffle.py", line 226, in add
    hkey = self._hasher.hash_key(key)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_datasets/core/hashing.py", line 94, in hash_key
    md5.update(_to_bytes(key))
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_datasets/core/hashing.py", line 69, in _to_bytes
    raise TypeError(f'Invalid key type: {data!r} ({type(data)})')
TypeError: Invalid key type: 36411448540287158 (<class 'numpy.int64'>)
It really makes no sense, because in principle this is the exact same code you used to build the `hsc_photoz` dataset, right?
I tried looking through the `hashing.py` file mentioned in the third-last line of the output, but I don't see anything wrong there. Am I missing something? Are you able to build a new TensorFlow dataset on your end?
Hmm, what is suspicious is the last line of this error message, which references 'numpy.int64'.
It would imply that, either in the dataset definition or in the generation function, you use numpy int64 instead of tf.int64. Is it possible that this used to work before but no longer does in a newer TF version?
In any case, can you push your code to this repo so that I can have a look?
Ahhh I see, I changed the `object_id` into a string in the `_generate_examples` function and it's working now. Thanks!
And yes, I'll add the code to this repo under the name `hsc_tidal`.
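The fix described above can be sketched as follows. The traceback shows `hash_key` rejecting a `numpy.int64` key, and the thread confirms that a string key works; a NumPy integer is not an instance of Python's built-in `int`, which is why the value looked fine but failed the type check. The `catalog` rows, their redshift values, the `generate_examples` function, and the example layout are all hypothetical stand-ins for the real `_generate_examples`, which would read the catalog and cutout files.

```python
import numpy as np

# Hypothetical stand-in for catalog rows; the IDs are the two quoted in the
# thread, the redshifts are placeholders.
catalog = [
    {"object_id": np.int64(36411448540287158), "specz_redshift": np.float32(0.0)},
    {"object_id": np.int64(37485259083763257), "specz_redshift": np.float32(0.0)},
]


def generate_examples(rows):
    """Yield (key, example) pairs, using a plain-string key for each row."""
    for row in rows:
        raw_id = row["object_id"]
        key = str(int(raw_id))  # convert the NumPy integer to a Python str
        example = {
            # Placeholder cutout with the shape used in the feature spec.
            "image": np.zeros((128, 128, 5), dtype=np.float32),
            "attrs": {"specz_redshift": row["specz_redshift"]},
            "object_id": raw_id,  # full 64-bit ID kept as a feature value
        }
        yield key, example


examples = list(generate_examples(catalog))
keys = [key for key, _ in examples]
print(keys)
# A NumPy integer is not a built-in int, so a strict type check rejects it.
print(isinstance(catalog[0]["object_id"], int))
```

Using the ID itself as the key keeps keys unique per object, provided the catalog has no duplicate rows.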
@EiffL I'm just a little confused about the first cell of the "Contrastive Learning on HSC images" notebook.
Do I need to run it when using my own data? What is it retrieving exactly? I thought the cutout file and catalogue file were retrieved in cell 5 by running the hsc_photoz cloned from github: