aws / amazon-s3-plugin-for-pytorch

Apache License 2.0
168 stars 21 forks

Reading object metadata #10

Open rkoo19 opened 2 years ago

rkoo19 commented 2 years ago

I was wondering if there is a way to also fetch an object's metadata when reading the object itself. I am trying to use ImageNet to train an image classification model, similar to what is done in s3_imagenet_example.py, but I am trying to add the image class as metadata on the object itself.

rkoo19 commented 2 years ago

So, if I use the map-style S3Dataset, I want to be able to fetch the object itself from my S3 bucket and also fetch a piece of metadata associated w/ that object.

rkoo19 commented 2 years ago

I was reading the class definition for S3Dataset, and I saw that when getting an object, it uses the filename to fetch the object but does nothing with metadata. If there is not already a way to do so, I would like to modify the procedure for getting an object from S3 so that it also fetches the metadata associated w/ the object. I hope this makes sense, and I would appreciate any help I could get!

[Screenshot: Screen Shot 2021-11-15 at 12 43 14 PM]

johnbensnyder commented 2 years ago

How is the object metadata stored? One possibility might be to use the S3BaseClass to write a custom method of reading the file object from S3, then use the filename to read metadata from some other source. For example, here's the setup I use to read an image and annotations for the COCO dataset.

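# Imports the snippet relies on; _pywrap_s3_io is the plugin's native S3 module (import omitted here).
import io
import os

from PIL import Image

# The functions below are methods of a COCO-style dataset class.
def _load_image(self, image_id):
    # Create the S3 handler lazily, on first use.
    if self.handler is None:
        self.handler = _pywrap_s3_io.S3Init()
    filename = os.path.join(self.root, self.coco.loadImgs(image_id)[0]["file_name"])
    fileobj = self.handler.s3_read(filename)  # raw bytes of the S3 object
    return Image.open(io.BytesIO(fileobj)).convert("RGB")

def _load_target(self, image_id):
    # Annotations come from the COCO index, not from S3 object metadata.
    return self.coco.loadAnns(self.coco.getAnnIds(image_id))

def __getitem__(self, idx):
    image_id = self.ids[idx]
    img = self._load_image(image_id)
    anno = self._load_target(image_id)
    target = self.build_target(anno, img.size)
    if self._transforms is not None:
        img, target = self._transforms(img, target)
    return img, target, idx
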
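
For the ImageNet case, the same idea (bytes from S3, labels from some other source keyed by filename) might look roughly like the sketch below. Everything in it is hypothetical and not part of the plugin: `KeyLabelledDataset`, `read_fn`, and the `labels` table are illustrative names, and `read_fn` is assumed to be any callable that returns an object's bytes for a given key (for example, a thin wrapper around the `s3_read` call in the snippet above).

```python
from typing import Callable, Dict, List, Tuple


class KeyLabelledDataset:
    """Hypothetical map-style dataset: bytes come from read_fn, labels from a side table."""

    def __init__(
        self,
        keys: List[str],
        labels: Dict[str, int],
        read_fn: Callable[[str], bytes],
    ) -> None:
        self.keys = keys        # object keys / filenames to serve
        self.labels = labels    # key -> class index, loaded from any other source
        self.read_fn = read_fn  # assumed: returns the raw bytes for a key

    def __len__(self) -> int:
        return len(self.keys)

    def __getitem__(self, idx: int) -> Tuple[bytes, int]:
        key = self.keys[idx]
        # Pair the object's bytes with the label looked up by its key.
        return self.read_fn(key), self.labels[key]
```

Because the class only defines `__len__` and `__getitem__`, it follows the map-style dataset protocol; in real use it would likely subclass `torch.utils.data.Dataset` and decode the bytes into an image before returning.
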
ydaiming commented 2 years ago

@johnbensnyder Thanks for helping on this issue! @rkoo19

We're upstreaming the amazon-s3-plugin-for-pytorch into the torchdata package (https://github.com/pytorch/data/pull/318). We're dropping support for this plugin.

The current S3 plugin doesn't have this feature, and neither do the new S3 IO datapipes. We'll backlog this feature request and add the feature to the new S3 IO datapipes.
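
In the meantime, one possible workaround outside the plugin is to read the object with boto3, whose `get_object` response carries the user-defined metadata (the `x-amz-meta-*` headers) under the `Metadata` key alongside the `Body` stream. This is only a sketch under that assumption: the helper name `read_object_with_metadata` is hypothetical, and `client` is assumed to be something like `boto3.client("s3")`.

```python
from typing import Any, Dict, Tuple


def read_object_with_metadata(
    client: Any, bucket: str, key: str
) -> Tuple[bytes, Dict[str, str]]:
    """Return an object's bytes and its user-defined metadata from one GetObject call."""
    # `client` is assumed to expose boto3's S3 get_object(Bucket=..., Key=...).
    response = client.get_object(Bucket=bucket, Key=key)
    data = response["Body"].read()            # full object contents as bytes
    metadata = response.get("Metadata", {})   # user-defined metadata, if any
    return data, metadata
```

A custom `__getitem__` could call this helper and return the decoded image together with, say, `metadata["class"]`, assuming the objects were uploaded with that metadata key.
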