chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0
13.36k stars 1.14k forks source link

[Feature Request]: ImageLoader should also load images from remote URLs, not just local URIs #2401

Open ahron1 opened 1 week ago

ahron1 commented 1 week ago

Describe the problem

Currently, the ImageLoader data loader loads only locally stored images.

For loading remote images, one needs to write a wrapper to fetch the images first. It will be very helpful ImageLoader would also directly load remote images over HTTP/S.

Describe the proposed solution

It is an easy fix. Just add an extra block to the _load_image definition in ImageLoader, to check if it is a remote or local URI. If it is a HTTP/S URI, use PIL and BytesIO to fetch the image before converting it to a NumPy array.

I already have the code working locally. Happy to submit a PR if people are open to the idea.

Alternatives considered

No response

Importance

would make my life easier

Additional Information

No response

jeffchuber commented 5 days ago

@ahron1 thanks for suggesting this!

@atroyn what do you think here?

jeffchuber commented 5 days ago

@ahron1 looking at the PR you opened - and reviewing the multimodal docs - I think that this functionality should be in a new loader class... something like ImageUrLoader. I'm not seeing the case where someone would want to pass both remote and local filesystem requests and the magic of supporting both could lead to confusion/bugs.

ahron1 commented 4 days ago

Thanks for checking!

Indeed, it is better to split it off. With @tazarov 's idea about adding in authentication, it becomes a standalone remote image loader.

atroyn commented 4 days ago

I agree that this ought to be standalone, though the counter-argument is that other data loaders, e.g. HuggingFace datasets and similar support both kinds of loading. I think in our case however, we do want these to be separate, since they're not about downloading data, but getting data into Chroma.

ahron1 commented 3 days ago

HuggingFace was why I originally thought of handling both URIs and URLs together. But indeed, here it should be standalone. I just pushed the changes.