NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

New Apache Cassandra plugin for NVIDIA DALI #4218

Open fversaci opened 2 years ago

fversaci commented 2 years ago

Hi everyone,

We have just published a plugin for DALI that reads raw image files (as well as labels) from a Cassandra NoSQL database. It works as a replacement for, e.g., the standard fn.readers.file.

You can find the full repo, including a Dockerfile for easy testing, complete with a Cassandra server and a test dataset, at this URL:

https://github.com/crs4/cassandra-dali-plugin

The plugin employs internal multibuffering, which hides high network latency (up to tens of milliseconds) while still sustaining high throughput.
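To illustrate the general idea of multibuffering (this is not the plugin's actual code), here is a minimal Python sketch. It assumes a hypothetical `fetch_batch` function that simulates a 20 ms network round trip, and keeps up to `depth` batch requests in flight so that the consumer rarely waits for the full latency of any single request. The names `fetch_batch`, `prefetching_batches` and `depth` are invented for this example.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
import time


def fetch_batch(batch_id, latency_s=0.02):
    """Hypothetical stand-in for one network round trip to a remote database."""
    time.sleep(latency_s)  # simulated network latency
    return [f"sample-{batch_id}-{i}" for i in range(4)]


def prefetching_batches(num_batches, depth=4):
    """Yield batches in order while up to `depth` requests are in flight."""
    with ThreadPoolExecutor(max_workers=depth) as pool:
        in_flight = deque()
        next_id = 0
        # Fill all buffers before the consumer asks for the first batch.
        while next_id < min(depth, num_batches):
            in_flight.append(pool.submit(fetch_batch, next_id))
            next_id += 1
        while in_flight:
            # Wait only for the oldest request; newer ones keep running.
            batch = in_flight.popleft().result()
            # Refill the freed buffer so `depth` requests stay outstanding.
            if next_id < num_batches:
                in_flight.append(pool.submit(fetch_batch, next_id))
                next_id += 1
            yield batch


if __name__ == "__main__":
    for batch in prefetching_batches(num_batches=6, depth=3):
        print(batch[0])
```

With `depth` requests outstanding, the steady-state throughput is bounded by roughly `depth / latency` batches per second rather than `1 / latency`, which is why a deeper buffer can compensate for a slower network.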

For performance reasons, we wanted fine-grained control over the communication with the Cassandra server (e.g., to issue requests in parallel for all samples in a batch), so we chose to extend the Operator class instead of using the Reader/Loader pair.
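As a rough sketch of what "issuing requests in parallel for all samples in a batch" means, the Python snippet below fans out one request per sample and collects the results in the original order. It is only an illustration under stated assumptions: `fetch_sample` is a hypothetical placeholder for a single database lookup, and a thread pool stands in for whatever asynchronous mechanism the plugin really uses.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fetch_sample(sample_id):
    """Hypothetical single lookup returning (raw image bytes, label)."""
    time.sleep(0.01)  # simulated per-request latency
    return (f"raw-bytes-{sample_id}".encode(), sample_id % 10)


def fetch_batch_parallel(sample_ids, max_workers=16):
    """Issue all requests of a batch concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of `sample_ids`,
        # regardless of which request completes first.
        return list(pool.map(fetch_sample, sample_ids))


if __name__ == "__main__":
    batch = fetch_batch_parallel(range(8))
    print(len(batch), batch[3])
```

When the requests overlap, the wall-clock cost of a batch is close to the latency of a single request rather than the sum over all samples.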

The project is new and still needs to be thoroughly tested, but most of its core features (including support for SSL communications) are complete and easily usable.

If you're interested in trying out the plugin, just clone the repo, build the included Dockerfile and follow the instructions in the annotated example. We would be glad to receive comments and suggestions.

Some background information on using a Cassandra DB to feed data to an ML pipeline (in a different context) can be found here:

https://ieeexplore.ieee.org/document/9672005 (PDF)

Thanks for your attention!

JanuszL commented 2 years ago

Hi @fversaci,

Thank you for your work, it looks very promising. It is fine not to use the Reader/Loader pair. They mostly help with shuffling and with delegating the data reading (the IO itself) to a separate thread. Your approach is valid as well. Have you considered making it a part of DALI and filing a PR, or do you prefer to keep it separate since the target audience could be limited?

fversaci commented 2 years ago

Hi @JanuszL

Thanks for your interest!

We think that this approach is generally applicable and has the potential to be used in many contexts: once the DB is set up, it is very convenient to fetch the data wherever and whenever it is needed across the network, without having to move it or to configure a network/parallel filesystem.

We would be happy to integrate it as a part of DALI, if you think it may fit. To this end, we would ask you to have a look at the DALI-ish part of the code to check that it follows your standards, since we wrote it mostly by reading and reverse-engineering the existing code.

Let us know if you have feedback and, in case you're interested in a PR, how we should proceed to integrate it better into DALI.

Cheers

JanuszL commented 2 years ago

Hi @fversaci,

Thank you for your prompt response.

I fully support your use case and it is really nice to see that DALI gets integrated with more and more solutions. Let me check with the team how well this particular reader is aligned with the DALI scope. After that, we can discuss how to proceed further. Probably the best option would be to open a PR against DALI and let us review it (it is easier to submit comments in a PR than in this issue, where they are disconnected from the code).