NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0

Semdedup #118

Open avem-nv opened 1 week ago

avem-nv commented 1 week ago

Description

This PR implements the semantic deduplication (SemDeDup) module introduced by Meta AI, scaled using RAPIDS primitives.
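For readers unfamiliar with the technique, here is a minimal pure-Python sketch of the core idea, not the code in this PR: within each cluster of embeddings, an item is dropped when its cosine similarity to any earlier item in the same cluster exceeds `1 - eps`. The function names `cosine` and `semantic_dedup`, the `eps` default, and the precomputed `cluster_ids` input are assumptions made for illustration; the real module would obtain clusters from k-means and run on GPU, and the ordering of items within a cluster is simplified here to input order.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two non-zero vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_dedup(embeddings, cluster_ids, eps=0.1):
    """Return sorted indices of items kept after within-cluster dedup.

    Assumes cluster_ids[i] is a precomputed cluster label for embeddings[i]
    (for example from k-means) and that no embedding is the zero vector.
    """
    # Group item indices by cluster, preserving input order.
    clusters = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        clusters[cid].append(idx)

    keep = []
    for members in clusters.values():
        for pos, i in enumerate(members):
            # Highest similarity to any earlier item in the same cluster;
            # the first item has no predecessors and is always kept.
            max_sim = max(
                (cosine(embeddings[i], embeddings[j]) for j in members[:pos]),
                default=-1.0,
            )
            if max_sim <= 1.0 - eps:
                keep.append(i)
    return sorted(keep)


embeddings = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0], [1.0, 0.0]]
cluster_ids = [0, 0, 0, 1]
kept = semantic_dedup(embeddings, cluster_ids, eps=0.1)
```

In this toy input, item 1 is nearly parallel to item 0 in the same cluster and is dropped, while item 3 survives because it sits in a different cluster. Restricting comparisons to within a cluster is what keeps the pairwise step tractable at scale.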

Usage

# Add snippet demonstrating usage

Checklist