data-deduplication Search Results

mlfoundations/dclm #71

deduplication removes 98% of my data

Hi authors, I'm using `dedup/bff` to run deduplication on my data. I split my data into 512 jsonl files, each containing ~170000 docs. The size of my data is about ~500G. I ran the following command: …

Yu-Shi updated 4 days ago

lbr38/repomanager #194

Feature Request: Add deduplication support for storage manag…

Hi again! A really important feature that should be added if you want to make a big step forward with the project is **data deduplication**. For instance, it can allow 2 snapshots of the same repo co…

Cloud-Kid updated 1 week ago

red-hat-storage/ocs-ci #9838

Increase retry count for "test_mcg_data_deduplication"

Increase retry count for "test_mcg_data_deduplication" RP link: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/465/21100/1007975/1008084/log

udaysk23 updated 3 weeks ago

scylladb/scylladb #20459

Proposal: keep original sstable uuid generation in scylla me…

We use the sstable generation as part of the backup deduplication algorithm. But currently, when a tablet is migrated across shards in the same node, we "rename" it by creating hard-links to a new ge…

bhalevy updated 1 day ago

jacksonllee/pycantonese #50

Undocumented differences between the HKCanCor corpus on Hugg…

The version of HKCanCor published on [HuggingFace](https://huggingface.co/datasets/nanyang-technological-university-singapore/hkcancor/tree/main) by NTU is different from the version offered by this l…

AlienKevin updated 1 month ago

restatedev/restate #1873

Cache frequently accessed structures in `PartitionStore`

To avoid the serialization cost when accessing data stored in the `PartitionStore`, we could add an LRU cache for frequently accessed data. Prime candidates could be the deduplication table and the `I…

tillrohrmann updated 2 weeks ago

thanos-io/thanos #7656

Thanos Query: gaps in deduplicated data

**Thanos, Prometheus and Golang version used:** thanos, version 0.35.1 (branch: HEAD, revision: 086a698b2195adb6f3463ebbd032e780f39d2050) build user: root@be0f036fd8fa build date: 2…

ppietka-bp updated 1 week ago

dotnet/machinelearning #6700

Feature request: Data deduplication and Near-deduplication

In the current Microsoft.ML, developers may need to reduce the size of huge datasets (#6679), or at least it might be advisable to do so. For many problems and algorithms, hyperparameter tuning is important and…

torronen updated 7 months ago

superfly/litefs #313

Data Deduplication

Following up on https://github.com/superfly/litefs/issues/105, would it be possible to eventually add some kind of data deduplication if a lot of the tables are the same across multiple DBs with only user…

darthShadow updated 1 year ago

PermafrostDiscoveryGateway/viz-staging #55

Integrate option to deduplicate data before tiling

Currently, deduplication in the visualization workflow starts _after_ the input data has been staged and tiled. If deduplication is set to occur at any step in the workflow (staging, rasterization, an…

julietcohen updated 4 weeks ago

1000+ results for data-deduplication
