-
Hi authors, I'm using `dedup/bff` to run deduplication on my data. I split my data into 512 JSONL files, each containing ~170,000 docs. The total size of my data is about 500 GB. I ran the following command:
…
-
Hi again!
A really important feature that should be added if you want to take a big step forward with the project is **data deduplication**. For instance, it can allow 2 snapshots of the same repo co…
-
Increase retry count for "test_mcg_data_deduplication"
RP link: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/465/21100/1007975/1008084/log
-
We use the sstable generation as part of the backup deduplication algorithm.
But currently, when a tablet is migrated across shards in the same node, we "rename" it by creating hard-links to a new ge…
-
The version of HKCanCor published on [HuggingFace](https://huggingface.co/datasets/nanyang-technological-university-singapore/hkcancor/tree/main) by NTU is different from the version offered by this l…
-
To avoid the serialization cost when accessing data stored in the `PartitionStore`, we could add an LRU cache for frequently accessed data. Prime candidates could be the deduplication table and the `I…
-
**Thanos, Prometheus and Golang version used:**
thanos, version 0.35.1 (branch: HEAD, revision: 086a698b2195adb6f3463ebbd032e780f39d2050)
build user: root@be0f036fd8fa
build date: 2…
-
With the current Microsoft.ML, developers may need to reduce the size of huge datasets (#6679), or at least it may be advisable to do so: for many problems and algorithms, hyperparameter tuning is important and…
-
Following up on https://github.com/superfly/litefs/issues/105, would it be possible to eventually add some kind of data deduplication if a lot of the tables are the same across multiple DBs with only user…
-
Currently, deduplication in the visualization workflow starts _after_ the input data has been staged and tiled. If deduplication is set to occur at any step in the workflow (staging, rasterization, an…