-
**Description**
We want to collect all the open-source Tibetan word-segmented data and save it in a standard format.
The format should be:
```
[
{
'source': 'བོད་ཀྱི་གླུ་གར་རོལ་དབྱངས་ལ་གཞི་རྩའི་ཐོག་ནས་དབྱ…
```
-
Hi everyone,
I've created this issue under `rsroll` because my initial idea was to introduce the authors of three different software packages focused on rolling sums / CDC, and `rsroll` is the one that `rdedu…
-
I have a few hundred datasets with the following structure (which I cannot change):
* full_dataset/
* train_dataset/
* test_dataset/
* extra_files
Currently I am creating for dat repositories for…
mitar updated
6 years ago
-
When setting up the cluster environment, I want to run a deduplication task for a large data set (1T, stored locally), but how should I load the data? Should I put all the data on the supervisor node …
-
## Use Case
I am trying to use the Puppet LVM module to create VDO volumes on LVM, but it lacks a way to pass the necessary parameters to `lvcreate`. I am not able to add --virtualsize an…
-
### Is your feature request related to a problem? Please describe
Streaming aggregation is a great alternative to recording rules especially for data-intensive applications recording rules can't pr…
-
1) compiled with ` -Wall -O2 -fsanitize=address,undefined -fno-omit-frame-pointer -g3 -march=native -flto`
2) `UBSAN_OPTIONS=print_stacktrace=1 ./duperemove ../wesnoth -rdh --dedupe-options=partia…
-
**Describe the bug**
When attempting to run fuzzy deduplication on a dataset that has no duplicates, the code errors out.
**Steps/Code to reproduce bug**
1) Clone the repo
2) Run the TinySto…
-
A wrapper around `get_all_polio_data()` which takes in a single parameter: `local_dataset`.
The function can be called `update_polio_data` and will download the `small` polio dataset and merge in …
-
When you do the provider verification, the deduplication mechanism uses the full contract content as the basis for deduplication.
This means that if the example data in the contract is different betw…