bigscience-workshop/data_tooling
Tools for managing datasets for governance and training.
Apache License 2.0
74 stars
48 forks
issues
Update train_all.sh
#419
chris-ha458
opened
11 months ago
0
[pre-commit.ci] pre-commit autoupdate
#418
pre-commit-ci[bot]
opened
2 years ago
0
docstring not valid anymore
#417
HugoLaurencon
closed
2 years ago
0
Reason for not applying remove_non_printing_characters normalization
#416
JoeyOhman
opened
2 years ago
1
[pre-commit.ci] pre-commit autoupdate
#415
pre-commit-ci[bot]
closed
2 years ago
0
Citing this resource
#414
yuvalkirstain
opened
2 years ago
4
add html lang detector
#413
SaulLu
closed
2 years ago
0
Improved Deduplication
#412
ChenghaoMou
closed
2 years ago
2
[pre-commit.ci] pre-commit autoupdate
#411
pre-commit-ci[bot]
closed
2 years ago
0
[pre-commit.ci] pre-commit autoupdate
#410
pre-commit-ci[bot]
closed
2 years ago
0
Dedup exact lines training tokenizer dataset
#409
SaulLu
closed
2 years ago
0
Apply cleaning: remove deduplicated exact lines
#408
SaulLu
closed
2 years ago
1
Add exact document deduplicate scripts
#407
SaulLu
closed
2 years ago
0
Updated Anonymization
#406
ianyu93
closed
2 years ago
0
[pre-commit.ci] pre-commit autoupdate
#405
pre-commit-ci[bot]
closed
2 years ago
0
Updated apply_regex_anonymization
#404
ianyu93
closed
2 years ago
0
Flagged words in Bengali
#403
manandey
closed
2 years ago
1
Updated ac_dc
#402
ianyu93
closed
2 years ago
1
Update flagged_words.py
#401
majauhar
closed
2 years ago
1
Replacing pii_processing with muliwai
#400
huu4ontocord
closed
2 years ago
0
Update spanish flagged words
#399
edugp
closed
2 years ago
1
[pre-commit.ci] pre-commit autoupdate
#398
pre-commit-ci[bot]
closed
2 years ago
0
Pseudo crawl 2
#397
thomasw21
closed
2 years ago
0
Update Spanish flag words
#396
omarespejel
closed
2 years ago
1
Final update for Indonesian list of flagged words
#395
jtboing
closed
2 years ago
1
Update flagged_words.py for Catalan
#394
onadegibert
closed
2 years ago
3
Update flagged_words.py for Portuguese
#393
ruinunca
closed
2 years ago
1
further updates to flagged_words.py for indonesian
#392
jtboing
closed
2 years ago
1
Remove some characters from Chinese bad word list
#391
JetRunner
closed
2 years ago
4
Update flagged_words.py for Indonesian
#390
afaji
closed
2 years ago
5
organize repo for language annotation
#389
SaulLu
closed
2 years ago
0
add skeleton of slurm files for seeds batch 2
#388
SaulLu
closed
2 years ago
1
move dataset folder seeds batch 1
#387
SaulLu
closed
2 years ago
0
Refactor of the pseudo_crawl folder
#386
SaulLu
closed
2 years ago
0
[WIP] add slurm and python files to extend the pseudo crawl dataset with the seeds of batch 2
#385
SaulLu
closed
2 years ago
1
Reorganize folder to host batch 2 of seeds to retrieve for the pseudo crawl dataset
#384
SaulLu
closed
2 years ago
0
Pseudo crawl dataset creation
#383
thomasw21
closed
2 years ago
0
add files to compute basic stats on pseudo crawl dataset
#382
SaulLu
opened
2 years ago
3
[pre-commit.ci] pre-commit autoupdate
#381
pre-commit-ci[bot]
closed
2 years ago
0
Updated flagged_words.py
#380
abumafrim
closed
2 years ago
1
Update flagged words for vietnamese
#379
Luvata
closed
2 years ago
1
Create license-compliant version of the Pile: EuroParl
#378
albertvillanova
closed
2 years ago
1
add push to hub slurm script
#377
SaulLu
closed
2 years ago
0
Create license-compliant version of the Pile: Stack Exchange
#376
albertvillanova
closed
2 years ago
1
Empty Basque flagged words.
#375
asoroa
closed
2 years ago
3
Update flagged_words.py
#374
majauhar
closed
2 years ago
5
Update flagged_words.py
#373
majauhar
closed
2 years ago
1
Checked existing words + Added more words
#372
hbenyamina
closed
2 years ago
2
Adding initial set of flagged words in Tamil
#371
reshinthadithyan
closed
2 years ago
1
add scripts to extract text and HTML metadata from the column `html_str`
#370
SaulLu
closed
2 years ago
2