bigscience-workshop/catalogue_data
Scripts to prepare catalogue data
Apache License 2.0
8 stars
1 fork
issues
fix issue with config streamlit app
#67
SaulLu
closed
2 years ago
0
add a streamlit app to show PII logs
#66
SaulLu
closed
2 years ago
0
Multiprocessing with datasets in jsonl format
#65
HugoLaurencon
opened
2 years ago
0
Execute pii on the whole oscar dataset
#64
SaulLu
closed
2 years ago
0
[WIP] add multiprocessing for pii
#63
SaulLu
closed
2 years ago
0
Add streamlit viewer app
#62
SaulLu
closed
2 years ago
1
fixed typo in clean.py
#61
TevenLeScao
closed
2 years ago
2
Making sure that things are sorted
#60
thomasw21
closed
2 years ago
0
Concatenate ester dataset
#59
SaulLu
closed
2 years ago
0
Generalise deduplicate pattern.
#58
thomasw21
closed
2 years ago
0
new way to simplify dedup url
#57
SaulLu
closed
2 years ago
2
Make new experiment concerning filtering
#56
thomasw21
closed
2 years ago
1
Replace filter with map
#55
thomasw21
closed
2 years ago
0
Fix vi sent tokenizer
#54
lvwerra
closed
2 years ago
1
Fix stanza num proc dirty
#53
thomasw21
closed
2 years ago
0
Fix stanza num proc
#52
thomasw21
closed
2 years ago
1
remove whitespace before checking for emptyness
#51
lvwerra
closed
2 years ago
0
Generalise deduplication function
#50
thomasw21
opened
2 years ago
0
add sentence splitter functions
#49
lvwerra
closed
2 years ago
2
Update preprocessing key to use the new value from the google sheet
#48
thomasw21
closed
2 years ago
0
Add documentation.
#47
thomasw21
closed
2 years ago
0
Code doesn't need to run deduplication script
#46
thomasw21
closed
2 years ago
3
Remove unecessary deduplication
#45
thomasw21
closed
2 years ago
0
Add script to generate the columns for deduplication and short filter document
#44
thomasw21
closed
2 years ago
0
change way to compute the size of the text
#43
SaulLu
closed
2 years ago
3
Make scripts robust to meta format
#42
thomasw21
closed
2 years ago
0
Add deduplication script
#41
thomasw21
closed
2 years ago
0
Make substring stripper regex faster
#40
thomasw21
closed
2 years ago
0
Fix to accurate logging
#39
thomasw21
closed
2 years ago
0
Non-Wikipedia Wikis Dedup script
#38
cakiki
closed
2 years ago
1
Accurate size modification logging
#37
TevenLeScao
closed
2 years ago
0
Add deduplication on url level
#36
thomasw21
closed
2 years ago
2
Short document filter in byte
#35
thomasw21
closed
2 years ago
0
Compile regex
#34
thomasw21
closed
2 years ago
0
remove whitespace, numbers and punctuation before hashing
#33
lvwerra
closed
2 years ago
0
Remove short lines
#32
thomasw21
opened
2 years ago
0
add more line filters
#31
lvwerra
closed
2 years ago
1
Add substring remover mapper
#30
cakiki
closed
2 years ago
1
Let's save json when we need to
#29
thomasw21
closed
2 years ago
0
Opentiti fix
#28
lvwerra
closed
2 years ago
2
add "[if" and "<script" to list of excluded lines
#27
lvwerra
closed
2 years ago
0
Deduplication document
#26
thomasw21
closed
2 years ago
0
Use MD5 to obtain persistent hash
#25
thomasw21
closed
2 years ago
0
Test for wikis filters
#24
SaulLu
closed
2 years ago
1
Remove excessive duplicates
#23
thomasw21
closed
2 years ago
2
Curly fix
#22
lvwerra
closed
2 years ago
0
Slurm script
#21
thomasw21
closed
2 years ago
0
Allow deduplication scripts to be added to the preprocessing script
#20
thomasw21
closed
2 years ago
1
Add feature to see the modified examples by a map operation
#19
SaulLu
closed
2 years ago
2
Allow no maps or filters
#18
thomasw21
closed
2 years ago
2