bigscience-workshop/catalogue_data
Scripts to prepare catalogue data
Apache License 2.0
8 stars, 1 fork
issues
#17: filter everything (HugoLaurencon, closed, 2 years ago, 2 comments)
#16: Replace \n with empty space (thomasw21, closed, 2 years ago, 0 comments)
#15: feature to look at filter out examples (SaulLu, closed, 2 years ago, 0 comments)
#14: add filter for small docs in datasets (HugoLaurencon, closed, 2 years ago, 2 comments)
#13: Add User filter on wiki page (SaulLu, closed, 2 years ago, 0 comments)
#12: Add wiki filter on "type" meta field (SaulLu, closed, 2 years ago, 4 comments)
#11: S2ORC vs Arxiv vs PMC (lvwerra, opened, 2 years ago, 6 comments)
#10: Setup training. (thomasw21, closed, 2 years ago, 1 comment)
#9: Catching crawling noise + ads (TevenLeScao, opened, 2 years ago, 6 comments)
#8: Removing dataset lm_en_a_million_news_headlines_abc_australia (HugoLaurencon, opened, 2 years ago, 2 comments)
#7: First draft of generic cleaning script (thomasw21, closed, 2 years ago, 2 comments)
#6: Repeated lines across examples (TevenLeScao, opened, 2 years ago, 3 comments)
#5: Wiki-based dataset cleaning (TevenLeScao, opened, 2 years ago, 7 comments)
#4: Roman/download tokenizer data (RomanCast, closed, 2 years ago, 1 comment)
#3: update metadata analysis (lvwerra, closed, 2 years ago, 0 comments)
#2: Allow to use the script to concatenate with pseudo crawled data (thomasw21, closed, 2 years ago, 3 comments)
#1: Metadata analysis (lvwerra, closed, 2 years ago, 1 comment)