issues
bigscience-workshop/metadata
Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
Apache License 2.0
30 stars
12 forks
Prefix special tokens
#198
jordiclive
opened
1 year ago
0
feat: create a merged vld set of designated metadata
#197
tianjianjiang
closed
1 year ago
0
Train june
#196
jordiclive
opened
1 year ago
0
feat: enable generation_length_text
#195
tianjianjiang
closed
1 year ago
0
feat: add local special tokens for HTML
#194
tianjianjiang
closed
1 year ago
0
fix: avoid the situation of inf loss * weight → nan
#193
tianjianjiang
closed
1 year ago
0
Eval loop
#192
jordiclive
opened
1 year ago
0
Update tokenizer in evaluation
#191
manandey
closed
1 year ago
0
Set logits of entity related special tokens to -infinity
#190
manandey
closed
1 year ago
0
Configurable html sample rate, and without metadata same context option.
#189
jordiclive
closed
1 year ago
1
Update metadata_utils.py
#188
jordiclive
opened
1 year ago
0
Update metadata_utils.py
#187
jordiclive
opened
1 year ago
0
fix: args and defaults
#186
tianjianjiang
closed
1 year ago
0
feat: update eval script
#185
tianjianjiang
closed
1 year ago
0
fix issue with embedding size too small
#184
jordiclive
closed
1 year ago
0
feat: v2.yaml for the resampled training set
#183
tianjianjiang
closed
1 year ago
0
Patch 3
#182
tianjianjiang
closed
1 year ago
0
Add prompting baseline eval
#181
ppommer
closed
1 year ago
0
Fix stuff and add new readme
#180
cccntu
closed
1 year ago
0
Changes for eval
#179
Muennighoff
closed
1 year ago
0
Add loss plotting
#178
Muennighoff
closed
1 year ago
0
debug code (WIP)
#177
cccntu
opened
1 year ago
0
Update train.py
#176
Muennighoff
closed
1 year ago
0
Fix mask bug
#175
cccntu
closed
1 year ago
0
Fix code quality tests
#174
manandey
closed
1 year ago
0
Add CM3 loss
#173
masoudjs
opened
1 year ago
2
test evaluation script
#172
cccntu
closed
1 year ago
0
evaluation script debugging
#171
cccntu
closed
1 year ago
0
ci: pin Python, Ubuntu, & GH Action versions
#170
tianjianjiang
closed
1 year ago
1
Fix eval script
#169
ppommer
closed
1 year ago
2
Minor updates in evaluation script
#168
manandey
closed
2 years ago
0
fix: ms-timestamp conversion
#167
tianjianjiang
closed
2 years ago
0
build: upgrade transformers for a consistent version of huggingface_hub
#166
tianjianjiang
closed
2 years ago
0
Add evaluation pipeline
#165
ppommer
closed
2 years ago
4
Updates
#164
cccntu
closed
1 year ago
2
Fix file list
#163
cccntu
closed
2 years ago
0
Refactor a util function
#162
cccntu
closed
2 years ago
0
add special tokens for entities
#161
manandey
closed
2 years ago
1
feat: log the average of the loss rather than the value of the main process
#160
SaulLu
closed
2 years ago
0
fix: timestamp precision
#159
tianjianjiang
closed
2 years ago
0
Fix streaming mode
#158
cccntu
closed
2 years ago
0
add new configs for entity_paragraph
#157
manandey
closed
2 years ago
1
new configuration for HTML tags
#156
SaulLu
closed
2 years ago
0
Finalizing for big training
#155
cccntu
closed
2 years ago
2
Revert "Filter examples by `num_chars` to include in a batch (#137)"
#154
manandey
closed
2 years ago
0
Add separate EntityParagraph processor
#153
manandey
closed
2 years ago
0
Adapt code to use new data format
#152
cccntu
closed
2 years ago
3
feat: clean up website desc.
#151
tianjianjiang
closed
2 years ago
0
feat: tag clean website desc., entity paragraph, and title
#150
tianjianjiang
closed
2 years ago
1
feat: add paragraph-entity metadata
#149
tianjianjiang
closed
2 years ago
0