bigscience-workshop/metadata
Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
Apache License 2.0
30 stars
12 forks
issues
feat: add title
#148
tianjianjiang
closed
2 years ago
1
Post processing website desc
#147
shanyas10
closed
2 years ago
1
add example that builds a dataset
#146
SaulLu
closed
2 years ago
0
feat: add paragraphs
#145
tianjianjiang
closed
2 years ago
0
Entity at paragraph level
#144
manandey
closed
2 years ago
0
Additional changes to test the entities extraction
#143
SaulLu
closed
2 years ago
0
[WIP] entities extraction tentative 2
#142
SaulLu
closed
2 years ago
0
Create evaluation_utils.py
#141
shanyas10
closed
2 years ago
0
test gpt2-xl
#140
cccntu
closed
2 years ago
3
build: sync setup.py defined dependencies and fix broken ones
#139
tianjianjiang
closed
2 years ago
1
Common simple eval function to calculate ppl
#138
shanyas10
opened
2 years ago
0
Filter examples by `num_chars` to include in a batch
#137
manandey
closed
2 years ago
7
Fix accelerate not using multi-GPU
#136
cccntu
closed
2 years ago
1
Update add_metadata.py
#135
manandey
closed
2 years ago
1
`Title` preprocessor
#134
manandey
closed
2 years ago
0
Add `title` metadata processor
#133
manandey
closed
2 years ago
0
build: pin datasets to 1.17.0
#132
tianjianjiang
closed
2 years ago
0
Add script to convert the dataset in compressed jsonlines files
#131
SaulLu
closed
2 years ago
0
build: bump nltk to 3.6.7 for security and performance
#130
tianjianjiang
closed
2 years ago
0
multi gpu training
#129
norakassner
opened
2 years ago
0
feat: mark paragraphs by metadata-html #125
#128
tianjianjiang
closed
2 years ago
0
Add filters to `HtmlProcessor`
#127
SaulLu
closed
2 years ago
1
Remove entity description
#126
manandey
closed
2 years ago
1
feat: HTML scanner for text content & content sectioning elements → segment paragraphs
#125
tianjianjiang
closed
2 years ago
0
Create Dataset with metadata
#124
SaulLu
opened
2 years ago
0
Add fp16, multi-GPU training script (toy dataset)
#123
cccntu
closed
2 years ago
0
feat: find a solution to load the dataset
#122
SaulLu
opened
2 years ago
0
bug: fix the toy joint training slurm scripts
#121
SaulLu
closed
2 years ago
0
Reach out evaluation working group: Can we draw from their pipeline?
#120
norakassner
closed
2 years ago
0
create dataset with html, timestamp, url, datasource, generation length and website description metadata and titles, footers and headers from HTML
#119
SaulLu
closed
2 years ago
0
Add features types for the metadata to extract and test multiprocessing
#118
SaulLu
closed
2 years ago
2
Use dateutil to parse date
#117
cccntu
closed
2 years ago
0
feat: add a feature to choose where to extract metadata
#116
SaulLu
closed
2 years ago
0
feat: change how the entity extraction process uses ids
#115
SaulLu
closed
2 years ago
1
How do we define a paragraph?
#114
norakassner
closed
2 years ago
1
Which HTML tags should be used during training?
#113
norakassner
opened
2 years ago
0
manage PR merge strategies
#112
norakassner
closed
2 years ago
1
Add joint training slurm script
#111
cccntu
closed
2 years ago
0
Which HTML tags should be used during training?
#110
norakassner
closed
2 years ago
1
slurm script test multi-gpu training
#109
norakassner
opened
2 years ago
0
Evaluation bias
#108
norakassner
opened
2 years ago
1
Revert "update: generation length and datasource"
#107
SaulLu
closed
2 years ago
0
add `path_or_url_flair_ner_model` in order to execute the entity extraction on a partition without internet
#106
SaulLu
closed
2 years ago
0
Remove the vendor folder
#105
SaulLu
closed
2 years ago
1
change: retrieve the url from a column instead of from the metadata
#104
SaulLu
closed
2 years ago
0
Add code to sample multiple metadata
#103
cccntu
closed
2 years ago
1
Handle IndexError in `WikipediaDescUtils` for wikitext
#102
SaulLu
closed
2 years ago
3
Error `IndexError: list index out of range` while testing the entities extraction
#101
SaulLu
closed
2 years ago
0
Discuss style evaluation for website description and data source with Anna
#100
norakassner
opened
2 years ago
0
Evaluation toxicity for website description and data source
#99
norakassner
opened
2 years ago
1