bigscience-workshop/metadata
Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
Apache License 2.0
30 stars
12 forks
issues
data analysis: website description (quality and yield)
#98
norakassner
opened
2 years ago
0
Start joint training
#97
norakassner
opened
2 years ago
0
eval hyperparameters: occupied tokens
#96
norakassner
opened
2 years ago
0
entity tagging speedup
#95
norakassner
closed
2 years ago
1
estimate amount of data
#94
norakassner
opened
2 years ago
0
eval hyperparameters: amount of metadata
#93
norakassner
opened
2 years ago
0
method to sample global metadata
#92
norakassner
opened
2 years ago
0
method to sample local metadata
#91
norakassner
opened
2 years ago
0
explore hyperparameters:
#90
norakassner
opened
2 years ago
0
simple zero-shot eval function: time stamps
#89
norakassner
opened
2 years ago
0
simple zero-shot eval function: website description
#88
norakassner
opened
2 years ago
1
simple zero-shot eval function: datasource
#87
norakassner
opened
2 years ago
0
simple zero-shot eval function: entity tags
#86
norakassner
opened
2 years ago
0
simple zero-shot eval function: HTML tags
#85
norakassner
opened
2 years ago
0
simple zero-shot eval function: generation length
#84
norakassner
opened
2 years ago
1
Handle the comment specific type not recognized by pyarrow
#83
SaulLu
closed
2 years ago
1
Change torch version + make it optional
#82
SaulLu
closed
2 years ago
0
update: generation length and datasource
#81
chkla
closed
2 years ago
0
Update entity-tags preprocessing code to speed up the process
#80
manandey
closed
2 years ago
0
question: Error raised by entities extractor
#79
SaulLu
closed
2 years ago
2
Improve: Entity extraction speed
#78
SaulLu
closed
2 years ago
1
fix: reduce try/except scope into `fetch_wikipedia_description_for_title`
#77
SaulLu
closed
2 years ago
0
[WIP] feat: create `c4/en` toy dataset with `entity`, `website_description` and `timestamp` metadata
#76
SaulLu
closed
2 years ago
0
add entity description while preprocessing entities
#75
manandey
closed
2 years ago
0
feat: add `dumpdb` toy script
#74
SaulLu
closed
3 years ago
1
fix: change `bs_dateuil` github dependency in `setup.py`
#73
SaulLu
closed
3 years ago
0
feat: add checkpoints conversion and pushing to the hub script
#72
SaulLu
closed
3 years ago
1
feat: refine dependency imports for preprocessing
#71
SaulLu
opened
3 years ago
0
feat: force redownload of the dataset on jz + rename entity jobs
#70
SaulLu
closed
3 years ago
0
adding scripts for testing cleaner dataset
#69
shanyas10
closed
3 years ago
0
Include entity description
#68
manandey
closed
2 years ago
0
feat: save the model and stop training based on `exit-duration-in-mins`
#67
SaulLu
opened
3 years ago
2
Conflict in torch version
#66
manandey
closed
2 years ago
2
Replace dateutil submodule with renamed package
#65
cccntu
closed
3 years ago
2
Fix wandb config
#64
cccntu
closed
3 years ago
0
Rename dateutil submodule with another name
#63
cccntu
closed
3 years ago
2
feat: support HF dataset loader and streaming mode for bs-modeling-metadata/openwebtext-html-cc
#62
tianjianjiang
closed
2 years ago
1
fix: clean up OpenWebText URL fragments, duplicates, variants, and malformed ones
#61
tianjianjiang
closed
2 years ago
0
perf: reduce file sizes of openwebtext-html-cc
#60
tianjianjiang
closed
3 years ago
1
perf: separate C4 or CommonCrawl URLs from OpenWebText URLs
#59
tianjianjiang
closed
2 years ago
1
perf: filter 404 URLs out by Common Crawl's cluster.idx
#58
tianjianjiang
closed
3 years ago
4
feat: add html only experiments
#57
SaulLu
opened
3 years ago
0
Datasource and GenerationLength
#56
chkla
closed
2 years ago
1
Preprocess Entity Tags
#55
manandey
closed
3 years ago
1
feat: add URL tokenizer(s) for data source, time stamp, and website description
#54
tianjianjiang
closed
2 years ago
1
feat: add html preprocessor and html parser
#53
SaulLu
closed
3 years ago
0
preprocess entity tags
#52
manandey
closed
3 years ago
1
feat: get OpenWebText from Common Crawl
#51
tianjianjiang
closed
2 years ago
0
Add timestamp processor
#50
cccntu
closed
3 years ago
2
Adding WebsiteMetadataProcessor to preprocessing_utils
#49
shanyas10
closed
3 years ago
2