issues
centre-for-humanities-computing/danish-foundation-models
A project for training foundational Danish language models
https://foundationmodels.dk
MIT License
68 stars
4 forks
Add Danish-PD dataset
#289
peterbjorgensen
opened
3 days ago
0
Add collaborators from alexandra
#288
KennethEnevoldsen
closed
2 weeks ago
1
datasheets for the four datasets from DBC D1G1TAL
#287
peter-sk
closed
2 weeks ago
2
I can't find the raw datasets
#286
TTTTao725
opened
4 weeks ago
2
fix timestamps for colossal_oscar_1_0, hplt, ncc
#285
TTTTao725
closed
4 weeks ago
0
Mc4 conversion script
#284
peterbjorgensen
closed
1 month ago
0
Add lexdk dataset
#283
peterbjorgensen
closed
1 month ago
0
Convert domsdatabasen
#282
peterbjorgensen
closed
1 month ago
0
fix typo and continue fixing timestamps for other datasets
#281
TTTTao725
closed
1 month ago
0
Add datatrove pipeline blocks
#280
peterbjorgensen
closed
3 weeks ago
0
Add dataset guide
#279
KennethEnevoldsen
closed
1 month ago
0
Fix wrong license danews2.0.md
#278
KennethEnevoldsen
closed
1 month ago
1
fix invalid datasets
#277
TTTTao725
closed
1 month ago
1
Fix invalid datasets
#276
TTTTao725
closed
1 month ago
0
Invalid Datasets
#275
TTTTao725
opened
1 month ago
14
add raw datasheets for danews2.0 memo nordjylland_news scrape_hovedst…
#274
TTTTao725
closed
1 month ago
2
Issues in sub-dagws
#273
KennethEnevoldsen
opened
1 month ago
3
modify timestamps: ongoing. I'm still generating new datasets.
#272
TTTTao725
closed
1 month ago
2
docs: Added user-friendly tutorial for qlora
#271
KennethEnevoldsen
closed
2 weeks ago
4
Update index.md
#270
elliottd
closed
1 month ago
0
added dataset validator
#269
KennethEnevoldsen
closed
1 month ago
1
Modify the script for dagw: add a fixed source:dagw, rename the origi…
#268
TTTTao725
closed
1 month ago
9
New dataset: Finansministeriets udgivelser
#267
KennethEnevoldsen
opened
2 months ago
0
Add dataset validator
#266
KennethEnevoldsen
closed
1 month ago
9
Add datasheets
#265
KennethEnevoldsen
opened
2 months ago
1
NCC loader
#264
jankounchained
closed
2 months ago
3
add scripts for swedish gigaword and scrape from hovedstaden
#263
TTTTao725
closed
2 months ago
0
fix duplicated metadata in scandi reddit
#262
peterbjorgensen
closed
2 months ago
0
Add script to download AI aktindsigt (Sønderborg kommune) data and convert from xlsx to jsonl.gz
#261
peterbjorgensen
closed
1 month ago
1
CulturaX
#260
KennethEnevoldsen
closed
1 month ago
1
add gopher tagger with scandi stop words instead of english
#259
peterbjorgensen
closed
2 months ago
0
Refactor input file glob pattern in tokenize_jsonl.py
#258
rlrs
closed
2 months ago
0
infomedia edge cases, memo nan handling
#257
TTTTao725
closed
2 months ago
0
add script to convert eur lex sum da to jsonl.gz
#256
peterbjorgensen
closed
2 months ago
0
Infomedia cleaning edge cases
#255
jankounchained
closed
2 months ago
4
convert ft speech data
#254
peterbjorgensen
closed
2 months ago
0
add dr facebook conversion script
#253
peterbjorgensen
closed
2 months ago
0
add scandi wikipedia conversion script
#252
peterbjorgensen
closed
2 months ago
0
Dolma url count
#251
peterbjorgensen
closed
2 months ago
2
Convert colossal oscar da
#250
peterbjorgensen
closed
2 months ago
0
Data config for 2024-v1 dataset
#249
peterbjorgensen
closed
2 months ago
0
Multiple runs at once & experiment tracking
#248
rlrs
opened
3 months ago
0
Add nlp_dedup
#247
rlrs
closed
4 months ago
0
Update tokenizer scripts
#246
rlrs
closed
2 months ago
0
Add scripts for converting memo
#245
TTTTao725
closed
3 months ago
15
Fix data package
#244
peterbjorgensen
closed
4 months ago
0
NCC dataset cleaning
#243
jankounchained
closed
2 months ago
1
Re-adding missing requirements for docs
#242
KennethEnevoldsen
closed
4 months ago
1
Url tagger
#241
peterbjorgensen
closed
2 months ago
2
Update llm-foundry to version with working FLOPS monitor
#240
rlrs
closed
4 months ago
0