allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

Adds dclm fasttext classifier #205

Closed undfined closed 1 month ago

undfined commented 1 month ago

Example output:

{
  "id": "http://antichoiceantiawesome.blogspot.com/2012/01/",
  "attributes": {
    "dclm_fasttext__dclm_oh_eli5__score": [
      [
        0,
        14731,
        0.02031
      ]
    ]
  },
  "source": "dclm-hero-run-fasttext_for_HF"
}
{
  "id": "http://atomphotocomp.org/2018-judge/andrea-ulbrick/",
  "attributes": {
    "dclm_fasttext__dclm_oh_eli5__score": [
      [
        0,
        761,
        0.05164
      ]
    ]
  },
  "source": "dclm-hero-run-fasttext_for_HF"
}
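
A minimal sketch of reading one of these attribute records, assuming each triple is a `[start, end, score]` span over the document text (the values 14731 and 761 look like document lengths, but that reading is an assumption). The helper name `extract_scores` is hypothetical and not part of dolma:

```python
import json

# One record copied from the example output above, as a single JSON line.
record_line = (
    '{"id": "http://atomphotocomp.org/2018-judge/andrea-ulbrick/", '
    '"attributes": {"dclm_fasttext__dclm_oh_eli5__score": [[0, 761, 0.05164]]}, '
    '"source": "dclm-hero-run-fasttext_for_HF"}'
)

def extract_scores(line, key="dclm_fasttext__dclm_oh_eli5__score"):
    """Return (start, end, score) tuples for one attribute record.

    Assumes each entry under `key` is a [start, end, score] triple.
    """
    record = json.loads(line)
    spans = record["attributes"].get(key, [])
    return [(int(start), int(end), float(score)) for start, end, score in spans]

spans = extract_scores(record_line)
print(spans)
```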
---original---
0.07910442352294922
0.021570026874542236
0.024592816829681396
0.07086175680160522
0.221385657787323
0.06914657354354858
0.03766465187072754
0.02159935235977173
0.020305633544921875
0.051635682582855225

---dolma tagger---
0.0791
0.02157
0.02459
0.07086
0.22139
0.06915
0.03766
0.0216
0.02031
0.05164
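
The two lists above appear to differ only in precision: each tagger score matches the original score rounded to five decimal places. The sketch below checks that for the ten values shown. The five-decimal rounding is inferred from the numbers here, not confirmed from the tagger's code:

```python
# Scores copied from the "original" list above.
original = [
    0.07910442352294922,
    0.021570026874542236,
    0.024592816829681396,
    0.07086175680160522,
    0.221385657787323,
    0.06914657354354858,
    0.03766465187072754,
    0.02159935235977173,
    0.020305633544921875,
    0.051635682582855225,
]

# Scores copied from the "dolma tagger" list above.
tagger = [0.0791, 0.02157, 0.02459, 0.07086, 0.22139,
          0.06915, 0.03766, 0.0216, 0.02031, 0.05164]

# Assumption: the tagger rounds to 5 decimal places before writing.
rounded = [round(score, 5) for score in original]

# Compare with a small tolerance rather than exact float equality.
mismatches = [
    (o, r, t) for o, r, t in zip(original, rounded, tagger)
    if abs(r - t) > 1e-9
]
print("mismatches:", mismatches)
```

An empty mismatch list would indicate the tagger reproduces the original classifier scores up to rounding.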