allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0

Collect 2T tokens #1

Closed: kyleclo closed this issue 4 months ago

rodneykinney commented 1 year ago

Using instructions here to get some basic statistics on overlap between different crawls based on the content_digest field.

rodneykinney commented 1 year ago

3.1B distinct values for content_digest in the latest crawl

SELECT 
count(distinct content_digest)
FROM ai2_llm.ccindex
WHERE crawl in ('CC-MAIN-2023-06')
AND subset='warc'

3128644597

6.4B for the last two crawls:

SELECT 
count(distinct content_digest)
FROM ai2_llm.ccindex
WHERE crawl in ('CC-MAIN-2023-06', 'CC-MAIN-2022-49')
AND subset='warc'

6424142394

which is just the sum of the individual distinct content_digest counts (i.e., essentially no digests are shared between the two crawls), so the digest is not useful for deduping.

rodneykinney commented 1 year ago

LLaMA uses a pipeline called cc_net

dirkgr commented 1 year ago

I have three instances running in AWS that are downloading the most recent three checkpoints from CC: https://us-east-1.console.aws.amazon.com/ec2/home?region=us-east-1#Instances:v=3;$case=tags:true%5C,client:false;$regex=tags:false%5C,client:false

They don't have proper AI2 users configured. You can log in as the ubuntu user using the key that's stored under the name "Dirk's Key" or something like that. The AllenNLP AWS account is pretty barren, so everything is easy to find.

The original C4 code starts here: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/text/c4.py#L506 It uses Apache Beam, so it's all written in this Apache Beam style.
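
For a sense of that style, here is a minimal, made-up sketch (not the actual C4 pipeline; the paths and the filter predicate are placeholders):

import apache_beam as beam

# Toy C4-style cleaning step: read text files, keep only lines that end in
# terminal punctuation, and write the survivors back out. Purely illustrative.
def ends_like_a_sentence(line: str) -> bool:
    return line.rstrip().endswith((".", "!", "?", '"'))

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/wet-text/*.txt")
        | "KeepSentences" >> beam.Filter(ends_like_a_sentence)
        | "Write" >> beam.io.WriteToText("gs://my-bucket/cleaned/part")
    )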

PileV2 spreadsheet is here: https://docs.google.com/spreadsheets/d/19IAFhqRvhRxdUj-df8PUOBI2W8aEqGmJmBcZvXOuDZY/edit#gid=0 All the Reddit dumps point to a download location. I have not tried to see what happens when you download from there. For one thing, I don't know if you get the Reddit threads already straightened out, or if this is the raw version before any cleaning. Also, Eleuther being Eleuther, they are filtering toxic subreddits before they make it into the model. I don't think we should do that, but we should know how much toxic content there is. Our model needs to see some toxic content, so it can be used to filter later, but it should not see an overwhelming amount.

rodneykinney commented 1 year ago

Steps to install CCNet on AMI ami-0d70546e43a941d70:

sudo apt install cmake
sudo apt install build-essential libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev
# make install assumes you are inside a checkout of the cc_net repo
make install
pip install cc_net[getpy]

rodneykinney commented 1 year ago

Processing of the first snapshot failed overall, but it did leave some partial output. The pipeline produces JSON-lines files segmented by language:

$ ls mined_split/2019-09/1581/ | head -10
af_all.json.gz
af_all.json.gz.index
als_all.json.gz
als_all.json.gz.index
am_all.json.gz
am_all.json.gz.index
an_all.json.gz
an_all.json.gz.index
ar_all.json.gz
ar_all.json.gz.index

Sample line from the en output:

{
  "url": "http://1019therock.com/couple-and-mother-charged-in-ludlow-meth-bust/",
  "date_download": "2019-02-24T04:11:06Z",
  "digest": "sha1:LVY5PMQCUPDAGSFETJH2N2HIKGBOJSV4",
  "length": 1548,
  "nlines": 10,
  "source_domain": "1019therock.com",
  "title": "Couple and Mother Charged in Ludlow Meth Bust",
  "raw_content": "Couple and Mother Charged in Ludlow Meth Bust\nFor the second time in less than eight months, a southern Aroostook couple has been arrested on methamphetamine charges, and the woman's mother has also been charged.\nThe arrests came after Maine Drug Enforcement Agents say they found the makings of a meth lab inside a remote cabin in Ludlow, just west of Houlton, according to Public Safety department spokesman Steve McCausland. Agents were conducting a bail check Tuesday afternoon in relation to the charges from June 2015 when they made the discovery,\nAroostook County Sheriff’s Deputies and drug agents charged 31-year-old James Anthony, 26-year-old Kayla Nason, along with Nason’s mother, 48-year-old Tara Walton.\nThe three were arrested at the cabin on Townline Road Tuesday and charged with trafficking in methamphetamine and were taken to the Aroostook County Jail, McCausland said. Anthony and Walton were also charged with violating their bail conditions.\nThe MDEA’s meth lab response team was working at the cabin in Ludlow Wednesday to gather evidence and dispose of the dangerous and explosive chemicals.\nLast June, Anthony and Nason were arrested after sheriff’s deputies found the two were cooking meth inside their car on the Ludlow Road in Ludlow. Nason at the time was treated and released for chemical burns as a result to her exposure to the methamphetamine.\nThis is the 12th meth related incident in Maine this year, McCausland said.\nNEXT: Presque Isle Woman Arrested in Alleged Arson Fire\nFiled Under: Aroostook, arrest, Ludlow",
  "cc_segment": "crawl-data/CC-MAIN-2019-09/segments/1550249578748.86/wet/CC-MAIN-20190224023850-20190224045850-00520.warc.wet.gz",
  "original_nlines": 122,
  "original_length": 3275,
  "line_ids": [
    85,
    90,
    92,
    93,
    94,
    95,
    96,
    97,
    98,
    99
  ],
  "language": "en",
  "language_score": 0.98,
  "bucket": "all"
}

I interpret this to mean that the original doc had 122 lines, and only 10 remained after de-duping.
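
As a quick sanity check, here is one way to see how much of each document survived, using only the fields shown in the sample record above (the path follows the shard listing earlier in this comment):

import gzip
import json

# Stream one language shard of the cc_net output and report how many of each
# document's original lines survived line-level deduping.
with gzip.open("mined_split/2019-09/1581/en_all.json.gz", "rt") as f:
    for line in f:
        doc = json.loads(line)
        frac = doc["nlines"] / doc["original_nlines"]
        print(f'{doc["url"]}: kept {doc["nlines"]}/{doc["original_nlines"]} lines ({frac:.0%})')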

rodneykinney commented 1 year ago

Failure looks like this issue

rodneykinney commented 1 year ago

The line-level deduping does a great job cleaning up the text. Here is the original document:

Couple and Mother Charged in Ludlow Meth Bust
What's Hot:
High School Basketball
Krazy Jake Live
Red Sox Road Trip
Community Spotlight
Jobs With Us
Patriots News
Newsletter
The Rock Mobile App
The Rock on Alexa
Maine News
New Brunswick News
Listen Live
Live In Concert
Golf Cards
Deals
Celtics Bus Trip
Quebec Winter Carnival
Pick 'Em 2018
Patriots Schedule
Sign In
Home
On Air
Full Schedule
Dick Palm
McKenzie Rae
Ultimate Classic Rock
Live In Concert
News on the Rock
Mark Shaw
Listen
Listen Live
Mobile App
Rock Squad
Pick 'Em 2018
Join Now
Rock Newsletter
Contests
Playlist
Events
Krazy Jake Live
Red Sox Road Trip
SOLD OUT: Celtics Bus Trip
SOLD OUT: Quebec Winter Carnival
Deals
Win Stuff
Contact
Help & Contact
Send Feedback
Advertise
Jobs With Us
More
Home
On Air
Full Schedule
Dick Palm
McKenzie Rae
Ultimate Classic Rock
Live In Concert
News on the Rock
Mark Shaw
Listen
Listen Live
Mobile App
Rock Squad
Pick 'Em 2018
Join Now
Rock Newsletter
Contests
Playlist
Events
Krazy Jake Live
Red Sox Road Trip
SOLD OUT: Celtics Bus Trip
SOLD OUT: Quebec Winter Carnival
Deals
Win Stuff
Contact
Help & Contact
Send Feedback
Advertise
Jobs With Us
Listen Now
101.9 The Rock101.9 The Rock
INSTAGRAM
Couple and Mother Charged in Ludlow Meth Bust
Mark Shaw
MDEA
Share on Twitter
Share on Facebook
For the second time in less than eight months, a southern Aroostook couple has been arrested on methamphetamine charges, and the woman's mother has also been charged.
MDEA
The arrests came after Maine Drug Enforcement Agents say they found the makings of a meth lab inside a remote cabin in Ludlow, just west of Houlton, according to Public Safety department spokesman Steve McCausland. Agents were conducting a bail check Tuesday afternoon in relation to the charges from June 2015 when they made the discovery,
Aroostook County Sheriff’s Deputies and drug agents charged 31-year-old James Anthony, 26-year-old Kayla Nason, along with Nason’s mother, 48-year-old Tara Walton.
The three were arrested at the cabin on Townline Road Tuesday and charged with trafficking in methamphetamine and were taken to the Aroostook County Jail, McCausland said. Anthony and Walton were also charged with violating their bail conditions.
The MDEA’s meth lab response team was working at the cabin in Ludlow Wednesday to gather evidence and dispose of the dangerous and explosive chemicals.
Last June, Anthony and Nason were arrested after sheriff’s deputies found the two were cooking meth inside their car on the Ludlow Road in Ludlow. Nason at the time was treated and released for chemical burns as a result to her exposure to the methamphetamine.
This is the 12th meth related incident in Maine this year, McCausland said.
NEXT: Presque Isle Woman Arrested in Alleged Arson Fire
Filed Under: Aroostook, arrest, Ludlow
Categories: Local News, Maine News
Comments
Leave A Comment
Back To Top
Featured
Patriots' Owner Robert Kraft Charged In Prostitution Sting
Recommended for You
Information
Loudwire Network
EEO
Marketing and Advertising Solutions
Public File
Report an Inaccuracy
Terms
VIP Terms
FAQ
Contest Rules
Privacy Policy (Updated: 12/14/18)
Contact
Business Listings
Follow Us
2019 101.9 The Rock is part of the Loudwire Network, Townsquare Media, Inc. All rights reserved.

After de-duping, it looks like this:

Couple and Mother Charged in Ludlow Meth Bust
For the second time in less than eight months, a southern Aroostook couple has been arrested on methamphetamine charges, and the woman's mother has also been charged.
The arrests came after Maine Drug Enforcement Agents say they found the makings of a meth lab inside a remote cabin in Ludlow, just west of Houlton, according to Public Safety department spokesman Steve McCausland. Agents were conducting a bail check Tuesday afternoon in relation to the charges from June 2015 when they made the discovery,
Aroostook County Sheriff’s Deputies and drug agents charged 31-year-old James Anthony, 26-year-old Kayla Nason, along with Nason’s mother, 48-year-old Tara Walton.
The three were arrested at the cabin on Townline Road Tuesday and charged with trafficking in methamphetamine and were taken to the Aroostook County Jail, McCausland said. Anthony and Walton were also charged with violating their bail conditions.
The MDEA’s meth lab response team was working at the cabin in Ludlow Wednesday to gather evidence and dispose of the dangerous and explosive chemicals.
Last June, Anthony and Nason were arrested after sheriff’s deputies found the two were cooking meth inside their car on the Ludlow Road in Ludlow. Nason at the time was treated and released for chemical burns as a result to her exposure to the methamphetamine.
This is the 12th meth related incident in Maine this year, McCausland said.
NEXT: Presque Isle Woman Arrested in Alleged Arson Fire
Filed Under: Aroostook, arrest, Ludlow

rodneykinney commented 1 year ago

A snapshot is divided into 1590 shards. Here's a token count for English-classified documents from a single shard of the 2019-09 snapshot:

$ gunzip --stdout ./0718/en_all.json.gz | jq '.raw_content' --raw-output | tr -cd ' \n' | wc -c
270894893

That would give us approximately 430B English tokens for the entire snapshot (roughly 271M whitespace-delimited tokens per shard × 1590 shards).

dirkgr commented 1 year ago

How do the token counts fall off when we add more snapshots?

dirkgr commented 1 year ago

Ah, also, we've been counting tokens in the other data sources using the Unicode universal tokenizer (UAX #29 word segmentation). https://uniseg-py.readthedocs.io/en/latest/index.html is a Python version, and there are implementations for at least C++ and Rust as well. For English it might not make a big difference, but it will for the other languages.
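
A minimal sketch of what that counting looks like with the Python package (this assumes uniseg.wordbreak.words and simply skips segments with no letters or digits; it's an illustration, not our exact counting code):

from uniseg.wordbreak import words

def count_words(text: str) -> int:
    # uniseg yields every segment, including runs of spaces and punctuation,
    # so only count segments that contain at least one alphanumeric character.
    return sum(1 for w in words(text) if any(c.isalnum() for c in w))

print(count_words("L'éléphant n'est pas là."))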

dirkgr commented 1 year ago

This is how I count tokens using uniseg: https://github.com/allenai/c5/blob/main/wet_path_to_pages.py#L17

rodneykinney commented 1 year ago

How do the token counts fall off when we add more snapshots?

The CCNet paper asserts "There is little content overlap between monthly snapshots" without explicitly computing the drop-off. In practical terms, you don't have enough RAM to fully dedupe even a single snapshot. They do find that the token counts start to flatten out even below 10% of a single snapshot.

https://www.semanticscholar.org/paper/CCNet%3A-Extracting-High-Quality-Monolingual-Datasets-Wenzek-Lachaux/c20c68c45127439139a08adb0b1f2b8354a94d6c/figure/6

The RAM requirements for deduping are shown here:

https://www.semanticscholar.org/paper/CCNet%3A-Extracting-High-Quality-Monolingual-Datasets-Wenzek-Lachaux/c20c68c45127439139a08adb0b1f2b8354a94d6c/figure/7

They settled on 3% of hashes used for deduping, although to my eyes, even using 1.5% is a pretty good trade-off. The overall process is RAM-bound, so you can double throughput by using the 1.5% threshold.
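
My rough mental model of that scheme, as a toy sketch (this is not the cc_net implementation, which keeps its hashes in compact flat arrays, hence the getpy extra): collect paragraph hashes from a small random sample of documents, then drop any paragraph anywhere whose hash appears in that sampled set. Boilerplate repeated across millions of pages is almost certain to land in the sample; a paragraph unique to one page almost certainly is not.

import hashlib
import random

SAMPLE_RATE = 0.015  # the 1.5% setting discussed above

def para_hash(paragraph: str) -> bytes:
    # 64-bit hash of a lightly normalized paragraph
    return hashlib.sha1(paragraph.strip().lower().encode("utf-8")).digest()[:8]

def build_sampled_hashes(docs):
    """Pass 1: collect paragraph hashes from a ~1.5% sample of documents."""
    sampled = set()
    for paragraphs in docs:          # each doc is a list of paragraphs
        if random.random() < SAMPLE_RATE:
            sampled.update(para_hash(p) for p in paragraphs)
    return sampled

def dedupe(docs, sampled):
    """Pass 2: drop every paragraph whose hash was seen in the sample."""
    return [[p for p in doc if para_hash(p) not in sampled] for doc in docs]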

dirkgr commented 1 year ago

Not sure how to interpret those graphs. Does that say that after de-duping a single snapshot, we should expect less than 30% of the original content to remain? The fact that it flattens out is also confusing. As we add more data, novelty increases? Why would this happen?

The fact that we can't de-dupe even a single snapshot this way seems problematic. You know, we could write the O(n) Bloom filter deduplication step in Scala or Java as well. And in fact, in a language like that, where threads are easy and fast, maybe we could bake in some other tricks.

rodneykinney commented 1 year ago

Yes, those graphs are saying that you are left with only 30% of the content after deduping each line with a random 1% sampling of other lines. It means that most of the content consists of lines that are repeated over and over. It makes total sense when you look at the example of the original and de-duped document I pasted above.

dirkgr commented 1 year ago

Is it measuring by number of paragraphs removed, or number of characters? It makes sense that small paragraphs (1-2 words) would be duplicated a lot.

rodneykinney commented 1 year ago

Is it measuring by number of paragraphs removed, or number of characters?

Those are characters.

rodneykinney commented 1 year ago

I have the pipeline tuned and running end-to-end. I've uploaded some sample data to s3://ai2-llm/pretraining-data/sources/common-crawl/samples/2019-09

The data is split by language. For each language, we have the option to split it up by perplexity buckets (head, middle, tail). For simplicity, I'm inclined to do this for English only.

The process is memory-bound, at a cost of 20GB per thread with a 1.5% sampling rate for deduping. An m6a.48xlarge has 768GB of RAM, so it can run ~35 threads. A single snapshot from 2019 yields about 400B tokens. The CCNet authors estimate 5000 CPU-hours to process a snapshot; my own benchmarking suggests closer to 2500. That works out to about 75 instance-hours (2500 CPU-hours / ~35 threads), or a dollar cost of $625. More recent snapshots are presumably larger. The cost per token will stay the same if we hold RAM usage at 20GB/thread (which effectively lowers the sampling rate), but will go up if we maintain the 1.5% sampling rate.

rodneykinney commented 1 year ago

Running on a u-3tb1 gives you more RAM per CPU, so the wall-clock time and dollar cost would be lower, about 17 instance hours and $450.

rodneykinney commented 1 year ago

Completed a run on a single snapshot to my satisfaction. Not uploading the full data to S3, but preserving it in this snapshot. I will tweak the configuration and start systematically processing snapshots next week.

rodneykinney commented 1 year ago

Within a single dump, there is < 1% duplication by URL:

SELECT bucket, count(*)
FROM (
  SELECT url,
    CASE WHEN count = 1 THEN '1'
         WHEN count < 6 THEN '2-5'
         WHEN count < 11 THEN '6-10'
         WHEN count < 21 THEN '11-20'
         WHEN count < 51 THEN '21-50'
         ELSE '51+' END AS bucket
  FROM (
    SELECT url, count(1) AS count
    FROM ccindex
    WHERE crawl = 'CC-MAIN-2023-06'
      AND subset = 'warc'
    GROUP BY url
  ) url_counts
) url_buckets
GROUP BY bucket

Occurrences  URLs
1            3158028434
2-5          13162713
6-10         185912
11-20        55896
21-50        23288
51+          7628

kyleclo commented 1 year ago

@rodneykinney when you sample the exact URL matches, does the text look highly similar?

rodneykinney commented 1 year ago

Within a single dump, there is < 1% duplication by URL

Athena timed out running the same query across multiple dumps

rodneykinney commented 1 year ago

With 3.1B unique URLs per dump, it would take about 70GB of RAM to hash them into the same data structure used by cc_net for paragraph-level deduping. So we could do exact URL-level deduping across all dumps on a single machine.

rodneykinney commented 1 year ago

Observations on using bff for paragraph-level deduping:

Runs fine on a server machine. Run-time is about 2x the merger: 100 CPU hours per CC dump. I used a 150GB Bloom filter, which has an estimated 0.3% false-positive rate for 100B n-grams. (One dump has ~500B tokens over ~3B documents.)

Unfortunately, even though the false-positive rate is small, the rate of duplication is also small. Given that a paragraph was removed by the filter, the odds are about even that it was an actual duplicate vs. a false positive. I looked through examples of paragraphs that would have been removed, and the only examples I saw that were not false positives were duplicated within the document itself.

Given the cost, and the unknown effects of removing even < 1% of paragraphs at random, I don't think we should do probabilistic paragraph-level deduping. We should consider within-document exact paragraph-level deduping.

Exact URL deduping is tractable, but we don't have code that will do it. The CCNet code would only work single-threaded. Rust has a concurrent hash set, so we could implement it. We could also make minor modifications to bff to do probabilistic URL deduping. It would run much faster: no tokenization, only one thing to hash per document. Dropping a complete document due to a false positive is better than dropping a paragraph, because it doesn't affect the text's coherence. Because we would be sending far fewer things through the filter, we could also make the false-positive rate much smaller.
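
To make the idea concrete, here is a toy Python sketch of Bloom-filter URL deduping (the real version would be a small change to the Rust bff code; the hand-rolled filter below is just to show the mechanics, and only the url field gets hashed):

import hashlib
import json

class UrlBloom:
    """Toy Bloom filter keyed on URLs (stand-in for a modified bff)."""

    def __init__(self, n_bits: int = 1 << 30, n_hashes: int = 8):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, url: str):
        d = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1  # odd step for double hashing
        for i in range(self.n_hashes):
            yield (h1 + i * h2) % self.n_bits

    def seen_or_add(self, url: str) -> bool:
        """True if the URL was (probably) seen before; records it either way."""
        positions = list(self._positions(url))
        seen = all(self.bits[p >> 3] & (1 << (p & 7)) for p in positions)
        for p in positions:
            self.bits[p >> 3] |= 1 << (p & 7)
        return seen

def drop_duplicate_urls(jsonl_lines, bloom: UrlBloom):
    # Keep the first document seen for each URL; later documents with the same
    # (probably seen) URL are dropped wholesale, so no half-gutted documents.
    for line in jsonl_lines:
        doc = json.loads(line)
        if not bloom.seen_or_add(doc["url"]):
            yield doc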

dirkgr commented 1 year ago

We can also make the false positive rate smaller by using a bigger filter. 150GB is not very big.

dirkgr commented 1 year ago

Wait, the 0.3% false-positive rate is per n-gram. But a paragraph needs to have 80% of its n-grams come up positive to be removed. That should result in a far lower false-positive rate for paragraphs. If you're seeing 0.3% per paragraph, something is up.
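
Back-of-envelope for that, assuming the 80%-of-n-grams removal rule and independent lookups (the paragraph size here is just for illustration):

from math import comb

p = 0.003            # per-ngram false-positive rate of the filter
k = 20               # n-grams in a smallish paragraph (illustrative)
threshold = int(0.8 * k)

# Probability that at least 80% of k independent lookups are false positives.
p_fp_paragraph = sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(threshold, k + 1))
print(p_fp_paragraph)  # ~2e-37, nothing like the 0.3% per-ngram rate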

dirkgr commented 1 year ago

One more thought: the false-positive rate it reports is the rate at the end of filtering, i.e., for the last n-gram inserted. For the first n-gram the false-positive probability is 0; it rises slowly for most of the process, then climbs sharply toward the end as the filter fills up.
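
With the standard approximation p ≈ (1 - exp(-k*n/m))^k for a filter with m bits, k hash functions, and n items inserted, the growth for a 150GB / 100B-item filter looks roughly like this (k = 8 is an assumption; it's what lands at ~0.3% when full):

from math import exp

m = 150e9 * 8    # filter size in bits (150 GB)
n_total = 100e9  # n-grams inserted by the end of the run
k = 8            # hash functions (assumed)

for fill in (0.1, 0.25, 0.5, 0.75, 1.0):
    n = fill * n_total
    fp = (1 - exp(-k * n / m)) ** k
    print(f"{fill:4.0%} full: false-positive rate ~ {fp:.1e}")
# rises from ~3e-10 at 10% full to ~3e-3 (0.3%) when full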

rodneykinney commented 1 year ago

From the analysis, there's about a 1% rate of duplication by URL within a dump. Paragraph-level deduping is probably not the right way to handle these even if the error rate were zero: at best, we'd end up removing the relevant content from the duplicates, leaving behind a junk shell. Using the Bloom filter to dedupe by URL will simply drop the dupes, and it's orders of magnitude faster. I've got a branch with a modified bff that I will test out.

rodneykinney commented 1 year ago

Deduped two combined dumps by URL. Number of removed documents was still ~1%, suggesting little overlap between dumps.

dirkgr commented 1 year ago

I deduped one of the dumps that came out of the C5 repo, and it removed over 30% of the data. Where does the difference come from?

rodneykinney commented 1 year ago

The deduping I'm running now is after the deduping already done by the CCNet code, which isn't exhaustive, but does remove a lot of the content. https://github.com/allenai/LLM/issues/1#issuecomment-1462725939

rodneykinney commented 1 year ago

Here's some data on the duplication rate across CC dumps.

Using Dirk's Bloom filter to discard documents whose URL has been seen before, here is the fraction of documents retained as we stream over 25 dumps, going backwards from the most recent.

[chart: fraction of documents with previously-unseen URLs, per dump, across 25 dumps]

The fraction of unseen URLs flattens out at about 30-40%, so each dump does continue to contribute distinct content. I would expect this to continue if we process more of them.

rodneykinney commented 1 year ago

Uploaded 25 URL-deduped dumps into s3://ai2-llm/pretraining-data/sources/common-crawl/v1/documents

100% English
Compressed size: 11 TB
High/Mid/Low fluency split: 20/25/55%
Number of documents: ~3B
Number of tokens: 4.8T
Number of characters: ~30T