freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Develop script to approximate a few token counts #3958

Closed: mlissner closed this issue 7 months ago

mlissner commented 7 months ago

We need a script to figure out a few token counts:

  1. How many tokens are in the text of the RECAP Archive? Let's take a random sample of 10,000 items from the DB where is_available=True and there is text in the plaintext column (see the sketch after this list). Let's count the tokens, and then extrapolate from there to the amount that is in the entire corpus.

  2. We need a per-page token count from RECAP so we can see if buying the biggest documents makes economic sense these days. It just might.

    Again, a random sample of documents would work well and we can do the simple thing: "The doc is ten pages, it has 15,000 words, therefore there are 1,500 words per page."

  3. How many tokens are in case law? Same as above, but we have a few things to do before measuring tokens:


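As a sketch of pulling the random sample described in item 1, assuming CourtListener's Django ORM; the model name, import path, and field names (RECAPDocument, is_available, plain_text, page_count) are assumptions based on the description above and may need adjusting to the actual schema:

```python
# A rough sketch, not the final script. Model and field names (RECAPDocument,
# is_available, plain_text, page_count) are assumptions based on the description above.
from cl.search.models import RECAPDocument

def get_random_recap_sample(sample_size: int = 10_000):
    """Random sample of available RECAP documents that have extracted text."""
    return (
        RECAPDocument.objects
        .filter(is_available=True)
        .exclude(plain_text="")
        .order_by("?")  # random ordering; slow on ~13M rows, but acceptable for a one-off script
        .values("id", "plain_text", "page_count")[:sample_size]
    )
```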
Counting tokens should be done using tiktoken, OpenAI's tokenizer, and the cl100k_base encoding model. A simple example of this is here:

https://stackoverflow.com/questions/75804599/openai-api-how-do-i-count-tokens-before-i-send-an-api-request/75804651#75804651

Another is here:

https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
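Along the lines of those examples, a minimal sketch of the counting and extrapolation itself; estimate_corpus_tokens and tokens_per_page are illustrative names, not part of any existing module:

```python
# Token counting with tiktoken's cl100k_base encoding, as described above.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Number of cl100k_base tokens in a string."""
    return len(encoding.encode(text))

def tokens_per_page(text: str, page_count: int) -> float:
    """Per-page token density for item 2: tokens in the text divided by its page count."""
    return count_tokens(text) / page_count

def estimate_corpus_tokens(sample_texts: list[str], corpus_size: int) -> float:
    """Extrapolate the corpus-wide token count from a random sample of texts."""
    avg_tokens = sum(count_tokens(t) for t in sample_texts) / len(sample_texts)
    return avg_tokens * corpus_size
```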

mlissner commented 7 months ago

A few stats for this project, just based on the documents we know about in RECAP:

| Page Count | Num docs | Cost |
|---|---|---|
| 30-100 | 372,185 | $1,116,555 |
| 101-1000 | 88,857 | $266,571 |
| 1001-5000 | 1,296 | $3,888 |
| 5001-10000 | 25 | $75 |
| 10,001 and up | 3 | $9 |
| Total | 462,366 | $1,387,098 |
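
Each row's cost works out to exactly $3.00 per document (e.g., 372,185 × $3 = $1,116,555), which appears to assume PACER's $3.00 per-document fee cap, the rate reached at 30 pages. A quick check of the arithmetic:

```python
# Sanity check on the table above: every cost equals num_docs * $3.00,
# consistent with a flat $3.00-per-document assumption (PACER's fee cap).
doc_counts = [372_185, 88_857, 1_296, 25, 3]
assert sum(doc_counts) == 462_366
assert sum(n * 3 for n in doc_counts) == 1_387_098
```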

So now we need the average tokens per page, and then we'll know if this is worth pursuing.

mlissner commented 7 months ago

OK, we got some results. I ran it twice to see how much it would change:

```
Starting to retrieve the random RECAP dataset.
Computing averages.
Counting the total number of documents in the Archive.
Size of the dataset: 13343
Average tokens per document: 5669.205201229109
Average words per page: 230.35629168852583
Average tokens per page: 429.8012462539271
--------------------
Total number of recap documents: 13370694
The sample represents 0.100% of the Archive
Total number of tokens in the recap archive: 75.8 billion

Starting to retrieve the random Opinion dataset.
Computing averages.
Counting the total number of Opinions in the Archive.
Size of the dataset: 10263
Average tokens per opinion: 2449.5310338107765
Average words per opinion: 1722.1270583650005
--------------------
Total number of opinions: 9728380
The sample represents 0.105% of the Caselaw
Total number of tokens in caselaw: 23.8 billion
```
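
The corpus totals above are just the per-item averages multiplied by the corpus sizes; a quick check of the first run's arithmetic:

```python
# Extrapolation: total tokens ≈ average tokens per item × number of items in the corpus.
recap_tokens = 5_669.205 * 13_370_694    # ≈ 75.8 billion
caselaw_tokens = 2_449.531 * 9_728_380   # ≈ 23.8 billion
print(f"RECAP: {recap_tokens / 1e9:.1f}B tokens; caselaw: {caselaw_tokens / 1e9:.1f}B tokens")
```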

And the second run:

```
Starting to retrieve the random RECAP dataset.
Computing averages.
Counting the total number of documents in the Archive.
Size of the dataset: 13050
Average tokens per document: 5911.472567049808
Average words per page: 228.53877394636015
Average tokens per page: 427.50561038447944
--------------------
Total number of recap documents: 13370695
The sample represents 0.098% of the Archive
Total number of tokens in the recap archive: 79.0 billion

Starting to retrieve the random Opinion dataset.
Computing averages.
Counting the total number of Opinions in the Archive.
Size of the dataset: 10570
Average tokens per opinion: 2464.71210974456
Average words per opinion: 1741.8554399243142
--------------------
Total number of opinions: 9728380
The sample represents 0.109% of the Caselaw
Total number of tokens in caselaw: 24.0 billion
```

One thing I'm realizing we still need is the total number of words in each corpus. This query gives us the total page count, which we can then multiply by the average words per page:

```sql
select sum(page_count) from search_recapdocument;
```

Once that completes, I should have some numbers...
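
For what it's worth, a hedged sketch of that last step: multiply the summed page count by the sampled words-per-page average to approximate total words. The function name and the default value (taken from the first run above) are illustrative, not from the issue:

```python
def estimate_total_words(total_page_count: int, avg_words_per_page: float = 230.36) -> float:
    """Approximate total words in RECAP: sum(page_count) times the sampled words-per-page average."""
    return total_page_count * avg_words_per_page
```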