freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Develop script to approximate a few token counts #3958

Closed: mlissner closed this issue 7 months ago

mlissner commented 7 months ago

We need a script to figure out a few token counts:

  1. How many tokens are in the text of the RECAP Archive? Let's take a random sample of 10,000 items from the DB where is_available=True and there is text in the plaintext column (see the sketch after this list). Let's count the tokens, and then extrapolate from there to the amount that is in the entire corpus.

  2. We need a per-page token count from RECAP so we can see if buying the biggest documents makes economic sense these days. It just might.

    Again, a random sample of documents would work well and we can do the simple thing: "The doc is ten pages, it has 15,000 words, therefore there are 1,500 words per page."

  3. How many tokens are in case law? Same as above, but we have a few things to do before measuring tokens:


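As a sketch of pulling the random sample described in item 1, assuming CourtListener's Django ORM; the model name, import path, and field names (RECAPDocument, is_available, plain_text, page_count) are assumptions based on the description above and may need adjusting to the actual schema:

```python
# A rough sketch, not the final script. Model and field names (RECAPDocument,
# is_available, plain_text, page_count) are assumptions based on the description above.
from cl.search.models import RECAPDocument

def get_random_recap_sample(sample_size: int = 10_000):
    """Random sample of available RECAP documents that have extracted text."""
    return (
        RECAPDocument.objects
        .filter(is_available=True)
        .exclude(plain_text="")
        .order_by("?")  # random ordering; slow on ~13M rows, but acceptable for a one-off script
        .values("id", "plain_text", "page_count")[:sample_size]
    )
```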
Counting tokens should be done using tiktoken, OpenAI's tokenizer, and the cl100k_base encoding model. A simple example of this is here:

https://stackoverflow.com/questions/75804599/openai-api-how-do-i-count-tokens-before-i-send-an-api-request/75804651#75804651

Another is here:

https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
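Along the lines of those examples, a minimal sketch of the counting and extrapolation itself; estimate_corpus_tokens and tokens_per_page are illustrative names, not part of any existing module:

```python
# Token counting with tiktoken's cl100k_base encoding, as described above.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Number of cl100k_base tokens in a string."""
    return len(encoding.encode(text))

def tokens_per_page(text: str, page_count: int) -> float:
    """Per-page token density for item 2: tokens in the text divided by its page count."""
    return count_tokens(text) / page_count

def estimate_corpus_tokens(sample_texts: list[str], corpus_size: int) -> float:
    """Extrapolate the corpus-wide token count from a random sample of texts."""
    avg_tokens = sum(count_tokens(t) for t in sample_texts) / len(sample_texts)
    return avg_tokens * corpus_size
```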

mlissner commented 7 months ago

A few stats for this project, just based on the documents we know about in RECAP:

| Page Count | Num docs | Cost |
|---|---|---|
| 30-100 | 372,185 | $1,116,555 |
| 101-1000 | 88,857 | $266,571 |
| 1001-5000 | 1,296 | $3,888 |
| 5001-10000 | 25 | $75 |
| 10,001 and up | 3 | $9 |
| Total | 462,366 | $1,387,098 |
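
Each row's cost works out to exactly $3.00 per document (e.g., 372,185 × $3 = $1,116,555), which appears to assume PACER's $3.00 per-document fee cap, the rate reached at 30 pages. A quick check of the arithmetic:

```python
# Sanity check on the table above: every cost equals num_docs * $3.00,
# consistent with a flat $3.00-per-document assumption (PACER's fee cap).
doc_counts = [372_185, 88_857, 1_296, 25, 3]
assert sum(doc_counts) == 462_366
assert sum(n * 3 for n in doc_counts) == 1_387_098
```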

So now we need the average tokens per page, and then we'll know if this is worth pursuing.

mlissner commented 7 months ago

OK, we got some results. I ran it twice to see how much it would change:

```
Starting to retrieve the random RECAP dataset.
Computing averages.
Counting the total number of documents in the Archive.
Size of the dataset: 13343
Average tokens per document: 5669.205201229109
Average words per page: 230.35629168852583
Average tokens per page: 429.8012462539271
--------------------
Total number of recap documents: 13370694
The sample represents 0.100% of the Archive
Total number of tokens in the recap archive: 75.8 billion

Starting to retrieve the random Opinion dataset.
Computing averages.
Counting the total number of Opinions in the Archive.
Size of the dataset: 10263
Average tokens per opinion: 2449.5310338107765
Average words per opinion: 1722.1270583650005
--------------------
Total number of opinions: 9728380
The sample represents 0.105% of the Caselaw
Total number of tokens in caselaw: 23.8 billion
```
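
The corpus totals above are just the per-item averages multiplied by the corpus sizes; a quick check of the first run's arithmetic:

```python
# Extrapolation: total tokens ≈ average tokens per item × number of items in the corpus.
recap_tokens = 5_669.205 * 13_370_694    # ≈ 75.8 billion
caselaw_tokens = 2_449.531 * 9_728_380   # ≈ 23.8 billion
print(f"RECAP: {recap_tokens / 1e9:.1f}B tokens; caselaw: {caselaw_tokens / 1e9:.1f}B tokens")
```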

And the second run:

```
Starting to retrieve the random RECAP dataset.
Computing averages.
Counting the total number of documents in the Archive.
Size of the dataset: 13050
Average tokens per document: 5911.472567049808
Average words per page: 228.53877394636015
Average tokens per page: 427.50561038447944
--------------------
Total number of recap documents: 13370695
The sample represents 0.098% of the Archive
Total number of tokens in the recap archive: 79.0 billion

Starting to retrieve the random Opinion dataset.
Computing averages.
Counting the total number of Opinions in the Archive.
Size of the dataset: 10570
Average tokens per opinion: 2464.71210974456
Average words per opinion: 1741.8554399243142
--------------------
Total number of opinions: 9728380
The sample represents 0.109% of the Caselaw
Total number of tokens in caselaw: 24.0 billion
```

One thing I'm realizing we still need is the total number of words in each corpus. This query gives us the total page count, which we can then multiply by the average words per page:

```sql
select sum(page_count) from search_recapdocument;
```

Once that completes, I should have some numbers...
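
For what it's worth, a hedged sketch of that last step: multiply the summed page count by the sampled words-per-page average to approximate total words. The function name and the default value (taken from the first run above) are illustrative, not from the issue:

```python
def estimate_total_words(total_page_count: int, avg_words_per_page: float = 230.36) -> float:
    """Approximate total words in RECAP: sum(page_count) times the sampled words-per-page average."""
    return total_page_count * avg_words_per_page
```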