A few stats for this project, just based on what we know about in RECAP:
Page count | Number of docs | Cost ($3/doc cap) |
---|---|---|
30-100 | 372,185 | $1,116,555 |
101-1000 | 88,857 | $266,571 |
1001-5000 | 1,296 | $3,888 |
5001-10000 | 25 | $75 |
10,001 and up | 3 | $9 |
Total | 462,366 | $1,387,098 |
So now we need the average tokens per page, and we'll know if this is worth pursuing.
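For context, here's a rough sketch (not anything in the codebase) of the math we'd do with that number. The $0.10/page price and $3.00/document cap are PACER's standard fee schedule; the tokens-per-page value is a placeholder until the sampling script gives us a real one.

```python
# Rough sketch of the cost math once we have a tokens-per-page figure.
# PACER charges $0.10/page, capped at $3.00 per document, so every page
# past the 30th is effectively free -- which is why big documents might
# be a bargain.
def cost_per_million_tokens(page_count: int, tokens_per_page: float) -> float:
    cost = min(3.00, 0.10 * page_count)  # PACER's per-document cap
    tokens = page_count * tokens_per_page
    return cost / tokens * 1_000_000

# With a placeholder guess of ~400 tokens/page:
print(cost_per_million_tokens(30, 400))     # ~$250 per million tokens
print(cost_per_million_tokens(1_000, 400))  # ~$7.50 per million tokens
```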
OK, we got some results. I ran it twice to see how much it would change:
Starting to retrieve the random RECAP dataset.
Computing averages.
Counting the total number of documents in the Archive.
Size of the dataset: 13343
Average tokens per document: 5669.205201229109
Average words per page: 230.35629168852583
Average tokens per page: 429.8012462539271
--------------------
Total number of recap documents: 13370694
The sample represents 0.100% of the Archive
Total number of tokens in the recap archive: 75.8 billion
Starting to retrieve the random Opinion dataset.
Computing averages.
Counting the total number of Opinions in the Archive.
Size of the dataset: 10263
Average tokens per opinion: 2449.5310338107765
Average words per opinion: 1722.1270583650005
--------------------
Total number of opinions: 9728380
The sample represents 0.105% of the Caselaw
Total number of tokens in caselaw: 23.8 billion
And the second run:
Starting to retrieve the random RECAP dataset.
Computing averages.
Counting the total number of documents in the Archive.
Size of the dataset: 13050
Average tokens per document: 5911.472567049808
Average words per page: 228.53877394636015
Average tokens per page: 427.50561038447944
--------------------
Total number of recap documents: 13370695
The sample represents 0.098% of the Archive
Total number of tokens in the recap archive: 79.0 billion
Starting to retrieve the random Opinion dataset.
Computing averages.
Counting the total number of Opinions in the Archive.
Size of the dataset: 10570
Average tokens per opinion: 2464.71210974456
Average words per opinion: 1741.8554399243142
--------------------
Total number of opinions: 9728380
The sample represents 0.109% of the Caselaw
Total number of tokens in caselaw: 24.0 billion
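The totals printed above are just the sample average scaled to the whole corpus. A quick sanity check of that arithmetic, using the first run's numbers:

```python
# Reproduce the extrapolation from the first run's output above.
sample_size = 13_343
avg_tokens_per_doc = 5_669.2
total_docs = 13_370_694

sample_pct = sample_size / total_docs * 100     # -> ~0.100%
total_tokens = avg_tokens_per_doc * total_docs  # -> ~75.8 billion
print(f"{sample_pct:.3f}% sample, {total_tokens / 1e9:.1f}B tokens")
```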
One thing I'm realizing we still need is the total number of words in each corpus. This query gives us the total page count for RECAP:
`select sum(page_count) from search_recapdocument;`
Once that completes, I should have some numbers...
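When that query finishes, the word estimate should just be the page total times the sampled words-per-page average. A sketch, with the page total left as a placeholder:

```python
# Sketch only: TOTAL_PAGES is a placeholder for whatever sum(page_count)
# returns; the words-per-page average comes from the sampling runs above.
TOTAL_PAGES = 0            # fill in from the SQL query
AVG_WORDS_PER_PAGE = 230.4

total_words = TOTAL_PAGES * AVG_WORDS_PER_PAGE
print(f"~{total_words / 1e9:.1f} billion words in RECAP")
```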
We need a script to figure out a few token counts:
- How many tokens are in the text of the RECAP Archive? Let's take a random 10,000 items from the DB where `is_available=True` and there is text in the plaintext column, count their tokens, and then extrapolate from there to the amount in the entire corpus.
- We need a per-page token count from RECAP so we can see if buying the biggest documents makes economic sense these days. It just might. Again, a random sample of documents would work well, and we can do the simple thing: "The doc is ten pages and has 15,000 words, therefore there are 1,500 words per page."
- How many tokens are in case law? Same as above, but we have a few things to do before measuring tokens: start with the `html_with_citations` field, and if it's blank, use the fields in the order explained in the API: https://www.courtlistener.com/help/api/rest/#opinion-endpoint

Counting tokens should be done using `tiktoken`, OpenAI's tokenizer, and the `cl100k_base` encoding model. A simple example of this is here: https://stackoverflow.com/questions/75804599/openai-api-how-do-i-count-tokens-before-i-send-an-api-request/75804651#75804651
Another is here: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
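For completeness, a minimal version of what those links show, counting tokens with `tiktoken` and the `cl100k_base` encoding (the commented-out field access at the bottom is just an illustration, not a prescribed interface):

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def num_tokens(text: str) -> int:
    """Return the number of cl100k_base tokens in a string."""
    return len(encoding.encode(text))

# Illustration: count tokens in whatever text field we settle on, e.g. an
# opinion's html_with_citations or a RECAP document's extracted plain text.
# print(num_tokens(opinion.html_with_citations))
```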