Extract topics from documents

AtlantisPleb commented 1 year ago

Given a text document (PDF, Markdown, etc.) of any length, we need to extract a list of relevant topics, perhaps 5-20 or more.

This is to help us generate question/answer keypairs programmatically from a document which can be used for finetuning a model.

Once we have a list of topics, we can easily query the entire document for each topic for enough context to generate relevant question/answer keypairs via an LLM.

The question is, what is the best way to generate the list of topics?

Conceivably this is what topic modeling is for, but our current implementation using Gensim and LDA gives bad results. We could explore third-party services like this, but we're unsure topic modeling is even the right approach, and we'd prefer to minimize dependence on third parties.

Maybe we should generate topics another way, such as semantic search via a service like Vectara, or an LLM API like Claude that works with long documents.

Suggestions welcome!

Bounty: 1M sats to whomever describes an algorithm we end up using for this, or 2M sats if it's fully implemented with a PR we can easily merge.

quinteiroh commented 1 year ago

I really like the algorithm from this site. It's very useful:

https://www.chatpdf.com/

They also have an API. I guess you could integrate it with GPUtopia as well.

earonesty commented 1 year ago

Anything that uses OpenAI endpoints can use GPUtopia instead. We support embeddings, inference, etc.

skeptrunedev commented 1 year ago

Hey, so this is not quite topic modeling but should work for your objective of programmatically creating question/answer training pairs. This is like the prototypical way of doing it.

Going to list out the steps here.

1. Identify some arbitrary set of documents

2. Split the raw text from those documents into chunks

3. Use a text-embedding model to vectorize the chunks

4. Prompt an LLM with something like "Write as many questions as possible for something relating to foo separated by commas" where "foo" is your database vertical

5. Take those questions and perform a brute force search with something like simsearch to make sure they are sufficiently different

6. Repeat steps 4 and 5 until you have ~200 sufficiently different questions/queries

7. Feed the questions to an LLM at a high temperature and get the default output for each one (if your dataset is highly domain-specific then you may want to use RAG here)

8. Vectorize the outputs with a text-embedding model and, for each one, get the closest result by cosine distance from the chunks you embedded in step 3

9. Construct a fine-tuning dataset from the search queries and results returned by step 8
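A rough sketch of steps 2, 3, 5, 8 and 9 in plain Python, to make the data flow concrete. Everything here is an assumption for illustration: the function names, the 50-word chunk size and the 0.8 similarity threshold are made up, and `embed` is a toy bag-of-words stand-in, not a real text-embedding model. Steps 4 and 7 (the LLM calls) are left out, so `questions` and `answers` are assumed to come from whatever LLM you use.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=50):
    """Step 2: split raw text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Step 3 stand-in: a bag-of-words count vector, NOT a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def dedupe(questions, threshold=0.8):
    """Step 5: brute-force filter that drops questions too similar to one already kept."""
    kept = []
    for q in questions:
        if all(cosine(embed(q), embed(k)) < threshold for k in kept):
            kept.append(q)
    return kept


def closest_chunk(answer, chunks):
    """Step 8: return the chunk most similar to the LLM's answer."""
    return max(chunks, key=lambda c: cosine(embed(answer), embed(c)))


def build_dataset(questions, answers, chunks):
    """Step 9: pair each question with the chunk closest to its answer."""
    return [{"query": q, "result": closest_chunk(a, chunks)} for q, a in zip(questions, answers)]
```

Swapping `embed` for a real embedding model (and `cosine` for a dense-vector version) should leave the rest of the flow unchanged.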

AtlantisPleb commented 1 year ago

Cool, we're going with that. Awarding 1M sats to @skeptrunedev for the above

earonesty commented 1 year ago

Break the document into chunks.
Ask the AI to generate questions for each chunk:

Create a list of questions and answers for training a fine-tuned AI on this document fragment, be sure to include the document date, form type and the name of the company in every question or answer.

<FRAGMENT HERE>
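A minimal sketch of filling that prompt in for each chunk, assuming the chunks are already plain strings. `PROMPT_TEMPLATE` and `build_prompts` are hypothetical names, and sending the prompts to a model is left out.

```python
PROMPT_TEMPLATE = (
    "Create a list of questions and answers for training a fine-tuned AI on this "
    "document fragment, be sure to include the document date, form type and the "
    "name of the company in every question or answer.\n\n{fragment}"
)


def build_prompts(chunks):
    """Return one prompt per non-empty chunk, with the chunk in place of the fragment."""
    return [PROMPT_TEMPLATE.format(fragment=c.strip()) for c in chunks if c.strip()]
```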

Here's what LLAMA came up with on a recent 10-K SEC filing document, pasted as raw HTML, and it's pretty good... more than good enough, imo.

Question: Regarding the December 31, 2022 fiscal year filing for Good Gaming Inc, what document type is mentioned in the provided text?
Answer: The document type mentioned is 'Form 10-K/A.'

Question: What is the Commission File Number mentioned in the Regarding the December 31, 2022 fiscal year filing document for Good Gaming, Inc.?
Answer: The Commission File Number mentioned is 000-53949.

Question: Is Good Gaming, Inc. a well-known seasoned issuer according to the document dated November 6, 2023?
Answer: No, according to the document dated November 6, 2023, Good Gaming, Inc. is not a well-known seasoned issuer.

Question: Does the November 6 document state that Good Gaming, Inc. has filed all required reports in the past 12 months as of November 6, 2023?
Answer: Yes, the document states that Good Gaming, Inc. has filed all required reports in the past 12 months as of November 6, 2023.

Question: According to the November 6 10K document, has Good Gaming, Inc. submitted all Interactive Data Files required to be submitted and posted on its corporate website during the past 12 months?
Answer: Yes, according to the document, Good Gaming, Inc. has submitted all Interactive Data Files required to be submitted and posted on its corporate website during the past 12 months.

Question: Is there any disclosure of delinquent filers pursuant to Item 405 of Regulation S-K in the November 6 10K document for Good Gaming Inc?
Answer: No, there is no disclosure of delinquent filers pursuant to Item 405 of Regulation S-K in the document.

Question: According to the November 6 10K document for Good Gaming Inc, what is the aggregate market value of voting and non-voting common equity held by non-affiliates as of June 30, 2022?
Answer: The aggregate market value of voting and non-voting common equity held by non-affiliates as of June 30, 2022, is $4,536,775.

Question: How many shares outstanding of each of the issuer’s classes of common equity are there as of November 1, 2023, according to the November 6 10K document for Good Gaming Inc?
Answer: There are 118,320,144 shares outstanding of each of the issuer’s classes of common equity as of November 1, 2023, according to the document.

Question: What is the purpose of Amendment No. 1 to the Annual Report on the November 6 10K document for Good Gaming Inc,?
Answer: The purpose of Amendment No. 1 to the Annual Report on Form 10-K is solely to include the audit opinion for the period ended December 31, 2021, with the audit opinion for the period ended December 31, 2022. No other changes have been made to the Form 10K.

....

etc
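Output in this "Question: ... Answer: ..." shape can be turned into training records with a small parser. This is only a sketch: `parse_qa`, `to_jsonl` and the `question`/`answer` field names are assumptions, not a format any particular fine-tuning API requires.

```python
import json
import re

# Lazily capture each question and answer, stopping at the next "Question:" or end of text.
QA_PATTERN = re.compile(
    r"Question:\s*(.+?)\s*Answer:\s*(.+?)\s*(?=Question:|\Z)", re.DOTALL
)


def parse_qa(llm_output):
    """Extract question/answer pairs from LLM output in the format shown above."""
    return [{"question": q, "answer": a} for q, a in QA_PATTERN.findall(llm_output)]


def to_jsonl(pairs):
    """Serialize the pairs as JSON Lines, one record per line."""
    return "\n".join(json.dumps(p) for p in pairs)
```

Any text the model adds after the last answer gets absorbed into that answer, so trim the output first if the model tends to add closing remarks.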

earonesty commented 1 year ago

i copy pasted this: https://www.sec.gov/ix?doc=/Archives/edgar/data/1454742/000149315223039574/form10-ka.htm
