Closed AtlantisPleb closed 1 year ago
I really like the algo from this site. It's very useful
And have API to use. I guess you can integrate with the GPUtopia as well
anything that uses openai endpoints can use gputopia instead. we support embeddings, inference etc.
Hey, so this is not quite topic modeling but should work for your objective of programmatically creating question/answer training pairs. This is like the prototypical way of doing it.
Going to list out the steps here.
Cool, we're going with that. Awarding 1M sats to @skeptrunedev for the above
Create a list of questions and answers for training a fine-tuned AI on this document fragment, be sure to include the document date, form type and the name of the company in every question or answer.
<FRAGMENT HERE>
Here's what LLAMA came up with on a recent 10K SEC filing document, pasted as raw HTML and it's pretty good... more then good enough, imo.
Question: Regarding the December 31, 2022 fiscal year filing for Good Gaming Inc, what document type is mentioned in the provided text?
Answer: The document type mentioned is 'Form 10-K/A.'
Question: What is the Commission File Number mentioned in the Regarding the December 31, 2022 fiscal year filing document for Good Gaming, Inc.?
Answer: The Commission File Number mentioned is 000-53949.
Question: Is Good Gaming, Inc. a well-known seasoned issuer according to the document dated November 6, 2023?
Answer: No, according to the document dated November 6, 2023, Good Gaming, Inc. is not a well-known seasoned issuer.
Question: Does the November 6 document state that Good Gaming, Inc. has filed all required reports in the past 12 months as of November 6, 2023?
Answer: Yes, the document states that Good Gaming, Inc. has filed all required reports in the past 12 months as of November 6, 2023.
Question: According to the November 6 10K document, has Good Gaming, Inc. submitted all Interactive Data Files required to be submitted and posted on its corporate website during the past 12 months?
Answer: Yes, according to the document, Good Gaming, Inc. has submitted all Interactive Data Files required to be submitted and posted on its corporate website during the past 12 months.
Question: Is there any disclosure of delinquent filers pursuant to Item 405 of Regulation S-K in the November 6 10K document for Good Gaming Inc?
Answer: No, there is no disclosure of delinquent filers pursuant to Item 405 of Regulation S-K in the document.
Question: According to the November 6 10K document for Good Gaming Inc, what is the aggregate market value of voting and non-voting common equity held by non-affiliates as of June 30, 2022?
Answer: The aggregate market value of voting and non-voting common equity held by non-affiliates as of June 30, 2022, is $4,536,775.
Question: How many shares outstanding of each of the issuer’s classes of common equity are there as of November 1, 2023, according to the November 6 10K document for Good Gaming Inc?
Answer: There are 118,320,144 shares outstanding of each of the issuer’s classes of common equity as of November 1, 2023, according to the document.
Question: What is the purpose of Amendment No. 1 to the Annual Report on the November 6 10K document for Good Gaming Inc,?
Answer: The purpose of Amendment No. 1 to the Annual Report on Form 10-K is solely to include the audit opinion for the period ended December 31, 2021, with the audit opinion for the period ended December 31, 2022. No other changes have been made to the Form 10K.
....
etc
Given a text document (PDF, Markdown, etc.) of any length, we need to extract a list of relevant topics, perhaps 5-20 or more.
This is to help us generate question/answer keypairs programmatically from a document which can be used for finetuning a model.
Once we have a list of topics, we can easily query the entire document for each topic for enough context to generate relevant question/answer keypairs via an LLM.
The question is, what is the best way to generate the list of topics?
Conceivably this is what topic modeling is for, but our current implementation using Gensim and LDA gives bad results. We can explore third-party services like this but unsure topic modeling is even the right approach, and prefer to minimize dependence on third parties.
Maybe we should generate topics another way like using semantic search via a service like Vectara or an LLM API like Claude that works with long documents.
Suggestions welcome!
Bounty: 1M sats to whomever describes an algorithm we end up using for this, or 2M sats if it's fully implemented with a PR we can easily merge.