Koeng101 / dnadesign

A Go package for designing DNA.

Wikipedia paperqa #26

Closed Koeng101 closed 9 months ago

Koeng101 commented 9 months ago

Something I've been thinking about is how to prove out the usefulness of the "synthetic biology oracle" I'd like to build with dnadesign. I think a good basic way would be to implement a vector database (using sqlite-vss) on Wikipedia data, which is only about 22 GB and can be encoded in far less (though this is less relevant, because it'll be in SQLite anyway).

There is a lot of basic knowledge on Wikipedia that could be useful to pull from and read. For example, if a user wanted to know how blue-white screening works, a quick search of Wikipedia would likely be better than a search of all scientific literature.
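As a rough illustration of the kind of lookup described above, here is a minimal Go sketch of a brute-force vector search over article embeddings. It is a stand-in for what a real index such as sqlite-vss would do, not that library's API. The `Article` type, the `TopK` function, and the toy three-dimensional embeddings mentioned in the note below are all assumptions made for the example; real embeddings would come from an embedding model.

```go
package main

import (
	"math"
	"sort"
)

// Article is a hypothetical record: a title plus a precomputed embedding.
type Article struct {
	Title     string
	Embedding []float64
}

// cosine returns the cosine similarity of two vectors, or 0 if the
// lengths differ or either vector has zero norm.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

// TopK returns the titles of the k articles most similar to query,
// best match first. It scans every article, which a real vector index
// would avoid.
func TopK(articles []Article, query []float64, k int) []string {
	type scored struct {
		title string
		score float64
	}
	results := make([]scored, 0, len(articles))
	for _, a := range articles {
		results = append(results, scored{a.Title, cosine(a.Embedding, query)})
	}
	sort.SliceStable(results, func(i, j int) bool {
		return results[i].score > results[j].score
	})
	if k < 0 {
		k = 0
	}
	if k > len(results) {
		k = len(results)
	}
	titles := make([]string, 0, k)
	for _, r := range results[:k] {
		titles = append(titles, r.title)
	}
	return titles
}
```

Calling `TopK(articles, queryEmbedding, 2)` with a query embedding close to a "Blue-white screen" article would rank that article first. A linear scan like this would not scale to all of Wikipedia, which is why an actual vector index is proposed.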

The idea would be to pull from the Wikipedia dumps and create a FAISS database over whole articles. These would then be available over an API. This data would be used to retrieve information, similar to how paperqa works. Except, instead of running Python code, LLMs would be instructed to write Lua code, which would then do the summarization and so on. Essentially, we want to give the LLMs the ability to run different kinds of summarization pipelines themselves, i.e., recursion.

Here is some scratch Lua code approximating what I'm thinking about:

local question = "What does lacZ do?"
local wikipedia_entries = search_wikipedia("lacZ function") -- vector search
local uniprot_entries = search_uniprot("lacZ function") -- vector search
local relevant_entries = select_sources(question, {wikipedia_entries, uniprot_entries}) -- have an LLM sort for the most relevant entries
local answers = answer_question(question, relevant_entries) -- have an LLM directly answer using relevant entries
local full_summary = summarize(question, answers) -- summarize all information from all answers

print(full_summary) -- return summary, which can be used downstream

For example, the select_sources function actually generates Lua code based on the answer it finds for each source, and that Lua code can then initiate another search.
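To make the recursion concrete, here is a minimal Go sketch of a search step whose selection stage can request further searches. It simplifies the idea considerably: instead of an LLM emitting Lua code, the hypothetical `Selector` just returns follow-up query strings. `Searcher`, `Selector`, `Gather`, and the `maxDepth` limit are all assumptions for illustration, not part of dnadesign.

```go
package main

// Searcher returns entries for a query, e.g. a vector search over Wikipedia.
type Searcher func(query string) []string

// Selector stands in for the LLM step. Given the question and candidate
// entries, it returns the entries worth keeping and any follow-up
// queries it wants to run.
type Selector func(question string, entries []string) (keep []string, followUps []string)

// Gather collects relevant entries for question, starting from query and
// following the selector's follow-up queries recursively. maxDepth bounds
// the recursion, and repeated queries are skipped to avoid loops.
func Gather(question, query string, search Searcher, sel Selector, maxDepth int) []string {
	seen := map[string]bool{}
	var collected []string

	var visit func(q string, depth int)
	visit = func(q string, depth int) {
		if depth > maxDepth || seen[q] {
			return
		}
		seen[q] = true
		keep, followUps := sel(question, search(q))
		collected = append(collected, keep...)
		for _, next := range followUps {
			visit(next, depth+1)
		}
	}

	visit(query, 0)
	return collected
}
```

With `maxDepth` set to 0 only the initial search runs; larger values let one search trigger another. A depth limit and a set of already-seen queries are one simple way to keep an LLM-driven loop from running indefinitely.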

Koeng101 commented 9 months ago

Closing now to focus on the current PRs. May open later.