Open gtsueng opened 8 months ago
The pricing estimate for 3,000,000 data points is between $2,000 and $3,000. The fluctuation is due to a few factors, mainly that failed API requests are still charged and that inputs vary in token size.
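For a rough sanity check, the estimate can be scaled down to the smaller pilot runs discussed in this thread. This is only a back-of-the-envelope sketch: it assumes cost scales linearly with record count (failed requests and varying token sizes will move the real number), and `estimate_cost` is a hypothetical helper name, not part of any existing code.

```python
# Back-of-the-envelope scaling of the quoted estimate.
# Assumption: cost scales linearly with the number of records.
TOTAL_RECORDS = 3_000_000
LOW_USD, HIGH_USD = 2000, 3000  # quoted range for TOTAL_RECORDS


def estimate_cost(n_records):
    """Return the (low, high) estimated cost in USD for n_records."""
    low = LOW_USD * n_records / TOTAL_RECORDS
    high = HIGH_USD * n_records / TOTAL_RECORDS
    return low, high


if __name__ == "__main__":
    for n in (120, 1000, TOTAL_RECORDS):
        low, high = estimate_cost(n)
        print(f"{n:>9,} records: ${low:,.2f} to ${high:,.2f}")
```

Under that linear assumption, a run of about 120 records would cost roughly $0.08 to $0.12 and a run of about 1000 records roughly $0.67 to $1.00, so the pilot runs should be negligible relative to the full job.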
@ZubairQazi Please do an initial run with the following two repositories:
There should be about 100-120 records total. These repositories are "themed" so we can check the conceptual similarity against the themes of the repository to ensure that GPT3.5 turbo is performing as expected:
Themes/Topics for each repository:
Please dump the results and the metrics into a google spreadsheet for review.
Posting link for future reference:
The results for ClinEpiDB and VDJ Server look good. We will proceed to scale up the application to ~1000 records by applying it to the following two repositories:
The results from LINCS and ImmPort also look good, though there appears to be a strange formatting issue for a small number of results. Next steps:
The sample data was generated on 2024.03.06 and can be found here for evaluation
@ZubairQazi Can you do a topicCategory run for the new repos that @DylanWelzel recently added?
Additionally, please work with @DylanWelzel to ensure that any new records added during the updates since the original topicCategory run also get topicCategories assigned to them.
@ZubairQazi please also include
Discussion on Slack: https://suwulab.slack.com/archives/CPF4AEHHC/p1696874796709359
Andrew's GH repository: https://github.com/andrewsu/openai_topic_classification
Chunlei's API key should be used to facilitate project billing
Details: Using the more expensive ChatGPT4 is not necessary, as ChatGPT3.5 appears to perform this task sufficiently well.
Potential testing/evaluation subset: Records with citation.pmid
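A minimal sketch of how that subset might be pulled out, assuming the records are available as Python dicts (e.g. loaded from JSON) and that `citation` may be either a single object or a list of objects. The helper names and the sample records are hypothetical, for illustration only.

```python
def has_citation_pmid(record):
    """Return True if the record has at least one citation with a pmid."""
    citation = record.get("citation")
    if citation is None:
        return False
    # Assumption: `citation` is either one dict or a list of dicts.
    citations = citation if isinstance(citation, list) else [citation]
    return any(isinstance(c, dict) and c.get("pmid") for c in citations)


def select_test_subset(records):
    """Keep only the records that carry a citation.pmid."""
    return [r for r in records if has_citation_pmid(r)]


if __name__ == "__main__":
    # Made-up records, purely illustrative.
    sample = [
        {"_id": "a", "citation": {"pmid": "12345"}},
        {"_id": "b", "citation": [{"doi": "10.0/x"}, {"pmid": "67890"}]},
        {"_id": "c", "citation": {"doi": "10.0/y"}},
        {"_id": "d"},
    ]
    print([r["_id"] for r in select_test_subset(sample)])
```

Records with a PMID are a convenient evaluation set because the linked publication gives an independent reference point for judging the assigned topics.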