NIAID-Data-Ecosystem / nde-crawlers

Harvesting infrastructure to collect and standardize dataset and computational tool metadata
Apache License 2.0

[Augmentation] add topicCategory values using ChatGPT #114

Open gtsueng opened 8 months ago

gtsueng commented 8 months ago

Discussion on Slack: https://suwulab.slack.com/archives/CPF4AEHHC/p1696874796709359

Andrew's GH repository: https://github.com/andrewsu/openai_topic_classification

Chunlei's API key should be used to facilitate project billing

Details: The pricier GPT-4 is not necessary, since GPT-3.5 appears to perform this task well enough.

Potential testing/evaluation subset: Records with citation.pmid

ZubairQazi commented 4 months ago

The pricing estimate for 3,000,000 data points is between $2,000 and $3,000. The range reflects a few factors, mainly that failed API requests are still charged and that inputs vary in token count.
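
For reference, a rough sketch of how an estimate like this can be computed. It is not the calculation actually used for the figures above. The function name, the per-1K-token prices, the token counts, and the failure rate are all hypothetical placeholders, and the retry model (each failed request is billed in full and retried until it succeeds) is an assumption.

```python
def estimate_cost(n_records, tokens_in, tokens_out,
                  price_in_per_1k, price_out_per_1k, failure_rate=0.0):
    """Estimate total API cost in dollars for classifying n_records.

    Assumes every request costs the same and that a failed request is
    billed in full and then retried until it succeeds, so the expected
    number of billed attempts per record is 1 / (1 - failure_rate).
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    # Cost of a single request: input tokens plus output tokens.
    per_request = (tokens_in / 1000) * price_in_per_1k \
        + (tokens_out / 1000) * price_out_per_1k
    # Expected number of billed requests, including retries.
    billed_requests = n_records / (1.0 - failure_rate)
    return billed_requests * per_request


def estimate_range(n_records, tokens_in_low, tokens_in_high, **kwargs):
    """Return a (low, high) cost pair for a range of input sizes."""
    low = estimate_cost(n_records, tokens_in_low, **kwargs)
    high = estimate_cost(n_records, tokens_in_high, **kwargs)
    return low, high


if __name__ == "__main__":
    # Placeholder values only; substitute the real prices and measured
    # token counts before relying on the result.
    low, high = estimate_range(
        3_000_000, 800, 1500,
        tokens_out=50,
        price_in_per_1k=0.0005,
        price_out_per_1k=0.0015,
        failure_rate=0.05,
    )
    print(f"Estimated cost: ${low:,.0f} to ${high:,.0f}")
```

Separating the per-request cost from the number of billed requests makes it easy to see how each factor (input size, failure rate) moves the estimate.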

gtsueng commented 4 months ago

@ZubairQazi Please do an initial run with the following two repositories:

There should be about 100-120 records total. These repositories are "themed" so we can check the conceptual similarity against the themes of the repository to ensure that GPT3.5 turbo is performing as expected:

Themes/Topics for each repository:

Please dump the results and the metrics into a google spreadsheet for review.

ZubairQazi commented 4 months ago

Posting link for future reference:

https://docs.google.com/spreadsheets/d/1lG4hS-PQJ_IRxCc02W3Oz2OFMdYDMURRPbpm3Jg2kk0/edit#gid=1557475722

gtsueng commented 4 months ago

The results for ClinEpiDB and VDJ Server look good. We will proceed to scale up the application to ~1000 records by applying them to the following two repositories:

gtsueng commented 4 months ago

The results from LINCS and ImmPort also look good, though there appears to be a strange formatting issue for a small number of results. Next steps:

gtsueng commented 3 months ago

The sample data was generated on 2024.03.06 and can be found here for evaluation

gtsueng commented 2 weeks ago

@ZubairQazi Can you do a topicCategory run for the new repos that @DylanWelzel recently added?

Additionally, please work with @DylanWelzel to ensure that any new records added during the updates since the original topicCategory run also get topicCategories assigned to them.
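
One way to find the records that still need a topicCategory after an update is to filter on the field being absent or empty. The sketch below is only an illustration: it assumes records are available as Python dictionaries with an `_id` key, and the helper names are hypothetical, not part of the crawler code.

```python
def needs_topic_category(record):
    """True if the record has no topicCategory value yet.

    Treats a missing key, None, and an empty list as "not assigned".
    """
    return not record.get("topicCategory")


def select_unclassified(records):
    """Return the records that should be sent for classification."""
    return [r for r in records if needs_topic_category(r)]


if __name__ == "__main__":
    # Hypothetical sample records for illustration only.
    sample = [
        {"_id": "a", "topicCategory": [{"name": "Immunology"}]},
        {"_id": "b"},
        {"_id": "c", "topicCategory": []},
    ]
    pending = select_unclassified(sample)
    print([r["_id"] for r in pending])
```

Running this filter after each update would limit the classification run (and its cost) to records that were added since the previous run.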

gtsueng commented 5 days ago

@ZubairQazi please also include