[Metadata Improvement]: Identify length limitations for ChatGPT inference

gtsueng commented 2 months ago

Issue Name

Identify length limitations for ChatGPT inference

Issue Description

ChatGPT is more likely to hallucinate results for measurementTechnique extraction when the combined 'name'+'description' text is short. Initial observations by Zubair suggest that, while this is highly dependent on the actual text, 15 words seems to be the minimum.

To validate or otherwise improve this estimate:

[x] Identify 25 records with name+description word counts of <10 words
[x] Identify 25 records with name+description word counts of 11-20 words
[x] Identify 25 records with name+description word counts of 21-30 words
[x] Identify 25 records with name+description word counts of 31-40 words
[x] Use ChatGPT to pull the measurementTechniques for these 100 records and evaluate whether each GPT-predicted technique can reasonably be identified from the name+description or is a hallucination
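
The sampling steps in the checklist above could be sketched roughly as follows. This is only an illustration, not the script that was actually used: it assumes each record is a dict with `name` and `description` keys, counts words by whitespace splitting, and the helper names (`word_count`, `bin_label`, `sample_by_length`) are hypothetical. The bins follow the checklist literally, so a record with exactly 10 words falls into no bin.

```python
import random

# Bins as written in the checklist: (label, lowest count, highest count), inclusive.
# Assumption: "<10" means 0-9 words, so exactly 10 words is left unassigned.
BINS = [("<10", 0, 9), ("11-20", 11, 20), ("21-30", 21, 30), ("31-40", 31, 40)]


def word_count(record):
    """Count whitespace-separated words in name + description."""
    text = f"{record.get('name') or ''} {record.get('description') or ''}"
    return len(text.split())


def bin_label(count):
    """Return the label of the bin containing count, or None if no bin matches."""
    for label, low, high in BINS:
        if low <= count <= high:
            return label
    return None


def sample_by_length(records, per_bin=25, seed=0):
    """Randomly pick up to per_bin records from each word-count bin."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    grouped = {label: [] for label, _, _ in BINS}
    for record in records:
        label = bin_label(word_count(record))
        if label is not None:
            grouped[label].append(record)
    return {
        label: rng.sample(items, min(per_bin, len(items)))
        for label, items in grouped.items()
    }
```

Calling `sample_by_length(records)` returns a dict keyed by bin label; a bin with fewer than 25 matching records simply returns all of them.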

Issue Discussion

A 50/50 version of this approach was discussed at the internal meeting dated 2024.04.24

Please select the type of metadata improvement

[ ] Standardization (normalizing free text to an ontology)
[X] Augmentation (adding values for metadata fields missing values)
[ ] Clean up (addressing redundancy or messy metadata)
[ ] Structure (changing the structuring of the metadata to support front end UI features)

Meta URL

No response

Related WBS task

https://github.com/NIAID-Data-Ecosystem/nde-roadmap/issues/13

For internal use only. Assignee, please select the status of this issue

[ ] Not yet started
[ ] In progress
[ ] Blocked
[ ] Will not address

Status Description

No response

Request status check list

[ ] This metadata improvement has yet to be discussed between NIAID, Scripps, Leidos
[X] This metadata improvement does not need to be discussed between NIAID, Scripps, Leidos
[ ] This metadata improvement has been discussed/reported between NIAID, Scripps, Leidos
[ ] This metadata improvement has been implemented locally to generate data for review
[ ] This metadata improvement has been implemented on Dev
[ ] This metadata improvement has been implemented on Dev and the results have been reviewed and approved for staging
[ ] This metadata improvement has been implemented on Staging
[ ] This page/documentation/change has been approved for Production
[ ] This page/documentation/change has been implemented on Production

ZubairQazi commented 2 months ago

Sample spreadsheet:

https://docs.google.com/spreadsheets/d/1lG4hS-PQJ_IRxCc02W3Oz2OFMdYDMURRPbpm3Jg2kk0/edit#gid=1253034259

gtsueng commented 1 month ago

Moved here to keep measurementTechniques separate from topicCategories: https://docs.google.com/spreadsheets/d/1jkhidFmsp0f_yL8S5wpZ-oBA-eLhQQmESQEq4Lrhx3M/edit#gid=1969630319

gtsueng commented 1 month ago

@ZubairQazi it looks like the thresholds may vary by repo, but I'm not 100% sure since OMICS-DI records outnumber everything else. For the following repositories:

Harvard Dataverse
Mendeley
Zenodo

Can you run GPT for measurementTechnique extraction for the following:

5 records/repo with name+description word counts of <10 words
5 records/repo with name+description word counts of 11-20 words
5 records/repo with name+description word counts of 21-30 words
5 records/repo with name+description word counts of 31-40 words

If a repo doesn't have at least 5 records that fit the requirements above, just pull however many there are (if any).
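
A per-repository version of the request above might look like the sketch below. It is a rough illustration under stated assumptions: the `repo` key holding the repository name is hypothetical (the real records may store the source under a different field), words are counted by whitespace splitting, and `length_bin` and `sample_per_repo` are made-up helper names. Repo/bin combinations with fewer than 5 records return whatever is available, and combinations with no records are omitted.

```python
import random
from collections import defaultdict

REPOS = {"Harvard Dataverse", "Mendeley", "Zenodo"}

# Bins as written in the request; exactly 10 words is left unassigned (assumption).
BINS = [("<10", 0, 9), ("11-20", 11, 20), ("21-30", 21, 30), ("31-40", 31, 40)]


def length_bin(record):
    """Return the word-count bin label for name + description, or None."""
    text = f"{record.get('name') or ''} {record.get('description') or ''}"
    count = len(text.split())
    for label, low, high in BINS:
        if low <= count <= high:
            return label
    return None


def sample_per_repo(records, per_cell=5, seed=0):
    """Pick up to per_cell records for every (repo, bin) pair that has records."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for record in records:
        label = length_bin(record)
        # "repo" is a hypothetical field name used only for this sketch.
        if record.get("repo") in REPOS and label is not None:
            cells[(record["repo"], label)].append(record)
    return {
        cell: rng.sample(items, min(per_cell, len(items)))
        for cell, items in cells.items()
    }
```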

ZubairQazi commented 1 month ago

Sample sheet for Dataverse, Mendeley, Zenodo (20 records each): https://docs.google.com/spreadsheets/d/1jkhidFmsp0f_yL8S5wpZ-oBA-eLhQQmESQEq4Lrhx3M/edit#gid=258837660

gtsueng commented 1 month ago

Results of the length check: https://docs.google.com/spreadsheets/d/1crfLDl5_c7jZ47JefhOCf6tx_cM-u8AkK6rsXttM6s8/edit#gid=1639648736

gtsueng commented 1 month ago

This issue has been marked as pending close out and will be closed after a week if there are no additional comments

NIAID-Data-Ecosystem / nde-crawlers