NIAID-Data-Ecosystem / nde-crawlers

Harvesting infrastructure to collect and standardize dataset and computational tool metadata
Apache License 2.0

[Metadata Improvement]: Identify length limitations for ChatGPT inference #136

Closed gtsueng closed 1 month ago

gtsueng commented 2 months ago

Issue Name

Identify length limitations for ChatGPT inference

Issue Description

ChatGPT is more likely to hallucinate results for measurementTechnique extraction when the combined 'name' + 'description' text is short. Initial observations by Zubair suggest that, while this is highly dependent on the actual text, 15 words seems to be the minimum.
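For illustration, a minimal sketch of how the length check could be applied before sending a record to GPT. It assumes records are plain dicts with `name` and `description` keys and that "words" means whitespace-separated tokens; the function names and the `MIN_WORDS` constant are hypothetical, and 15 is only the tentative value from the observations above, not a validated cutoff.

```python
# Tentative threshold from the initial observations; not a validated value.
MIN_WORDS = 15


def combined_word_count(record: dict) -> int:
    """Count whitespace-separated words in 'name' + 'description'.

    Missing or None fields are treated as empty strings.
    """
    text = " ".join(str(record.get(field) or "") for field in ("name", "description"))
    return len(text.split())


def is_long_enough(record: dict, min_words: int = MIN_WORDS) -> bool:
    """Return True if the record has at least `min_words` words of text."""
    return combined_word_count(record) >= min_words


if __name__ == "__main__":
    short_record = {"name": "RNA-seq of mouse liver", "description": None}
    print(combined_word_count(short_record), is_long_enough(short_record))
```

Records that fail the check could be skipped or flagged for manual review rather than passed to the model.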

To validate or otherwise improve this estimate:

Issue Discussion

A 50/50 version of this approach was discussed at the internal meeting dated 2024.04.24.

Please select the type of metadata improvement

Meta URL

No response

Related WBS task

https://github.com/NIAID-Data-Ecosystem/nde-roadmap/issues/13

For internal use only. Assignee, please select the status of this issue

Status Description

No response

Request status check list

ZubairQazi commented 2 months ago

Sample spreadsheet:

https://docs.google.com/spreadsheets/d/1lG4hS-PQJ_IRxCc02W3Oz2OFMdYDMURRPbpm3Jg2kk0/edit#gid=1253034259

gtsueng commented 1 month ago

Moved here to keep measurementTechniques separate from topicCategories: https://docs.google.com/spreadsheets/d/1jkhidFmsp0f_yL8S5wpZ-oBA-eLhQQmESQEq4Lrhx3M/edit#gid=1969630319

gtsueng commented 1 month ago

@ZubairQazi it looks like the thresholds may vary by repo, but I'm not 100% sure since OMICS-DI records outnumber everything else. For the following repositories:

Can you run GPT for measurementTechnique extraction for the following:

If a repo doesn't have at least 5 records that fit the requirements above, just pull however many there are (if any).
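A rough sketch of the "up to 5 per repo" pull, in case it helps. Everything here is an assumption for illustration: records are dicts, the repository name lives in a hypothetical `repo` key, and `predicate` stands in for whatever the selection requirements are (they are not reproduced in this sketch).

```python
import random
from collections import defaultdict


def sample_per_repo(records, per_repo=5, predicate=lambda record: True, seed=0):
    """Return up to `per_repo` randomly chosen matching records per repository.

    `predicate` is a placeholder for the selection requirements. Repositories
    with fewer than `per_repo` matches contribute all of their matches;
    repositories with no matches are absent from the result.
    """
    by_repo = defaultdict(list)
    for record in records:
        if predicate(record):
            by_repo[record["repo"]].append(record)  # "repo" is a hypothetical key

    rng = random.Random(seed)  # fixed seed so the pull is reproducible
    return {
        repo: rng.sample(items, min(per_repo, len(items)))
        for repo, items in by_repo.items()
    }
```

Passing a length-based predicate (for example, records under a given word count) would restrict the pull to the short records of interest.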

ZubairQazi commented 1 month ago

Sample sheet for Dataverse, Mendeley, and Zenodo (20 records each): https://docs.google.com/spreadsheets/d/1jkhidFmsp0f_yL8S5wpZ-oBA-eLhQQmESQEq4Lrhx3M/edit#gid=258837660

gtsueng commented 1 month ago

Results of the length check: https://docs.google.com/spreadsheets/d/1crfLDl5_c7jZ47JefhOCf6tx_cM-u8AkK6rsXttM6s8/edit#gid=1639648736

gtsueng commented 1 month ago

This issue has been marked as pending close-out and will be closed after a week if there are no additional comments.