WFO-ID-pilots / text2matrix


Developing desc2matrix.py to convert species descriptions to standardised format #7

Closed yjkiwilee closed 1 month ago

yjkiwilee commented 2 months ago

@nickynicolson

I'm writing a test script (desc2matrix.py) to convert plant descriptions to a standardised data format containing the characteristic values using Ollama. I'm hoping to eventually include this within the Makefile to automatically convert all or a subset of the descriptions.

The script currently asks the 'desc2matrix' model (adapted from llama3, but with reduced temperature) to convert two species descriptions into a single JSON.

Please see #8 for a related issue.

yjkiwilee commented 2 months ago

Structure

Can you switch your program to use the argparse library for command-line arguments, so that inputs such as the temperature, model, system and user prompts, and the botanical descriptions can be passed in as command-line arguments (either in full or as files)?

ollama API

Previously I have used client.generate, which gives you the option of explicitly passing in the system prompt as an argument.

Hi @nickynicolson, thank you for your comment! I've edited the script so that it handles command-line arguments, and updated it to use client.generate instead of client.chat. I've also specified the seed so that the model consistently generates the same response.
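For anyone following along, here is a minimal sketch of how argparse and client.generate could fit together. It is not the actual contents of desc2matrix.py: the argument names (descfile, --model, --temperature, --seed, --sysprompt), their defaults, and the helper functions are all hypothetical. The ollama import sits inside run() so the rest of the sketch works without the package installed.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a command-line parser; every argument name here is hypothetical."""
    parser = argparse.ArgumentParser(
        description="Convert species descriptions to JSON with an Ollama model"
    )
    parser.add_argument("descfile", help="file containing the species descriptions")
    parser.add_argument("--model", default="llama3", help="name of the Ollama model")
    parser.add_argument("--temperature", type=float, default=0.1,
                        help="sampling temperature passed to the model")
    parser.add_argument("--seed", type=int, default=1,
                        help="fixed seed so repeated runs give the same response")
    parser.add_argument("--sysprompt", default="You are a botanist.",
                        help="system prompt text")
    return parser


def generate_kwargs(args: argparse.Namespace, description: str) -> dict:
    """Collect the keyword arguments for client.generate in one place."""
    return {
        "model": args.model,
        "system": args.sysprompt,  # system prompt passed explicitly
        "prompt": description,
        "options": {"temperature": args.temperature, "seed": args.seed},
    }


def run(args: argparse.Namespace, description: str) -> str:
    """Send one description to a locally running Ollama server."""
    import ollama  # third-party package, imported lazily

    client = ollama.Client()
    response = client.generate(**generate_kwargs(args, description))
    return response["response"]
```

With this layout, `run(build_parser().parse_args(), text)` would be the whole entry point, and the seed and temperature travel together in the `options` dictionary.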

yjkiwilee commented 2 months ago

@nickynicolson desc2matrix.py can now take in description output files from dwca2csv.py. You can check the arguments that I have added; one of them is the '--mode' option, which you can use to choose between 'desc2json' (description to JSON in a single LLM run) and 'desc2list2json' (description to JSON across two LLM runs). The output is a list of JSON objects, one per description, each with 'coreid' and 'original_description' for the taxon id and the original description, 'char_list' for the LLM-generated bullet-pointed list string, and 'char_json' where the JSON-ified characteristics are stored. Each characteristic is formatted as {'characteristic':'', 'value':''}. See below for an up-to-date description of the output file structure.
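To illustrate the difference between the two modes, here is a rough sketch rather than the script's real code. The function name `describe`, the `run_llm` callable, and the prompt wording are all placeholders I made up for this example; `run_llm` stands in for whatever function sends a prompt to the model and returns its text.

```python
from typing import Callable, Optional, Tuple


def describe(desc: str, mode: str,
             run_llm: Callable[[str], str]) -> Tuple[Optional[str], str]:
    """Return (char_list, raw_json_text) for one description.

    run_llm is a hypothetical callable: prompt text in, model text out.
    The prompts below are placeholders, not the prompts used in the script.
    """
    if mode == "desc2json":
        # Single LLM run: description straight to JSON, no intermediate list.
        char_list = None
        raw_json = run_llm("Convert this description to JSON:\n" + desc)
    elif mode == "desc2list2json":
        # First run: description to a bullet-pointed list of characteristics.
        char_list = run_llm("List the characteristics in this description:\n" + desc)
        # Second run: the bullet-pointed list to JSON.
        raw_json = run_llm("Convert this list to JSON:\n" + char_list)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return char_list, raw_json
```

Passing the LLM call in as a function keeps the control flow of the two modes easy to test without a running model.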

I tried including a request to justify each characteristic against the given rules, but that produced output that was too inconsistent and didn't necessarily improve the performance. I will likely come back to this idea at a later stage when I'm fine-tuning the model, but please let me know if you do any prompt engineering using llama3 yourself. In general, feel free to make more change requests or ask any questions. Thank you!

yjkiwilee commented 1 month ago

Each description in the output file from desc2matrix.py is now structured as follows:

{
  "coreid": WFO taxon id,
  "status": one of "success", "bad_structure" (LLM output is valid JSON but has bad structure), "invalid_json",
  "original_description": original description text in the input description file,
  "char_list": bullet-pointed list produced by LLM if "desc2list2json" mode is chosen,
  "char_json": [
    {"characteristic": characteristic name, "value": characteristic value},
    ...
  ]
}
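As a sketch of how the three "status" values could be assigned, the function below classifies a raw LLM response. This is an illustration under my own assumptions, not the check in desc2matrix.py: the name `classify_output` is hypothetical, and I am assuming "good structure" means a list of objects with exactly the keys "characteristic" and "value".

```python
import json
from typing import Optional, Tuple


def classify_output(raw: str) -> Tuple[str, Optional[list]]:
    """Return (status, parsed characteristics) for one raw LLM response."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # The response is not parseable as JSON at all.
        return "invalid_json", None

    # Assumed structure rule: a list of {"characteristic": ..., "value": ...}.
    if not isinstance(parsed, list):
        return "bad_structure", None
    for item in parsed:
        if not (isinstance(item, dict)
                and set(item) == {"characteristic", "value"}):
            return "bad_structure", None

    return "success", parsed
```

For example, `classify_output('{"a": 1}')` gives `("bad_structure", None)` because the response is valid JSON but not a list of characteristics.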

An example JSON output file, generated by running python3 desc2matrix.py data/solanaceae-desc.txt test_scripts/desc2list2json_out.json --mode=desc2list2json --desctype=general --spnum=4 can be found here: desc2list2json_out.json

yjkiwilee commented 1 month ago

@nickynicolson I've added desc2matrix_manual.md to serve as a user manual for desc2matrix.py. Please refer to this Markdown document for future changes to its operation.

yjkiwilee commented 1 month ago

Pull request closed as these scripts have been replaced by those in this pull request (https://github.com/WFO-ID-pilots/text2matrix/pull/15).