Hi,
This issue also exists in our experiments. The best solution we've come up with is to manually revise these instances. Directly counting them as zero is of course simpler, but identifying these exceptions is a challenge, since the exceptions that arise when the LLM parses a plan vary greatly. Given the rigorous nature of this research work, we cannot afford to assign a score of zero every time an exception occurs, especially considering the variety of other exceptions that might arise.
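If it helps, here is a minimal sketch of that triage step, assuming the raw model outputs are available as plain strings; `triage_outputs` and the `"error"`-key check are assumptions for illustration, not part of the TravelPlanner codebase:

```python
import json

def triage_outputs(raw_outputs):
    """Split raw LLM outputs into parsed plans and instances to revise by hand."""
    parsed, needs_review = [], []
    for idx, raw in enumerate(raw_outputs):
        try:
            plan = json.loads(raw)
        except json.JSONDecodeError:
            # Unparsable output: flag for manual revision instead of auto-zeroing.
            needs_review.append((idx, raw))
            continue
        # Outputs like {"error": "..."} parse fine but are not plans.
        if isinstance(plan, dict) and "error" in plan:
            needs_review.append((idx, raw))
        else:
            parsed.append((idx, plan))
    return parsed, needs_review
```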
@hsaest very helpful, makes sense. Thanks.
hi team.
As you can see in the code, the temperature is set to zero, but the LLM's responses are still not fully deterministic. Therefore, sometimes, though not often, the LLM does not generate a valid plan.
For example, in example #48 of the validation set, with the query:
Most of the time it generates a valid plan, like the one below, which can be parsed into JSON format for further evaluation.
However, a few times, under exactly the same prompt, it generates
and when parsed by the LLM, the resulting JSON is:
{"error": "Insufficient budget provided"}
So in this case it causes an error in eval.py.
Is there any suggestion for avoiding this error? Maybe for this type of case, the eval system could directly count it as 0 delivery? Thanks.
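For concreteness, here is a minimal sketch of the fallback I have in mind, where any output that fails to parse, or parses to an error object like the one above, scores 0 delivery; `evaluate_plan` is a hypothetical stand-in for the real scoring logic in eval.py, not its actual API:

```python
import json

def safe_delivery_score(raw_output, evaluate_plan):
    """Return the delivery score, or 0 when the output is not a valid plan."""
    try:
        plan = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0  # malformed output: count as non-delivered
    if isinstance(plan, dict) and "error" in plan:
        return 0  # e.g. {"error": "Insufficient budget provided"}
    return evaluate_plan(plan)
```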