OSU-NLP-Group / TravelPlanner

[ICML'24 Spotlight] "TravelPlanner: A Benchmark for Real-World Planning with Language Agents"
https://osu-nlp-group.github.io/TravelPlanner/
MIT License

commonsense_constraint error #18

Closed: yananchen1989 closed this issue 5 months ago

yananchen1989 commented 5 months ago

Hi team,

The code sets the temperature to zero, but the LLM's responses are still not fully deterministic. As a result, the LLM occasionally (though not often) fails to generate a valid plan.

For example, take example number 48 in the validation set, which has this query:

I require a travel itinerary for a seven-day trip beginning on March 2nd and ending on March 8th, 2022. The trip will begin in Philadelphia and involve visiting 3 cities in Virginia. The available budget for the trip is $2,900.

Most of the time, it generates a valid plan like the one below, which can be parsed into JSON format for further evaluation.

Day 1:
Current City: Philadelphia
Transportation: Self-driving from Philadelphia to Richmond
  • Duration: 4 hours 2 mins
  • Distance: 407 km
  • Cost: $20
Accommodation: Not applicable

Day 2:
Current City: Richmond
Transportation: Self-driving from Richmond to Jamestown
  • Duration: 1 hour 1 min
  • Distance: 92.4 km
  • Cost: $4
Accommodation: Not applicable

Day 3:
Current City: Jamestown
Attraction: Historic Jamestowne
  • Address: 1368 Colonial Nat'l Historical Pkwy, Jamestown, VA 23081, USA
  • Phone: (757) 856-1250
  • Website: http://www.historicjamestowne.org/
Lunch: Beyond Breads
  • Average Cost: $95
Dinner: Mama's Fish House
  • Average Cost: $63
Accommodation: Not applicable

Day 4:
Current City: Jamestown
Transportation: Self-driving from Jamestown to Charlottesville
  • Duration: 2 hours 2 mins
  • Distance: 206 km
  • Cost: $10
Accommodation: Not applicable

Day 5:
Current City: Charlottesville
Attraction: Monticello
  • Address: 1050 Monticello Loop, Charlottesville, VA 22902, USA
  • Phone: (434) 984-9800
  • Website: https://www.monticello.org/
Lunch: Mama's Fish House
  • Average Cost: $63
Dinner: Restaurant Andre
  • Average Cost: $250
Accommodation: Not applicable

Day 6:
Current City: Charlottesville
Transportation: Self-driving from Charlottesville to Philadelphia
  • Duration: 4 hours 24 mins
  • Distance: 411 km
  • Cost: $20
Accommodation: Not applicable

Day 7:
Current City: Philadelphia
Accommodation: Not applicable

However, occasionally, under exactly the same prompt, it generates:

Apologies, but it seems that the budget provided is insufficient to cover the travel expenses for a seven-day trip involving multiple cities. If you could provide a higher budget, I would be more than happy to assist you in creating a detailed travel itinerary.

When this response is parsed by the LLM, the resulting JSON is: {"error": "Insufficient budget provided"}

In this case, it causes an error during evaluation, as the traceback shows:

File "C:\Users\ITSupp\Downloads\codes\TravelPlanner\tools\planner\sole_planning.py", line 169, in <module>
    scores, detailed_scores = eval_score(args.set_type, tested_plans)
File "C:\Users/ITSupp/Downloads/codes/TravelPlanner/evaluation\eval.py", line 80, in eval_score
    commonsense_info_box = commonsense_eval(query_data,tested_plan['plan'])
File "C:\Users/ITSupp/Downloads/codes/TravelPlanner/evaluation\commonsense_constraint.py", line 523, in evaluation
    return_info['is_reasonalbe_visiting_city'] = is_reasonalbe_visiting_city(query_data, tested_data)
File "C:\Users/ITSupp/Downloads/codes/TravelPlanner/evaluation\commonsense_constraint.py", line 134, in is_reasonalbe_visiting_city
    city_value = tested_data[i]['current_city']

Is there any suggestion for avoiding this error? Perhaps, for cases like this, the evaluation could directly count them as zero delivery? Thanks.
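To make the "count as zero delivery" idea concrete, here is a minimal sketch of a shape check that could run before the evaluation. It is only an illustration: `is_well_formed_plan` and `delivered` are hypothetical names that do not exist in the repository, and the assumption that a valid plan is a non-empty list of per-day dicts with a `current_city` key is inferred solely from the `tested_data[i]['current_city']` access in the traceback.

```python
from typing import Any


def is_well_formed_plan(plan: Any) -> bool:
    """Return True if `plan` looks like a non-empty list of per-day dicts.

    Assumption (not confirmed by the repository): every day entry is a dict
    that carries a 'current_city' key, as the traceback suggests.
    """
    if not isinstance(plan, list) or not plan:
        return False
    return all(isinstance(day, dict) and "current_city" in day for day in plan)


def delivered(plan: Any) -> int:
    """Hypothetical helper: 1 if the plan can be evaluated, 0 otherwise."""
    return 1 if is_well_formed_plan(plan) else 0


# A parsed plan in the assumed shape, and the error object from this issue.
good_plan = [{"current_city": "Philadelphia"}, {"current_city": "Richmond"}]
bad_plan = {"error": "Insufficient budget provided"}

print(delivered(good_plan), delivered(bad_plan))
```

A check like this only catches the one failure mode described here; other malformed outputs may pass it and still fail later in the evaluation.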

hsaest commented 5 months ago

Hi,

This issue also exists in our experiments. However, the best solution we've come up with is to manually revise these instances. Of course, directly counting them as zero is a simpler method. However, identifying these exceptions poses a challenge since the exceptions arising during parsing by LLMs vary greatly. Given the rigorous nature of this research work, we cannot afford to assign a score of zero every time an exception occurs, especially considering the variety of other exceptions that might arise.
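As a rough sketch of how the manual-revision workflow could be supported without silently scoring failures as zero, the loop below catches the exception for each plan and records its index for later inspection. This is an assumption-laden illustration, not the repository's code: `evaluate_with_review_queue` and `toy_evaluate` are hypothetical names, and `toy_evaluate` merely imitates the `tested_data[i]['current_city']` access seen in the traceback.

```python
from typing import Any, Callable, Dict, List, Tuple


def evaluate_with_review_queue(
    plans: List[Any], evaluate: Callable[[Any], Any]
) -> Tuple[Dict[int, Any], List[Tuple[int, str]]]:
    """Evaluate each plan; collect failures for manual revision.

    Returns (results keyed by plan index, list of (index, error description)).
    """
    results: Dict[int, Any] = {}
    needs_review: List[Tuple[int, str]] = []
    for idx, plan in enumerate(plans):
        try:
            results[idx] = evaluate(plan)
        except (KeyError, IndexError, TypeError) as exc:
            # Do not assign a score here; just remember which plan to inspect.
            needs_review.append((idx, f"{type(exc).__name__}: {exc}"))
    return results, needs_review


def toy_evaluate(plan: Any) -> List[str]:
    """Stand-in evaluator (hypothetical) that indexes the plan day by day."""
    return [plan[i]["current_city"] for i in range(len(plan))]


plans = [
    [{"current_city": "Philadelphia"}],
    {"error": "Insufficient budget provided"},
]
results, needs_review = evaluate_with_review_queue(plans, toy_evaluate)
print(results)
print(needs_review)
```

The design choice is to separate detection from scoring: the run finishes, and the flagged indices can then be revised by hand as described above.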

yananchen1989 commented 5 months ago

@hsaest Very helpful, that makes sense. Thanks.