dimagi / open-chat-studio

A web based platform for building Chatbots backed by Large Language Models
BSD 3-Clause "New" or "Revised" License

WhatsApp data extraction pipelines are not locale aware #221

Open proteusvacuum opened 5 months ago

proteusvacuum commented 5 months ago

When I download a chat from WhatsApp following these instructions, the chat is formatted as follows:

2023-12-20, 5:46 p.m. - Farid: 👋🏼

My phone is set to English (Canada).

After setting the phone to English (South Africa) and re-downloading the file, I get the following format:

2023/12/20, 5:46 pm - Farid: 👋🏼

Feeding these files into the WhatsApp analysis pipeline, I get the following error:

  File "/home/frener/dev/dimagi/open-chat-studio/apps/analysis/steps/parsers.py", line 48, in run
    raise StepError("Unable to parse WhatsApp data")
apps.analysis.exceptions.StepError: Unable to parse WhatsApp data

I think we have to update both the regex (https://github.com/dimagi/open-chat-studio/blob/f118a62d739f4d3aa7253374c1db259b4798f001/apps/analysis/steps/parsers.py#L38) and the time formatter (https://github.com/dimagi/open-chat-studio/blob/f118a62d739f4d3aa7253374c1db259b4798f001/apps/analysis/steps/parsers.py#L57).
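For illustration, here is a rough sketch of what a more tolerant regex plus time formatter could look like. It is not the code currently in `parsers.py`; the names `LINE_RE`, `DATE_FORMATS`, `ParsedMessage`, `parse_timestamp` and `parse_line` are made up for this example. It assumes the date is either year-first or day-first with a four-digit year, and it only covers the two export formats shown above plus a 24-hour variant. Other locales (two-digit years, month-first dates, bracketed timestamps) would need more patterns.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Hypothetical pattern: "<date>, <time>[ am/pm marker] - <sender>: <message>".
# Accepts "-", "/" or "." as date separators and "pm", "p.m.", "PM", etc.
LINE_RE = re.compile(
    r"^(?P<date>\d{1,4}[-/.]\d{1,2}[-/.]\d{1,4}),\s+"
    r"(?P<time>\d{1,2}:\d{2}(?::\d{2})?)"
    r"(?:\s*(?P<ampm>[ap])\.?\s?m\.?)?"
    r"\s+-\s+(?P<sender>[^:]+):\s(?P<message>.*)$",
    re.IGNORECASE,
)

# Assumption: only year-first or day-first dates with a four-digit year.
DATE_FORMATS = ("%Y-%m-%d", "%d-%m-%Y")


class ParsedMessage(NamedTuple):
    timestamp: datetime
    sender: str
    message: str


def parse_timestamp(date_str: str, time_str: str, ampm: Optional[str]) -> datetime:
    """Try each candidate date format; apply the am/pm marker by hand."""
    normalized = re.sub(r"[/.]", "-", date_str)
    time_fmt = "%H:%M:%S" if time_str.count(":") == 2 else "%H:%M"
    for date_fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(
                f"{normalized} {time_str}", f"{date_fmt} {time_fmt}"
            )
        except ValueError:
            continue
        if ampm:
            # Convert the 12-hour clock manually so the result does not
            # depend on how the process locale spells AM/PM.
            hour = parsed.hour % 12 + (12 if ampm.lower() == "p" else 0)
            parsed = parsed.replace(hour=hour)
        return parsed
    raise ValueError(f"Unrecognised date format: {date_str!r}")


def parse_line(line: str) -> Optional[ParsedMessage]:
    """Return a ParsedMessage, or None if the line does not start a message."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    timestamp = parse_timestamp(match["date"], match["time"], match["ampm"])
    return ParsedMessage(timestamp, match["sender"], match["message"])


if __name__ == "__main__":
    for sample in (
        "2023-12-20, 5:46 p.m. - Farid: 👋🏼",  # English (Canada)
        "2023/12/20, 5:46 pm - Farid: 👋🏼",    # English (South Africa)
    ):
        print(parse_line(sample))
```

With this sketch, both sample lines from this issue parse to the same timestamp (2023-12-20 17:46) with sender `Farid`. Lines that do not match, such as continuation lines of a multi-line message, return `None` so the caller can decide how to handle them.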

snopoke commented 5 months ago

Thanks WhatsApp for using such an amazing export format!

snopoke commented 5 months ago

This https://github.com/Pustur/whatsapp-chat-parser/blob/master/src/parser.ts#L11 seems like it could be a good starting point for examples of how this is handled elsewhere.