Ads97 / WhatsApp-Llama

Finetune a LLM to speak like you based on your WhatsApp Conversations

Update preprocessing.py: Refactor and optimize #1

Closed: stark1tty closed this 1 year ago

stark1tty commented 1 year ago

Please check and make sure it works. It does for me (fingers crossed). I'm a hobbyist and still learning. With the original script I was only able to export one message.


What does this PR do?

Changes made:

  1. Simplified remove_placeholders method:

    • Refactored the logic to use a more concise list comprehension for checking if any placeholder phrase exists in a message.
  2. Enhanced error handling in get_user_text:

    • Introduced better exception handling with more informative print statements for cases where message splitting fails.
  3. Refactored collate_messages method:

    • Replaced the while-loop-based message aggregation with a more intuitive for-loop, eliminating the need for pointer-style management (fp, sp).
    • Removed the redundant clean_text function; its functionality is now handled inside the refactored collate_messages method.
    • Reworked the logic to correctly collate multi-line WhatsApp messages and to separate messages from different users. The refactored logic makes a single pass over the lines with no index bookkeeping.
    • Ensured that the last message in the conversation is captured; the previous implementation may have missed it.
  4. Removed the unnecessary import os.

  5. Added JSON pretty-printing:

    • Modified the JSON dump method to include indentation, which improves the readability of the output file.
  6. Improved readability:

    • Made several cosmetic changes to improve code readability and consistency, such as consistent spacing and indentation.
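
For reviewers, here is a rough sketch of the approach described in items 1, 3 and 5. It is not the actual code from preprocessing.py. The export line format (Android-style "date, time - sender: text"), the placeholder phrases, the dictionary keys and the names LINE_RE, PLACEHOLDERS, has_placeholder and dump_pretty are all assumptions made for illustration.

```python
import json
import re

# Assumed Android-style export line: "12/31/21, 10:15 PM - Alice: Hello".
# Other WhatsApp export formats would need a different pattern.
LINE_RE = re.compile(
    r"^\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}(?:\s?[AP]M)? - ([^:]+): (.*)$"
)

# Hypothetical placeholder phrases; the real list lives in preprocessing.py.
PLACEHOLDERS = ["<Media omitted>", "This message was deleted"]


def has_placeholder(text):
    """Return True if any placeholder phrase occurs in the message text."""
    return any(phrase in text for phrase in PLACEHOLDERS)


def collate_messages(lines):
    """Merge continuation lines and consecutive messages from one sender."""
    collated = []
    current = None      # message being built: {"sender": ..., "text": ...}
    skipping = False    # True while inside a dropped placeholder message
    for raw in lines:
        line = raw.rstrip("\n")
        match = LINE_RE.match(line)
        if match is None:
            # No timestamp/sender prefix: continuation of the previous message.
            if current is not None and not skipping and line.strip():
                current["text"] += "\n" + line
            continue
        sender, text = match.groups()
        skipping = has_placeholder(text)
        if skipping:
            continue
        if current is not None and current["sender"] == sender:
            # Same sender as the previous message: keep appending.
            current["text"] += "\n" + text
        else:
            if current is not None:
                collated.append(current)
            current = {"sender": sender, "text": text}
    # Flush the final message so the end of the conversation is not lost.
    if current is not None:
        collated.append(current)
    return collated


def dump_pretty(messages):
    """Serialize with indentation so the output file is easy to inspect."""
    return json.dumps(messages, indent=2, ensure_ascii=False)
```

Calling collate_messages on the lines of an exported chat and passing the result to dump_pretty gives one entry per run of messages from the same sender. The flush after the loop is what guards against dropping the last message.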


stark1tty commented 1 year ago

Needs polishing, maybe. It seems it's not formatting right. There are a lot of problems.