Because of limitations in the Power BI editor, the data was first cleaned and transformed in Python, which is more efficient and versatile for this work, and then loaded into Power BI.
As a first pass, the description of each GIF was split into a subject and an action using the re module in Python.
The code extracts the subject (usually the first one or two words) and treats the remaining part of the description as the action. The extraction logic is a simple heuristic based on regular expressions and still needs improvement.
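For illustration, a heuristic of this kind could look like the sketch below. It is not the code used for the attached dataset. The function name split_description, the pattern, and the rule for choosing between one and two words (a leading "a", "an" or "the" pulls the next word into the subject) are all assumptions made for this example.

```python
import re

# Hypothetical pattern (an assumption, not the original code):
# - subject: an optional leading article ("a", "an", "the") plus one word
# - action: whatever remains of the description
SUBJECT_ACTION = re.compile(
    r"^\s*(?P<subject>(?:(?:a|an|the)\s+)?\S+)\s*(?P<action>.*?)\s*$",
    re.IGNORECASE,
)


def split_description(description):
    """Return (subject, action) for one GIF description.

    Empty or whitespace-only descriptions give ("", "").
    """
    match = SUBJECT_ACTION.match(description)
    if match is None:
        return "", ""
    return match.group("subject"), match.group("action")


if __name__ == "__main__":
    # Made-up descriptions, only to show the shape of the result.
    for text in ["A cat jumps over a fence", "Dog chases ball"]:
        print(split_description(text))
```

A rule like this breaks on subjects longer than two words (for example "A small brown dog runs"), which is the kind of case where the heuristic needs improvement.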
Natural Language Processing (NLP) libraries are being considered as a way to separate the subject and action from each GIF description more reliably.
- Attached to this task you will find the updated dataset that was loaded into Power BI: cleaned_gif_data (2).csv