agrignard / WhatsNext


Remove duplicate events from different sources #11

Closed tnguyenh closed 9 months ago

tnguyenh commented 10 months ago

We have to define the process to remove duplicates.

For places whose dedicated site is scraped by scrapex (e.g. Transbordeur), we can assume that events are fully and correctly listed there. Unless some other site provides better info, I propose to skip events from other sources: e.g., events from Petit Bulletin for Transbordeur will be skipped.

For places with no dedicated site scraped, a priority order and a way to process the events should be defined.
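The proposed rule could be sketched as follows. This is only an illustration, not the project's code: the event fields (`name`, `place`, `source`), the `DEDICATED_SOURCES` mapping, and the function name are all assumptions made for the example.

```python
# Hypothetical mapping: venue -> the source assumed to be authoritative for it.
DEDICATED_SOURCES = {"Transbordeur": "Transbordeur"}


def skip_secondary_sources(events, dedicated=DEDICATED_SOURCES):
    """Drop events for a venue that has a dedicated source when they come from elsewhere."""
    kept = []
    for event in events:
        preferred = dedicated.get(event["place"])
        # Keep the event if the venue has no dedicated source,
        # or if the event comes from that dedicated source.
        if preferred is None or event["source"] == preferred:
            kept.append(event)
    return kept


# Made-up sample data, for illustration only.
events = [
    {"name": "Concert A", "place": "Transbordeur", "source": "Transbordeur"},
    {"name": "Concert A", "place": "Transbordeur", "source": "Petit Bulletin"},
    {"name": "Concert B", "place": "Some venue", "source": "Petit Bulletin"},
]
filtered = skip_secondary_sources(events)
```

Here the Petit Bulletin entry for Transbordeur is dropped, while the event at a venue with no dedicated source is kept and still needs a separate priority rule.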

tnguyenh commented 9 months ago

Some issues for merging:

not exactly the same name: [image] [image]

not the same time: [image]
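One possible way to soften both mismatches is to normalize names before comparing them and to accept a small time difference. The sketch below is an assumption-laden illustration: the helper names and the two-hour tolerance are hypothetical and not taken from the project.

```python
import unicodedata
from datetime import datetime, timedelta


def normalize_name(name):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_accents.lower())
    return " ".join(cleaned.split())


def close_in_time(a, b, tolerance=timedelta(hours=2)):
    """Treat two start times as matching when they differ by at most `tolerance` (assumed value)."""
    return abs(a - b) <= tolerance


same_name = normalize_name("Marché  Gare!") == normalize_name("marche gare")
same_slot = close_in_time(datetime(2024, 2, 14, 20, 0), datetime(2024, 2, 14, 20, 30))
```

Normalization only handles cosmetic differences (case, accents, punctuation); names that differ in wording still need a similarity measure.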

agrignard commented 9 months ago

How can I help here?

agrignard commented 9 months ago

I did a big pass over the generated events to identify the problems with the places. In my opinion there is a discussion to be had in line with this issue https://github.com/agrignard/WhatsNext/issues/18 and especially this one, to try to eliminate as many as possible of the events from places that have almost the same name: https://github.com/agrignard/WhatsNext/issues/17

tnguyenh commented 9 months ago

done

agrignard commented 9 months ago

Well done!! I'm curious to know which event you keep among the duplicates. Nice work with the regex, in any case!

agrignard commented 9 months ago

I am checking day by day, and I noticed a duplicate for Marché Gare for this event on 14/02/2024: https://marchegare.fr/agenda/lankum

tnguyenh commented 9 months ago

Yep. One problem is when the event name is not the same. I can find three references to the same event:

The big issue with the scraping is dealing with errors and inconsistencies from other sites.

Until now, the merge process compares the event name strings and finds similarities.
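A name-similarity comparison of this kind might look like the sketch below. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; the thread does not say which technique the merge actually uses, and the field names, the 0.8 threshold, and the "Lankum + guest" title are hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the lowercased names are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def likely_duplicates(a, b, threshold=0.8):
    """Flag two events as duplicates: same place, same date, similar names (assumed rule)."""
    return (
        a["place"] == b["place"]
        and a["date"] == b["date"]
        and name_similarity(a["name"], b["name"]) >= threshold
    )


# Made-up records for illustration only.
first = {"name": "Lankum", "place": "Marché Gare", "date": "2024-02-14"}
second = {"name": "LANKUM", "place": "Marché Gare", "date": "2024-02-14"}
third = {"name": "Lankum + guest", "place": "Marché Gare", "date": "2024-02-14"}

exact_match = likely_duplicates(first, second)
fuzzy_match = likely_duplicates(first, third)
```

With this threshold the case-only difference is caught, but the longer title falls below it, which illustrates why events listed under noticeably different names can slip through a purely string-based merge.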

agrignard commented 9 months ago

Concert de l'hostel Dieu vs Fugacités at la Rayonne on 14/2/2024 appears twice.