Podcastindex-org / database


podcastGuid is not unique in the db dump (single id maps to multiple different podcasts) #30

Open AmitAronovitch opened 1 year ago

AmitAronovitch commented 1 year ago

I do not know whether this is a problem in the dump, in the actual assignment of the GUIDs, or in my understanding of the data, but the db seems to contain a lot of duplicated podcastGuid values.

sqlite> SELECT COUNT(podcastGuid) number, podcastGuid FROM podcasts GROUP BY podcastGuid ORDER BY number DESC LIMIT 10;
4403|c9c7bad3-4712-514e-9ebd-d1e208fa1b76
169|
84|d9e6a1f6-b3cb-52f8-b4d6-55ae407eb310
68|cb7f498e-3b27-5d94-b342-125314350f98
62|be6f0528-aa42-5049-8198-7ae186dd71d8
61|88d3c2be-c761-5b0d-af98-3f9529fada36
56|768f6d92-769e-5890-9e18-cf35dbb1fbe9
54|f15e059b-d30f-5fbc-a2cd-076260c065a6
52|4749488e-b530-5e96-9ac8-d73d6939a04a
44|31b9658a-eebc-5c9f-9e0d-86adb2473793
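To size the duplication beyond the top 10, a rough sketch like the one below could count how many non-empty GUIDs are shared by more than one row, and how many rows carry such a GUID. It runs against a toy in-memory table with made-up rows; the helper name `duplicate_guid_summary` is hypothetical, and only the `podcasts` table and `podcastGuid` column come from the dump.

```python
import sqlite3

# Toy stand-in for the dump. The rows and GUID values are made up for
# illustration; only the table and column names come from the queries above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE podcasts (id INTEGER PRIMARY KEY, title TEXT, podcastGuid TEXT)"
)
conn.executemany(
    "INSERT INTO podcasts (id, title, podcastGuid) VALUES (?, ?, ?)",
    [
        (1, "Show A", "guid-1"),
        (2, "Show B", "guid-1"),
        (3, "Show C", "guid-1"),
        (4, "Show D", "guid-2"),
        (5, "Show E", ""),  # empty GUIDs are ignored below
        (6, "Show F", ""),
    ],
)


def duplicate_guid_summary(conn):
    """Return (shared GUID count, rows carrying a shared GUID)."""
    return conn.execute(
        """
        SELECT COUNT(*), COALESCE(SUM(n), 0)
        FROM (
            SELECT COUNT(*) AS n
            FROM podcasts
            WHERE podcastGuid IS NOT NULL AND podcastGuid != ''
            GROUP BY podcastGuid
            HAVING COUNT(*) > 1
        )
        """
    ).fetchone()


shared_guids, affected_rows = duplicate_guid_summary(conn)
print(shared_guids, affected_rows)
```

Pointing `sqlite3.connect` at the real dump file instead of `:memory:` (and dropping the toy inserts) should give the same summary for the actual data, assuming the schema matches.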

In particular, the first podcastGuid in this list seems to be repeatedly assigned to many different shows...

sqlite> SELECT id, itunesId, createdOn, title FROM podcasts WHERE podcastGuid='c9c7bad3-4712-514e-9ebd-d1e208fa1b76' ORDER BY createdOn DESC LIMIT 10;
6412704||1685802762|Buster Brown – Retro Radio Podcast
6412244|1688243540|1685770796|The Real Modern Family
6412119||1685762506|Archaeology Archives – The British History Podcast
6411786||1685742788|Environment – WFHB
6411527||1685725571|Premium Archives | IBCD
6411415||1685718156|12 months of mike – Dystopian Dance Party
6411412||1685718151|jheri curl june – Dystopian Dance Party
6410893||1685678977|booking Archives - Ranking Family Records
6410806||1685674807|Hormones Archives – Green Wisdom Health
6410251||1685641259|Relegation Archives - Learn English Through Football

Note that the latest entries here date to 3/6/2023 (day/month/year; the createdOn field, converted from a Unix timestamp), which is just 1 day before I collected this db dump. The oldest one dates to 7/8/2020; I am not sure what that means.
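For reference, the date conversion can be reproduced with a small sketch like this one. It assumes createdOn holds seconds since the Unix epoch and reads it in UTC; a local time zone could shift the result by a day.

```python
import sqlite3
from datetime import datetime, timezone

created_on = 1685802762  # createdOn of the most recent row listed above

# Python: interpret the value as seconds since the Unix epoch, in UTC.
as_utc = datetime.fromtimestamp(created_on, tz=timezone.utc)
print(as_utc.date().isoformat())

# SQLite: the same conversion can be done inside a query.
conn = sqlite3.connect(":memory:")
(sql_date,) = conn.execute(
    "SELECT date(?, 'unixepoch')", (created_on,)
).fetchone()
print(sql_date)
```

In the sqlite shell, the equivalent would be to select `date(createdOn, 'unixepoch')` alongside the other columns.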

AmitAronovitch commented 1 year ago

It looks like 657 of these also have an itunesId. Here are the 10 most recent ones:

sqlite> SELECT id, itunesId, createdOn, title FROM podcasts WHERE podcastGuid='c9c7bad3-4712-514e-9ebd-d1e208fa1b76' AND itunesId != '' ORDER BY createdOn DESC LIMIT 10;
6412244|1688243540|1685770796|The Real Modern Family
6393756|1688568693|1684828983|Musica
6393526|1660502054|1684808713|Geek Grills
6351312|1685701690|1683088752|STEPHANIE MILLER SHOW
6347422|1665003735|1682961552|Hope FM UK
6223448|1670155999|1679069953|On The Record
6096235|1675550917|1678274167|Tienda Online Invitada | Ropa y Accesorios | Flow112
6056478|1673263286|1677119877|ポッドキャストでファンづくり
6046057|1672761458|1676772414|James Whale
6044088|1672136346|1676688301|Story All The Way Down
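One way to check that these rows really are different shows, rather than the same show listed repeatedly, is to compare the number of rows that have an itunesId with the number of distinct itunesId values. The sketch below does that on a toy in-memory table with made-up rows; the helper name `itunes_id_counts` is hypothetical, and since I am not sure whether a missing itunesId is stored as NULL or as an empty string in the dump, the filter excludes both.

```python
import sqlite3

# Toy stand-in for the dump; rows and values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE podcasts (id INTEGER PRIMARY KEY, itunesId INTEGER, podcastGuid TEXT)"
)
conn.executemany(
    "INSERT INTO podcasts (id, itunesId, podcastGuid) VALUES (?, ?, ?)",
    [
        (1, 111, "guid-1"),
        (2, 222, "guid-1"),
        (3, None, "guid-1"),  # no itunesId: excluded from both counts
        (4, 222, "guid-1"),   # repeats the itunesId of row 2
        (5, 333, "guid-2"),
    ],
)


def itunes_id_counts(conn, guid):
    """Return (rows with an itunesId, distinct itunesId values) for one GUID."""
    return conn.execute(
        """
        SELECT COUNT(itunesId), COUNT(DISTINCT itunesId)
        FROM podcasts
        WHERE podcastGuid = ?
          AND itunesId IS NOT NULL
          AND itunesId != ''
        """,
        (guid,),
    ).fetchone()


rows_with_id, distinct_ids = itunes_id_counts(conn, "guid-1")
print(rows_with_id, distinct_ids)
```

If the two numbers are equal on the real dump, every row with an itunesId under that GUID points to a different iTunes listing.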