dodeeric / langchain-ai-assistant-with-hybrid-rag

The agentic (agent) version of this assistant is available here: https://github.com/dodeeric/ragai-agent
GNU General Public License v3.0

Scraping Commons categories no longer works for many files: text field empty! #102

Closed dodeeric closed 3 months ago

dodeeric commented 3 months ago

The name of the class used as scraping filter has changed on Wikimedia Commons, and it is not always the same from page to page...


In the json file (nok):

https://commons.wikimedia.org/wiki/File:Accueil_de_la_princesse_St%C3%A9phanie_de_Belgique_par_le_bourgmestre_lors_de_son_mariage_avec_l%27archiduc_Rodolphe_d%27Autriche_le_10_mai_1881-The_Graphic.jpg ==> NOK

Because the URL contains 1881-The instead of 1881_-_The

The correct URL, copied and pasted from the browser (OK):

https://commons.wikimedia.org/wiki/File:Accueil_de_la_princesse_St%C3%A9phanie_de_Belgique_par_le_bourgmestre_lors_de_son_mariage_avec_l%E2%80%99archiduc_Rodolphe_d%E2%80%99Autriche_le_10_mai_1881.jpg ==> OK

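==> %27 is the straight apostrophe ('), while %E2%80%99 is the typographic apostrophe (’), so the two are different characters in the URL.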
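The difference between the two encodings can be checked with the Python standard library. This is only a minimal sketch: the helper name `apostrophe_kinds` and the two URL fragments (cut from the URLs above) are chosen here for illustration and are not part of the scraper.

```python
import unicodedata
from urllib.parse import unquote


def apostrophe_kinds(url):
    """Return the Unicode names of the apostrophe-like characters in a percent-encoded URL."""
    decoded = unquote(url)  # percent-escapes are decoded as UTF-8 by default
    return {unicodedata.name(ch) for ch in decoded if ch in "'\u2019"}


# Fragments of the two URLs quoted above (illustrative excerpts, not full URLs).
nok_fragment = "l%27archiduc_Rodolphe_d%27Autriche"
ok_fragment = "l%E2%80%99archiduc_Rodolphe_d%E2%80%99Autriche"

print(apostrophe_kinds(nok_fragment))
print(apostrophe_kinds(ok_fragment))
# The decoded titles are not equal, so they point to different file pages.
print(unquote(nok_fragment) == unquote(ok_fragment))
```

Because the decoded strings differ, a URL written with one kind of apostrophe cannot be used to reach a file whose title uses the other.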


Category: Northcliffe Beach

1) Text field NOK (empty)

{ "url": "https://commons.wikimedia.org/wiki/File:Northcliffe_Beach,_Surfers_Paradise,_Queensland_08.jpg", "metadata": { "og:image": "https://upload.wikimedia.org/wikipedia/commons/thumb/6/6c/Northcliffe_Beach%2C_Surfers_Paradise%2C_Queensland_08.jpg/640px-Northcliffe_Beach%2C_Surfers_Paradise%2C_Queensland_08.jpg", "og:image:width": "640", "og:image:height": "480", "og:title": "File:Northcliffe Beach, Surfers Paradise, Queensland 08.jpg - Wikimedia Commons", "og:type": "website" }, "text": "" },

2) Text field OK

{ "url": "https://commons.wikimedia.org/wiki/File:Northcliffe_SLSC,_Northcliffe_Beach,_Surfers_Paradise,_Queensland_04.jpg", "metadata": { "og:image": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/43/Northcliffe_SLSC%2C_Northcliffe_Beach%2C_Surfers_Paradise%2C_Queensland_04.jpg/640px-Northcliffe_SLSC%2C_Northcliffe_Beach%2C_Surfers_Paradise%2C_Queensland_04.jpg", "og:image:width": "640", "og:image:height": "480", "og:title": "File:Northcliffe SLSC, Northcliffe Beach, Surfers Paradise, Queensland 04.jpg - Wikimedia Commons", "og:type": "website" }, "text": "\n\n\nDescriptionNorthcliffe SLSC, Northcliffe Beach, Surfers Paradise, Queensland 04.jpg\n\nEnglish: Northcliffe Surf Lifesaving Club Watch Tower, Northcliffe Beach, Surfers Paradise, Queensland\n\n\nDate\n\n11 August 2013, 12:43:53\n\n\nSource\n\nOwn work\n\n\nAuthor\n\nKgbo\n\n\n"

dodeeric commented 3 months ago

[ { "url": "https://commons.wikimedia.org/wiki/File:25e_anniversaire_de_l%27inauguration_du_roi_L%C3%A9opold_Ier,_Palais_de_la_Nation,_juillet_1856.jpg", "metadata": { "og:image": "https://upload.wikimedia.org/wikipedia/commons/thumb/2/27/25e_anniversaire_de_l%27inauguration_du_roi_L%C3%A9opold_Ier%2C_Palais_de_la_Nation%2C_juillet_1856.jpg/640px-25e_anniversaire_de_l%27inauguration_du_roi_L%C3%A9opold_Ier%2C_Palais_de_la_Nation%2C_juillet_1856.jpg", "og:image:width": "640", "og:image:height": "959", "og:title": "File:25e anniversaire de l'inauguration du roi Léopold Ier, Palais de la Nation, juillet 1856.jpg - Wikimedia Commons", "og:type": "website" }, "text": "" }, { "url": "https://commons.wikimedia.org/wiki/File:25e_anniversaire_de_l%27inauguration_du_roi_L%C3%A9opold_Ier,_Place_royale,_juillet_1856_-_Hymans.jpg", "metadata": { "og:image": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/d1/25e_anniversaire_de_l%27inauguration_du_roi_L%C3%A9opold_Ier%2C_Place_royale%2C_juillet_1856_-_Hymans.jpg/640px-25e_anniversaire_de_l%27inauguration_du_roi_L%C3%A9opold_Ier%2C_Place_royale%2C_juillet_1856_-_Hymans.jpg", "og:image:width": "640", "og:image:height": "932", "og:title": "File:25e anniversaire de l'inauguration du roi Léopold Ier, Place royale, juillet 1856 - Hymans.jpg - Wikimedia Commons", "og:type": "website" }, "text": "" }, { "url": "https://commons.wikimedia.org/wiki/File:25e_anniversaire_de_l%27inauguration_du_roi_L%C3%A9opold_Ier,_Place_royale,_juillet_1856.jpg", "metadata": { "og:image": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/35/25e_anniversaire_de_l%27inauguration_du_roi_L%C3%A9opold_Ier%2C_Place_royale%2C_juillet_1856.jpg/640px-25e_anniversaire_de_l%27inauguration_du_roi_L%C3%A9opold_Ier%2C_Place_royale%2C_juillet_1856.jpg", "og:image:width": "640", "og:image:height": "826", "og:title": "File:25e anniversaire de l'inauguration du roi Léopold Ier, Place royale, juillet 1856.jpg - Wikimedia Commons", "og:type": "website" }, 
"text": "" }, { "url": "https://commons.wikimedia.org/wiki/File:25e_anniversaire_de_l%E2%80%99inauguration_du_roi_L%C3%A9opold_Ier_-_Hymans.jpg", "metadata": { "og:image": "https://upload.wikimedia.org/wikipedia/commons/thumb/5/53/25e_anniversaire_de_l%E2%80%99inauguration_du_roi_L%C3%A9opold_Ier_-_Hymans.jpg/640px-25e_anniversaire_de_l%E2%80%99inauguration_du_roi_L%C3%A9opold_Ier_-_Hymans.jpg", "og:image:width": "640", "og:image:height": "947", "og:title": "File:25e anniversaire de l’inauguration du roi Léopold Ier - Hymans.jpg - Wikimedia Commons", "og:type": "website" }, "text": "" }, { "url": "https://commons.wikimedia.org/wiki/File:25e_anniversaire_de_l%E2%80%99inauguration_du_roi_L%C3%A9opold_Ier_le_22_juillet_1856.jpg", "metadata": { "og:image": "https://upload.wikimedia.org/wikipedia/commons/thumb/8/83/25e_anniversaire_de_l%E2%80%99inauguration_du_roi_L%C3%A9opold_Ier_le_22_juillet_1856.jpg/640px-25e_anniversaire_de_l%E2%80%99inauguration_du_roi_L%C3%A9opold_Ier_le_22_juillet_1856.jpg", "og:image:width": "640", "og:image:height": "436", "og:title": "File:25e anniversaire de l’inauguration du roi Léopold Ier le 22 juillet 1856.jpg - Wikimedia Commons", "og:type": "website" }, "text": "" }, { "url": "https://commons.wikimedia.org/wiki/File:25e_anniversaire_de_l%E2%80%99inauguration_du_roi_L%C3%A9opold_Ier.jpg", "metadata": { "og:image": "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c7/25e_anniversaire_de_l%E2%80%99inauguration_du_roi_L%C3%A9opold_Ier.jpg/640px-25e_anniversaire_de_l%E2%80%99inauguration_du_roi_L%C3%A9opold_Ier.jpg", "og:image:width": "640", "og:image:height": "972", "og:title": "File:25e anniversaire de l’inauguration du roi Léopold Ier.jpg - Wikimedia Commons", "og:type": "website" }, "text": "" }, { "url": "https://commons.wikimedia.org/wiki/File:Accueil_de_la_princesse_St%C3%A9phanie_de_Belgique_par_le_bourgmestre_lors_de_son_mariage_avec_l%27archiduc_Rodolphe_d%27Autriche_le_10_mai_1881_-_The_Graphic.jpg", "metadata": { 
"og:image": "https://upload.wikimedia.org/wikipedia/commons/thumb/0/06/Accueil_de_la_princesse_St%C3%A9phanie_de_Belgique_par_le_bourgmestre_lors_de_son_mariage_avec_l%27archiduc_Rodolphe_d%27Autriche_le_10_mai_1881_-_The_Graphic.jpg/640px-Accueil_de_la_princesse_St%C3%A9phanie_de_Belgique_par_le_bourgmestre_lors_de_son_mariage_avec_l%27archiduc_Rodolphe_d%27Autriche_le_10_mai_1881_-_The_Graphic.jpg", "og:image:width": "640", "og:image:height": "474", "og:title": "File:Accueil de la princesse Stéphanie de Belgique par le bourgmestre lors de son mariage avec l'archiduc Rodolphe d'Autriche le 10 mai 1881 - The Graphic.jpg - Wikimedia Commons", "og:type": "website" }, "text": "\n\n\n\nArtist\n\n\nsigned (bottom left)\n\n\nAuthor\n\n\nThe Graphic\n\n\nDescription\n\n\nEnglish: The Royal Wedding in Austria. Illustration for The Graphic, 21 May 1881. reference\nPrincess Stéphanie of Belgium is welcomed by the mayor during her marriage to Archduke Rudolph of Austria, Vienna (Austria), 10 May 1881.\nFrançais : Accueil de la princesse Stéphanie de Belgique par le bourgmestre lors de son mariage avec l’archiduc Rodolphe d’Autriche, Vienne (Autriche), le 10 mai 1881.\nNederlands: Prinses Stefanie van België wordt verwelkomd door de burgemeester tijdens haar huwelijk met aartshertog Rudolf van Oostenrijk, Wenen (Oostenrijk), 10 mei 1881.\n\n\nDate\n\n21 May 1881date QS:P571,+1881-05-21T00:00:00Z/11\n\n\nSource/Photographer\n\nThe Graphic (United Kingdom), 21 May 1881, p. 492.\n\nTechniqueInfoFieldwood engraving print\n\nSizeInfoFieldheight: 17 cm (6.6 in); width: 23 cm (9 in)dimensions QS:P2048,17U174728dimensions QS:P2049,23U174728\n\nInscriptionsInfoField\n\nOriginal captionInfoFieldThe royal wedding in Austria. State entry of the Princess Stéphanie into Vienna. 
The burgomaster presenting the address of welcome.\n\nScanned byInfoFieldÉric Dodémont (Dodeeric)\n\n\n" }, { "url": "https://commons.wikimedia.org/wiki/File:Accueil_de_la_princesse_St%C3%A9phanie_de_Belgique_par_le_bourgmestre_lors_de_son_mariage_avec_l%E2%80%99archiduc_Rodolphe_d%E2%80%99Autriche_le_10_mai_1881.jpg", "metadata": { "og:image": "https://upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Accueil_de_la_princesse_St%C3%A9phanie_de_Belgique_par_le_bourgmestre_lors_de_son_mariage_avec_l%E2%80%99archiduc_Rodolphe_d%E2%80%99Autriche_le_10_mai_1881.jpg/640px-Accueil_de_la_princesse_St%C3%A9phanie_de_Belgique_par_le_bourgmestre_lors_de_son_mariage_avec_l%E2%80%99archiduc_Rodolphe_d%E2%80%99Autriche_le_10_mai_1881.jpg", "og:image:width": "640", "og:image:height": "451", "og:title": "File:Accueil de la princesse Stéphanie de Belgique par le bourgmestre lors de son mariage avec l’archiduc Rodolphe d’Autriche le 10 mai 1881.jpg - Wikimedia Commons", "og:type": "website" }, "text": "" }, { "url": "https://commons.wikimedia.org/wiki/File:Anniversaire_de_la_mort_d%E2%80%99Adolphe_Thiers.jpg", "metadata": { "og:image": "https://upload.wikimedia.org/wikipedia/commons/thumb/8/8f/Anniversaire_de_la_mort_d%E2%80%99Adolphe_Thiers.jpg/640px-Anniversaire_de_la_mort_d%E2%80%99Adolphe_Thiers.jpg", "og:image:width": "640", "og:image:height": "438", "og:title": "File:Anniversaire de la mort d’Adolphe Thiers.jpg - Wikimedia Commons", "og:type": "website" }, "text": "" }, { "url": "https://commons.wikimedia.org/wiki/File:Antoine_Clesse_-_Po%C3%A8te_et_chansonnier_belge.jpg", "metadata": { "og:image": "https://upload.wikimedia.org/wikipedia/commons/thumb/6/6a/Antoine_Clesse_-_Po%C3%A8te_et_chansonnier_belge.jpg/640px-Antoine_Clesse_-_Po%C3%A8te_et_chansonnier_belge.jpg", "og:image:width": "640", "og:image:height": "777", "og:title": "File:Antoine Clesse - Poète et chansonnier belge.jpg - Wikimedia Commons", "og:type": "website" }, "text": "" }, { "url": 
"https://commons.wikimedia.org/wiki/File:Archiduc_Rodolphe,_prince_h%C3%A9ritier_d%27Autriche_et_de_Hongrie.jpg", "metadata": { "og:image": "https://upload.wikimedia.org/wikipedia/commons/thumb/9/9c/Archiduc_Rodolphe%2C_prince_h%C3%A9ritier_d%27Autriche_et_de_Hongrie.jpg/640px-Archiduc_Rodolphe%2C_prince_h%C3%A9ritier_d%27Autriche_et_de_Hongrie.jpg", "og:image:width": "640", "og:image:height": "890", "og:title": "File:Archiduc Rodolphe, prince héritier d'Autriche et de Hongrie.jpg - Wikimedia Commons", "og:type": "website" }, "text": "" }, { "url": "https://commons.wikimedia.org/wiki/File:Archiduchesse_%C3%89lisabeth-Marie_d%27Autriche.jpg", "metadata": { "og:image": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Archiduchesse_%C3%89lisabeth-Marie_d%27Autriche.jpg/640px-Archiduchesse_%C3%89lisabeth-Marie_d%27Autriche.jpg", "og:image:width": "640", "og:image:height": "677", "og:title": "File:Archiduchesse Élisabeth-Marie d'Autriche.jpg - Wikimedia Commons", "og:type": "website" }, "text": "" }, { "url": "https://commons.wikimedia.org/wiki/File:Arriv%C3%A9e_de_L%C3%A9opold_et_Marie-Henriette_%C3%A0_Spa.jpg", "metadata": { "og:image": "https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Arriv%C3%A9e_de_L%C3%A9opold_et_Marie-Henriette_%C3%A0_Spa.jpg/640px-Arriv%C3%A9e_de_L%C3%A9opold_et_Marie-Henriette_%C3%A0_Spa.jpg", "og:image:width": "640", "og:image:height": "429", "og:title": "File:Arrivée de Léopold et Marie-Henriette à Spa.jpg - Wikimedia Commons", "og:type": "website" }, "text": "" }, { "url": "https://commons.wikimedia.org/wiki/File:Arriv%C3%A9e_de_L%C3%A9opold_Ier_%C3%A0_la_fronti%C3%A8re_belge_le_17_juillet_1831.jpg", "metadata": { "og:image": "https://upload.wikimedia.org/wikipedia/commons/thumb/2/25/Arriv%C3%A9e_de_L%C3%A9opold_Ier_%C3%A0_la_fronti%C3%A8re_belge_le_17_juillet_1831.jpg/640px-Arriv%C3%A9e_de_L%C3%A9opold_Ier_%C3%A0_la_fronti%C3%A8re_belge_le_17_juillet_1831.jpg", "og:image:width": "640", "og:image:height": "385", 
"og:title": "File:Arrivée de Léopold Ier à la frontière belge le 17 juillet 1831.jpg - Wikimedia Commons", "og:type": "website" }, "text": "" }, { "url": "https://commons.wikimedia.org/wiki/File:Arriv%C3%A9e_de_L%C3%A9opold_II_devant_le_th%C3%A9%C3%A2tre_de_la_Monnaie_le_jour_de_son_av%C3%A8nement_le_17_d%C3%A9cembre_1865_-_Smeeton.jpg", "metadata": { "og:image": "https://upload.wikimedia.org/wikipedia/commons/thumb/7/7e/Arriv%C3%A9e_de_L%C3%A9opold_II_devant_le_th%C3%A9%C3%A2tre_de_la_Monnaie_le_jour_de_son_av%C3%A8nement_le_17_d%C3%A9cembre_1865_-_Smeeton.jpg/640px-Arriv%C3%A9e_de_L%C3%A9opold_II_devant_le_th%C3%A9%C3%A2tre_de_la_Monnaie_le_jour_de_son_av%C3%A8nement_le_17_d%C3%A9cembre_1865_-_Smeeton.jpg", "og:image:width": "640", "og:image:height": "442", "og:title": "File:Arrivée de Léopold II devant le théâtre de la Monnaie le jour de son avènement le 17 décembre 1865 - Smeeton.jpg - Wikimedia Commons", "og:type": "website" }, "text": "" }, { "url": "https://commons.wikimedia.org/wiki/File:Arriv%C3%A9e_de_L%C3%A9opold_II_devant_le_th%C3%A9%C3%A2tre_de_la_Monnaie_le_jour_de_son_av%C3%A8nement_le_17_d%C3%A9cembre_1865_-_Verdyen_-_Rod.jpg", "metadata": { "og:image": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/39/Arriv%C3%A9e_de_L%C3%A9opold_II_devant_le_th%C3%A9%C3%A2tre_de_la_Monnaie_le_jour_de_son_av%C3%A8nement_le_17_d%C3%A9cembre_1865_-_Verdyen_-_Rod.jpg/640px-Arriv%C3%A9e_de_L%C3%A9opold_II_devant_le_th%C3%A9%C3%A2tre_de_la_Monnaie_le_jour_de_son_av%C3%A8nement_le_17_d%C3%A9cembre_1865_-_Verdyen_-_Rod.jpg", "og:image:width": "640", "og:image:height": "380", "og:title": "File:Arrivée de Léopold II devant le théâtre de la Monnaie le jour de son avènement le 17 décembre 1865 - Verdyen - Rod.jpg - Wikimedia Commons", "og:type": "website" }, "text": "" }, { "url": "https://commons.wikimedia.org/wiki/File:Arriv%C3%A9e_de_L%C3%A9opold_II_devant_le_th%C3%A9%C3%A2tre_de_la_Monnaie_le_jour_de_son_av%C3%A8nement_le_17_d%C3%A9cembre_1865_-_Verdyen.jpg", 
"metadata": { "og:image": "https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/Arriv%C3%A9e_de_L%C3%A9opold_II_devant_le_th%C3%A9%C3%A2tre_de_la_Monnaie_le_jour_de_son_av%C3%A8nement_le_17_d%C3%A9cembre_1865_-_Verdyen.jpg/640px-Arriv%C3%A9e_de_L%C3%A9opold_II_devant_le_th%C3%A9%C3%A2tre_de_la_Monnaie_le_jour_de_son_av%C3%A8nement_le_17_d%C3%A9cembre_1865_-_Verdyen.jpg", "og:image:width": "640", "og:image:height": "369", "og:title": "File:Arrivée de Léopold II devant le théâtre de la Monnaie le jour de son avènement le 17 décembre 1865 - Verdyen.jpg - Wikimedia Commons", "og:type": "website" }, "text": "\n\n\nDescriptionArrivée de Léopold II devant le théâtre de la Monnaie le jour de son avènement le 17 décembre 1865 - Verdyen.jpg\n\nFrançais : Arrivée de Léopold II devant le théâtre de la Monnaie le jour de son avènement, Bruxelles, le 17 décembre 1865.\n\n\nDate\n\n1888\n\n\nSource\n\nThéodore Juste, Histoire de Belgique, tome III, Bruylant-Christophe, Bruxelles, 1888, p. 317.\n\n\nAuthor\n\ndrawer:\n\n\n\nEugène Verdyen\n (1836–1903) \n\n\nAlternative names\n\nEugène Verdijen\n\nDescription\nBelgian painter and drawer\n\nDate of birth/death\n\n29 August 1836 \n17 June 1903 \n\nLocation of birth/death\n\nLiège\nCity of Brussels\n\nAuthority file\n\n: Q26235979\nVIAF: 96356968\nULAN: 500092395\nLCCN: no2017155663\nGND: 1137839317\nSUDOC: 203177894\nWorldCat\n\n\n\ncreator QS:P170,Q26235979\nengraver: Ch. Rod.\n\nAfter (photo):\n\n\n\nGhémar Frères studio\n (fl. 1882) \n\n\nAlternative names\n\nGhemar Frères Atelier de Photographie\n\nDescription\nBelgianGhémar Frères, Photographes du Roi, 27, Rue de l'Ecuyer, Bruxelles.\nEnglish: The photo studio of the two brothers Ghémar was the most renowned Belgian photostudio in the period 1855–1870. \nIn 1855 Louis Ghémar (1820–1873)[1] opened a photostudio in Brussels, next to the studio of Jules Géruzet. \nIn the beginning Louis Ghémar worked together with Robert Sévérin. 
Sévérin took the photos and Ghémar made the retouches and eventually colours the photos. But Sévérin left Brussels and was replaced by Louis Ghémars halfbrother, Léon Auverlaux. From then on the name of the studio was changed to Ghémar Frères. \n\nLouis Ghémar died in 1873 but the studio kept the name Ghémar Frères until 1894 when Géruzet takes over the studio, including all the negatives of Ghémar.[2]\n\nWork period\nbetween 1855 and 1894date QS:P,+1850-00-00T00:00:00Z/7,P1319,+1855-00-00T00:00:00Z/9,P1326,+1894-00-00T00:00:00Z/9\n\nWork location\n\nCity of Brussels\n\nAuthority file\n\n: Q21557453\nVIAF: 158671009\nLCCN: no2007070950\nBNF: 15346099j\nRKD: 417254\nWorldCat\n\n\n\ncreator QS:P170,Q21557453\n\n\nOther versions\n\n\n\n\nFile:Arrivée de Léopold II devant le théâtre de la Monnaie le jour de son avènement le 17 décembre 1865 - Verdyen - Rod.jpg\n\n\n\n\nTechniqueInfoFieldwood engraving print\n\nSizeInfoFieldheight: 7 cm (2.7 in); width: 11 cm (4.3 in)dimensions QS:P2048,7U174728dimensions QS:P2049,11U174728\n\nInscriptionsInfoFieldSignature bottom left: \nVERDYEN\n\nSignature bottom right: \n\nCH.ROD\n\n\nOriginal captionInfoFieldEntrée solennelle à Bruxelles de Léopold II (17 décembre 1865).\n\nScanned byInfoFieldÉric Dodémont (Dodeeric)\n\n\n" }, { "url": "https://commons.wikimedia.org/wiki/File:Arriv%C3%A9e_de_L%C3%A9opold_II_devant_le_th%C3%A9%C3%A2tre_de_la_Monnaie_le_jour_de_son_av%C3%A8nement_le_17_d%C3%A9cembre_1865.jpg", "metadata": { "og:image": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/db/Arriv%C3%A9e_de_L%C3%A9opold_II_devant_le_th%C3%A9%C3%A2tre_de_la_Monnaie_le_jour_de_son_av%C3%A8nement_le_17_d%C3%A9cembre_1865.jpg/640px-Arriv%C3%A9e_de_L%C3%A9opold_II_devant_le_th%C3%A9%C3%A2tre_de_la_Monnaie_le_jour_de_son_av%C3%A8nement_le_17_d%C3%A9cembre_1865.jpg", "og:image:width": "640", "og:image:height": "439", "og:title": "File:Arrivée de Léopold II devant le théâtre de la Monnaie le jour de son avènement le 17 décembre 1865.jpg - 
Wikimedia Commons", "og:type": "website" }, "text": "" },
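To see at a glance which files are affected in a dump like the one above, the records can be split on whether their "text" field is empty. This is a hedged sketch, not code from the repository: the function names and the file path argument are hypothetical, and the sample records are shortened versions of the two Northcliffe Beach entries.

```python
import json


def load_records(path):
    """Load a scraped JSON file; assumes it is a list of {"url", "metadata", "text"} objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def split_by_text(records):
    """Return (ok_urls, nok_urls): URLs with a non-empty text field and URLs with an empty one."""
    ok, nok = [], []
    for record in records:
        target = ok if record.get("text", "").strip() else nok
        target.append(record["url"])
    return ok, nok


# Shortened sample records (metadata omitted, text abbreviated).
sample = [
    {
        "url": "https://commons.wikimedia.org/wiki/File:Northcliffe_Beach,_Surfers_Paradise,_Queensland_08.jpg",
        "text": "",
    },
    {
        "url": "https://commons.wikimedia.org/wiki/File:Northcliffe_SLSC,_Northcliffe_Beach,_Surfers_Paradise,_Queensland_04.jpg",
        "text": "\n\n\nDescriptionNorthcliffe SLSC, Northcliffe Beach\n\n",
    },
]

ok_urls, nok_urls = split_by_text(sample)
print(len(ok_urls), "OK,", len(nok_urls), "NOK")
```

A text field that contains only whitespace is counted as NOK here, which is a design choice of this sketch.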

dodeeric commented 3 months ago

The result is sometimes OK and sometimes NOK, even for the exact same page URL scraped twice!

new json: text field nok

{ "url": "https://commons.wikimedia.org/wiki/File:25e_anniversaire_de_l%27inauguration_du_roi_L%C3%A9opold_Ier,_Place_royale,_juillet_1856_-_Hymans.jpg", "metadata": { "og:image": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/d1/25e_anniversaire_de_l%27inauguration_du_roi_L%C3%A9opold_Ier%2C_Place_royale%2C_juillet_1856_-_Hymans.jpg/640px-25e_anniversaire_de_l%27inauguration_du_roi_L%C3%A9opold_Ier%2C_Place_royale%2C_juillet_1856_-_Hymans.jpg", "og:image:width": "640", "og:image:height": "932", "og:title": "File:25e anniversaire de l'inauguration du roi Léopold Ier, Place royale, juillet 1856 - Hymans.jpg - Wikimedia Commons", "og:type": "website" }, "text": "" },

old json: text field ok

sqlite> select * from embedding_metadata where string_value like '%Place_royale,_juillet_1856_-_Hymans.jpg%';

7|chroma:document| { "url": "https://commons.wikimedia.org/wiki/File:25e_anniversaire_de_l%27inauguration_du_roi_L%C3%A9opold_Ier,_Place_royale,_juillet_1856_-_Hymans.jpg", "metadata": {"og:image": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/d1/25e_anniversaire_de_l%27inauguration_du_roi_L%C3%A9opold_Ier%2C_Place_royale%2C_juillet_1856_-_Hymans.jpg/640px-25e_anniversaire_de_l%27inauguration_du_roi_L%C3%A9opold_Ier%2C_Place_royale%2C_juillet_1856_-_Hymans.jpg", "og:image:width": "640", "og:image:height": "932", "og:title": "File:25e anniversaire de l'inauguration du roi L\u00e9opold Ier, Place royale, juillet 1856 - Hymans.jpg - Wikimedia Commons", "og:type": "website"}, "text": "\n\n\nDescription25e anniversaire de l'inauguration du roi L\u00e9opold Ier, Place royale, juillet 1856 - Hymans.jpg\n\nFran\u00e7ais\u00a0: 25e anniversaire de l\u2019inauguration du roi

dodeeric commented 3 months ago

The class "hproduct commons-file-information-table" is no longer present on all pages!

new json

text field nok:

===> "fileinfotpl-type-information vevent mw-content-ltr" is present in place of the old class!

{ "url": "https://commons.wikimedia.org/wiki/File:Contele_Filip_DeFlandra.jpg", "metadata": { "og:image": "https://upload.wikimedia.org/wikipedia/commons/5/5f/Contele_Filip_DeFlandra.jpg", "og:image:width": "640", "og:image:height": "893", "og:title": "File:Contele Filip DeFlandra.jpg - Wikimedia Commons", "og:type": "website" }, "text": "" },

text field ok:

===> the old class "hproduct commons-file-information-table" is still present

{ "url": "https://commons.wikimedia.org/wiki/File:De_Graaf_van_Vlaanderen.jpg", "metadata": { "og:image": "https://upload.wikimedia.org/wikipedia/commons/thumb/6/6c/De_Graaf_van_Vlaanderen.jpg/640px-De_Graaf_van_Vlaanderen.jpg", "og:image:width": "640", "og:image:height": "917", "og:title": "File:De Graaf van Vlaanderen.jpg - Wikimedia Commons", "og:type": "website" }, "text": "\n\n\n\nLievin de Winne: Prince Philippe, count of Flanders\n \n\n\nArtist\n\n\n\n\n\n\nLievin de Winne\n (1821–1880) \n\n\nAlternative names\n\nLeévin de Winne; Lievin De Winne; Liévin De Winne; Liévin de Winne; Leevin de Winne; Liéven De Winne\n\nDescription\nBelgian painter\n\nDate of birth/death\n\n24 January 1821 \n13 May 1880 \n\nLocation of birth/death\n\nGhent\nCity of Brussels\n\nWork location\n\nGhent, Paris (1852-1855)\n\nAuthority file\n\n: Q713128\nVIAF: 95959495\nISNI: 0000000069134913\nULAN: 500042641\nGND: 174289618\nRKD: 21172\n\n\n\nartist QS:P170,Q713128\n\n\nTitle\n\n\nDutch: Portret van Filip, graaf van Vlaanderen prince Philippe, count of Flanderslabel QS:Len,\"prince Philippe, count of Flanders\"\n\n\nObject type\n\npaintingobject_type QS:P31,Q3305213\n\n\nGenre\n\nportrait \n\n\nMedium\n\noilmedium QS:P186,Q296955\n\n\nDimensions\n\nheight: 133 cm (52.3 in); width: 91 cm (35.8 in)dimensions QS:P2048,133U174728dimensions QS:P2049,91U174728\n\n\nCollection\n\n\n\n\nRoyal Collection of Belgium\n \n\n\n\n\n\nNative name\nKoninklijke Verzameling\n\nLocation\nCity of Brussels\n\nWebsite\nwww.monarchie.be/de/monarchie/zivilliste/konigliche-sammlung \n\nAuthority file\n\n: Q2536986\n\n\n\ninstitution QS:P195,Q2536986\n\n\nAccession number\n\n\n2707 (Royal Collection of Belgium) \n\n\nReferences\n\nBALaT object ID: 20023138 \n\n\nSource/Photographer\n\nUnknown sourceUnknown source\n\n\n" },

dodeeric commented 3 months ago

A) Class "hproduct commons-file-information-table" is present on both of the following pages:

https://commons.wikimedia.org/wiki/File:Contele_Filip_DeFlandra.jpg ==> text field NOK (empty)

https://commons.wikimedia.org/wiki/File:De_Graaf_van_Vlaanderen.jpg ==> text field OK

B) Class "mw-content-ltr mw-parser-output" (summary block) is present twice on both of the following pages (same as above):

https://commons.wikimedia.org/wiki/File:Contele_Filip_DeFlandra.jpg ==> text field OK

https://commons.wikimedia.org/wiki/File:De_Graaf_van_Vlaanderen.jpg ==> text field OK

dodeeric commented 3 months ago

Scraping of the "Engravings by Dodeeric" and "Northcliffe Beach" categories works again after changing the filter/class from "hproduct commons-file-information-table" to "mw-content-ltr mw-parser-output".
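The effect of the class change can be illustrated with a small extractor that tries several class filters in order and keeps the first non-empty result. This is only a sketch under stated assumptions: it uses the standard-library `html.parser` as a stand-in for whatever HTML library the scraper really uses, the names `ClassTextExtractor` and `extract_text` are hypothetical, the fallback over several filters is an idea added here (the actual fix was simply switching the class), and the HTML snippet is a made-up miniature of a file page.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element that carries all the wanted classes."""

    def __init__(self, wanted_classes):
        super().__init__(convert_charrefs=True)
        self.wanted = set(wanted_classes.split())
        self.depth = 0    # > 0 while the parser is inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.wanted <= set((dict(attrs).get("class") or "").split()):
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_text(html, class_filters):
    """Try each class filter in order; return the first non-empty text, else ''."""
    for class_filter in class_filters:
        parser = ClassTextExtractor(class_filter)
        parser.feed(html)
        parser.close()
        text = "".join(parser.chunks).strip()
        if text:
            return text
    return ""


OLD_FILTER = "hproduct commons-file-information-table"
NEW_FILTER = "mw-content-ltr mw-parser-output"

# Made-up miniature of a page where the old class is missing.
page = ('<div class="mw-content-ltr mw-parser-output">'
        '<table class="fileinfotpl-type-information vevent mw-content-ltr">'
        '<tr><td>Description</td><td>Portrait</td></tr></table></div>')

print(repr(extract_text(page, [OLD_FILTER])))
print(repr(extract_text(page, [OLD_FILTER, NEW_FILTER])))
```

The depth counter assumes well-formed HTML with explicit closing tags; pages with unclosed elements would need a more tolerant parser.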