dijs / infobox-parser

Parse Wikipedia Infoboxes

Need to be able to get the tracklist section from album page #32

Open foderking opened 3 years ago

foderking commented 3 years ago

I'm writing an app to get album information. Right now I'm using a hack: first using a regex to get the "tracklist" section, then parsing that.

It would be cool to be able to parse tracklists easily, especially for double albums where you have 2 or more "{{tracklist...}}" sections.

dijs commented 3 years ago

That sounds like a cool feature! Could you give me a few Wikipedia page examples, please?

foderking commented 3 years ago

https://en.wikipedia.org/wiki/Scorpion_(Drake_album)
https://en.wikipedia.org/wiki/The_Best_in_the_World_Pack
https://en.wikipedia.org/wiki/Positions_(album)

Generally, any page for an album. Parsing the wikitext source ignores the "tracklist" section; that's why I have to use a regex first to get only that section and then parse it.

dijs commented 3 years ago

So, this is an interesting and difficult problem. First of all, track listings are never in an infobox. This parser has stretched itself to parse other things outside of infoboxes (albeit not very well), but I do not think it was wise to do that.

That being said, I may try and refactor out my data-types to common components which can be used to parse infoboxes, page sections, or even entire page sources.

It's a complex problem, like many that come up in wiki-text parsing.

By the way, how did the parsed version of the album look when you did it manually? If it was nice, I may just hack that together for now.

foderking commented 3 years ago

I did a regex match for the tracklist section: /{{track.*list.*?^}}/gmsi
This also captures cases where there are 2 tracklist sections. I then parse the sections independently with the infobox parser. It works pretty well, although producer info is kept in the "extra credits" in the parsed object.

Here's the link to the repository: https://github.com/foderking/WhoProduced/blob/main/src/App.js
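
For anyone following along, here is a minimal sketch of the regex approach described above. It is not the code from the linked repository: the helper name `extractTracklists` and the sample wikitext are made up for illustration, and the pattern is a slightly tightened variant of the one quoted in the comment (`\s*` instead of `.*` between "track" and "list", so a single match cannot run across several templates).

```javascript
// Hypothetical helper (not part of infobox-parser): returns every
// {{Track listing ...}} / {{tracklist ...}} block found in the wikitext.
function extractTracklists(wikitext) {
  // g: find all matches, i: case-insensitive,
  // m: '^' matches at the start of each line, s: '.' also matches newlines.
  // The lazy '.*?' stops at the first line that begins with '}}'.
  const pattern = /\{\{track\s*list.*?^\}\}/gims;
  return wikitext.match(pattern) || [];
}

// Made-up sample resembling a double album with two tracklist templates.
const sample = [
  "== Track listing ==",
  "{{Track listing",
  "| headline = Side A",
  "| title1 = Example One",
  "| length1 = 3:00",
  "}}",
  "{{Track listing",
  "| headline = Side B",
  "| title1 = Example Two",
  "| length1 = 4:00",
  "}}",
  "== Personnel ==",
].join("\n");

const blocks = extractTracklists(sample);
console.log(blocks.length); // one entry per tracklist template
```

Each entry in `blocks` can then be parsed on its own, as described in the comment above. One known limitation of this sketch: a nested template whose closing `}}` sits at the start of a line would end the match early.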