elpendor / ES-scraper

A scraper for EmulationStation

Interactive Scrape #12

Closed ReliCWeb closed 10 years ago

ReliCWeb commented 11 years ago

How about an interactive scrape that asks the user if the fetched info is correct for each rom? For example, it would go through each rom one by one, and for each one it would display the fetched info (descrip, publisher, release date, etc) and ask if the fetched info is correct. If so, download and move on to the next rom. If not, fetch the next hit from the search. Perhaps display a list of 'hits' for each rom and let the user pick the correct one. Also let the user skip and move to the next rom.

Sure this would be a very tedious process for large collections. But if you were to run it once for each system it wouldn't be too bad. After that, new rom entries could be manually added, or a differential scrape could be done (only scrapes roms that don't have current info/art present).

If any of this is already implemented I apologize! Thanks for an awesome script!
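The "differential scrape" idea above could be sketched roughly as follows. This is only an illustration, not code from ES-scraper: the function name `roms_needing_scrape` is hypothetical, and it assumes the gamelist uses `<game>` entries with `<path>`, `<desc>` and `<image>` children, and that an entry missing a description or image counts as "not scraped yet".

```python
import os
import xml.etree.ElementTree as ET


def roms_needing_scrape(rom_paths, gamelist_xml):
    """Return the ROM paths that have no complete entry in the gamelist.

    An entry is treated as complete when it has a path, a description
    and an image (an assumption made for this sketch).
    """
    root = ET.fromstring(gamelist_xml)
    complete = set()
    for game in root.findall("game"):
        path = (game.findtext("path") or "").strip()
        desc = (game.findtext("desc") or "").strip()
        image = (game.findtext("image") or "").strip()
        if path and desc and image:
            # Compare by file name so "./a.smc" and "a.smc" are the same ROM.
            complete.add(os.path.basename(path))
    return [p for p in rom_paths if os.path.basename(p) not in complete]
```

Only the ROMs returned by this function would then be sent to the online database, so existing hand-edited entries are left alone.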

elpendor commented 11 years ago

Aloshi already mentioned the same idea to me a while ago.

I'm probably adding this on the next big update, with a couple more enhancements.

robertybob commented 11 years ago

Would this require some form of GUI in order to see the pictures? Or would it be the filenames of the images that the user compares?

ReliCWeb commented 11 years ago

I would think visual confirmation would be best, but that's just IMHO. Because this would require a GUI implementation, perhaps just validating the image filename along with the other fetched info (developer, publisher, release date, genre, etc) would suffice.

elpendor commented 11 years ago

I don't plan on showing boxarts to confirm, just the game's metadata. Each DB has some form of moderation so I'll assume the correct boxarts are in place.

Aloshi commented 11 years ago

For what it's worth, integrating the scraper with ES itself is something I would like to do down the line. But because the scraper works as-is and the integration would be time consuming (the script would be rewritten in C++, which is missing a lot of the python stuff ES-scraper makes use of, which means it'll have to be replaced with a library), I'm leaving it for (much) later.

elpendor commented 11 years ago

Just committed a basic version of this. Using the switch -m, you can now choose from multiple results (assuming there's more than one).

Not gonna close this yet though. I might add the possibility of manually re-entering a query, something Aloshi mentioned a while back.
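For readers wondering what a "choose from multiple results" prompt might look like, here is a minimal sketch. It is not the code behind -m: the function `choose_result` and its `ask` parameter are hypothetical names, and search results are assumed to be plain title strings.

```python
def choose_result(rom_name, results, ask=input):
    """Let the user pick one of several search hits.

    Returns the chosen title, or None when there is nothing to pick
    or the user skips the ROM.
    """
    if not results:
        return None
    if len(results) == 1:
        return results[0]  # a single hit needs no confirmation
    print("Multiple results for %s:" % rom_name)
    for number, title in enumerate(results, start=1):
        print("  [%d] %s" % (number, title))
    while True:
        answer = ask("Pick a number (or 's' to skip): ").strip().lower()
        if answer == "s":
            return None
        if answer.isdecimal() and 1 <= int(answer) <= len(results):
            return results[int(answer) - 1]
        print("Invalid choice, try again.")
```

Passing the prompt function in as `ask` keeps the loop testable without a terminal; in normal use the default `input` is enough.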

ReliCWeb commented 11 years ago

Excellent, I'll definitely try it out with my collection and let you know how it works. I know I had a few roms that were incorrect, but only because they grabbed the first result. If they would have grabbed a subsequent result they would have been correct.

ReliCWeb commented 11 years ago

Well I did a quick test run and it seems to work. When it hit Doom, it prompted for me to select which Doom game it was. Most others it just automatically ran (assuming only 1 hit). I didn't let it run too long (CTRL+C'd out) as I've put a ton of manual work into my gamelists and didn't want anything overwritten.

Probably over the weekend I'll make a backup of all my gamelists and do a manual scrape of my whole collection (~150 games over most of the systems). I know I'll definitely hit some multiple-result choices. I'll let you know how that goes.

chewi commented 11 years ago

This feature might be less necessary if the ordering issue is fixed but it could still be useful.

elpendor commented 11 years ago

> This feature might be less necessary if the ordering issue is fixed but it could still be useful.

Yeah, I called it (for some reason) the "numbering" issue. I thought about implementing the same hack you mentioned in that post, but it's really the API's problem and they should fix that.

I think Aloshi mentioned that it was a common error and he saw that on another online database as well.

Thanks for reporting that to them, I was going to eventually. Hopefully, they'll do something about it.

elpendor commented 11 years ago

Apparently that bug was filed on their GitHub repo 5 months ago. It's still open.

transcendtient commented 11 years ago

I've been scraping a lot of roms, and the only problem I have is that roms whose names are close to the database name, but swap Roman numerals for numbers, are not matched.

An example would be... "Teenage Mutant Ninja Turtles 2" won't match the database entry "Teenage Mutant Ninja Turtles II". It usually skips roms with this specific problem without giving you a choice of titles.

Also, sometimes there are so many matches that the list scrolls off the screen and I no longer know which rom it's trying to match. Maybe pipe the list through "| more", or an equivalent, so it doesn't scroll off the screen before you get a chance to look at it.

elpendor commented 11 years ago

> I've been scraping a lot of roms, and the only problem I have is that roms whose names are close to the database name, but swap Roman numerals for numbers, are not matched.
>
> An example would be... "Teenage Mutant Ninja Turtles 2" won't match the database entry "Teenage Mutant Ninja Turtles II". It usually skips roms with this specific problem without giving you a choice of titles.

As has been said before, thegamesdb.net has some ordering issues.

The scraper tries to be as automatic as possible and picks the first result, assuming it's the most accurate. I could try some hacks, but it's really up to them to fix it. I'm almost sure the API itself does some conversion between Roman and Arabic numerals when you search, so that's kinda pointless to implement on my end. Like I said, it's up to them.

Until then you'll have to use the -m switch and choose the game manually. I'm gonna add the possibility to re-enter the query by hand soon, if needed. That should help things a little bit more.
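For illustration only, the kind of client-side "hack" discussed above could look like the sketch below: normalize both the file name and the database title before comparing them, so that "II" and "2" are treated as equal. The names `ROMAN` and `normalize_title` are hypothetical, and nothing here is part of ES-scraper or the thegamesdb.net API.

```python
import re

# Roman numerals 2-10. "i" is left out on purpose because it is usually
# the pronoun; "v" and "x" stay ambiguous (think "Mega Man X"), which is
# one reason this is only a heuristic.
ROMAN = {
    "ii": "2", "iii": "3", "iv": "4", "v": "5", "vi": "6",
    "vii": "7", "viii": "8", "ix": "9", "x": "10",
}


def normalize_title(title):
    """Lower-case a title, drop punctuation and map Roman numerals to digits."""
    words = re.findall(r"[a-z0-9']+", title.lower())
    return " ".join(ROMAN.get(word, word) for word in words)
```

With this, `normalize_title("Teenage Mutant Ninja Turtles II")` and `normalize_title("Teenage Mutant Ninja Turtles 2")` produce the same string, so a comparison on the normalized form would match the two.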

> Also, sometimes there are so many matches that the list scrolls off the screen and I no longer know which rom it's trying to match. Maybe pipe the list through "| more", or an equivalent, so it doesn't scroll off the screen before you get a chance to look at it.

There shouldn't be that many results on screen unless the filename is really vague (or really common). But if you missed some results you can always scroll up and down using Shift+Page Up/Down.
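If paging inside the script were ever wanted, a minimal sketch might look like this. It is an assumption-laden illustration, not an existing feature: `paginate` and `show_paged` are hypothetical helpers, and the page size of 20 lines is arbitrary.

```python
def paginate(lines, page_size=20):
    """Yield successive pages (lists) of at most page_size lines."""
    for start in range(0, len(lines), page_size):
        yield lines[start:start + page_size]


def show_paged(header, lines, page_size=20, wait=input):
    """Print lines one page at a time, pausing between pages."""
    pages = list(paginate(lines, page_size))
    for index, page in enumerate(pages, start=1):
        # Repeat the ROM name on every page so it never scrolls away.
        print(header)
        for line in page:
            print(line)
        if index < len(pages):
            wait("-- more (%d/%d), press Enter --" % (index, len(pages)))
```

Repeating the header on each page addresses the "I don't know which rom it's trying to match anymore" part of the complaint as well as the scrolling.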

transcendtient commented 11 years ago

I'm so happy you told me about the shift page up down thing. You're three times as awesome now. Sports Illustrated - Championship Football and Baseball.smc gave me around 28 hits. Street Fighter II - The World Warrior.smc gave me 42 hits :P

I'm also getting a lot of skips when the list of games is large: "Super Ghouls 'N Ghosts.smc" gives 75 hits, and skips. I believe that, being a popular game, it must have information in their database.

A lot of the games with "Super" in them skip; the list of candidates is large, and as I said, there is no doubt a matching game in the database for these.

chewi commented 11 years ago

I didn't realise that TheGamesDB is open source when I posted. I have some ideas that should make this work much better. Bear with me. My PHP is very rusty. ;-)

transcendtient commented 11 years ago

I finally got an error that kicked me out of the scraper. It finished my NES games completely. (AND wrote the XML) I got to the SNES title (my second system) Wayne Gretsky and the NHLPA All-Stars and got an error that exited the scraper. This did not write the XML.

    line 281, in scanFiles
    File "/usr/lib/python2.7/dist-packages/PIL/Image.py", line 1980, in open
    raise IOError("cannot identify image file")

I'm guessing this is a database error, but have no idea. I'll probably just be deleting the rom since it's not great anyway.

elpendor commented 11 years ago

> I finally got an error that kicked me out of the scraper. It finished my NES games completely. I got to the SNES title (my second system) Wayne Gretsky and the NHLPA All-Stars and got an error that exited the scraper.
>
> line 281, in scanFiles
> File "/usr/lib/python2.7/dist-packages/PIL/Image.py", line 1980, in open
> raise IOError("cannot identify image file")
>
> I'm guessing this is a database error, but have no idea. I'll probably just be deleting the rom since it's not great anyway.

Works fine here, so I don't think it's related to that game. It seems to me like the image file somehow got corrupted (maybe a memory error; you mentioned lots of ROMs, so maybe you interrupted the scraping at some point). Look for it, delete it and try scraping again.

I'll add a new check later to keep the scraper from stopping altogether. I've been doing some testing and never ran into corrupted images. Until I know how that file was damaged, I can't do much about it.

If it happens again, I'll need as much info as possible (platform, filename, switches used, complete list of files in that folder).
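A check of that kind could be as small as the sketch below. It is only a guess at the shape of such a guard, not the code that was later added: `open_image_or_skip` is a hypothetical helper, and the image opener is passed in as a parameter (in the scraper it would presumably be PIL's `Image.open`, which is what raised the error in the traceback).

```python
def open_image_or_skip(path, open_image):
    """Open an image, returning None instead of crashing on a bad file.

    open_image is any callable that takes a path and raises IOError or
    OSError when the file cannot be read as an image.
    """
    try:
        return open_image(path)
    except OSError as error:  # IOError is an alias of OSError on Python 3
        print("Skipping artwork %s: %s" % (path, error))
        return None
```

The caller can then carry on with the next ROM and still write the XML at the end, instead of losing the whole run to one damaged file.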

elpendor commented 11 years ago

Also, the skipping on those long lists has something to do with some non-ascii characters. Gonna fix that soon.

Edit: Just committed the fix to the repo.
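The committed fix itself isn't shown in this thread. Purely as an illustration of one common way to keep non-ASCII titles from breaking console output or comparisons, a title can be folded to plain ASCII with the standard library; `to_ascii` is a hypothetical name, and this approach silently drops characters that have no ASCII equivalent.

```python
import unicodedata


def to_ascii(title):
    """Fold accented characters to their closest ASCII form."""
    # NFKD splits "é" into "e" plus a combining accent; the accent is
    # then discarded when encoding to ASCII with errors ignored.
    decomposed = unicodedata.normalize("NFKD", title)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```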

transcendtient commented 11 years ago

What I've done to make life easier is to put it on manual scraping, weigh down the enter key, and then I can come back later to label all the ones with a choice of what it may be.

Of 664 SNES roms, it found:
496 jpg, PLUS 79 more with the update: 575
185 matches with no intervention
350 matches with intervention, PLUS 89 more with the update: 439
40 not matched at all; I haven't checked these yet.
3 had a list of things to choose from but no valid entries.
EDIT: Fixed numbers
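As a quick sanity check of the tallies above, the arithmetic can be replayed in a few lines. The variable names are made up for this sketch; since the three main categories already add up to 664, the 3 roms with no valid entries are presumably counted inside one of them.

```python
total_roms = 664

images = 496 + 79             # jpg files before and after the update
automatic = 185               # matches with no intervention
manual = 350 + 89             # matches with intervention, before and after the update
unmatched = 40                # not matched at all

# Every rom should fall into exactly one of the three match categories.
accounted_for = automatic + manual + unmatched
matched_share = 100.0 * (automatic + manual) / total_roms

print("images: %d" % images)
print("accounted for: %d of %d" % (accounted_for, total_roms))
print("matched: %.1f%%" % matched_share)
```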