batocera-linux / batocera-emulationstation


Screenscraper.fr quota not correct #1090

Closed · cyberluke closed this 2 years ago

cyberluke commented 2 years ago

Hi, I have upgraded my membership and my daily quota is 50k:

Member : cyberluke

Member (1)
Threads: 1
Pending proposals: 0 / 5
Pending media proposals: 0 / 5
Scrapes today: 20005 / 50000

But Batocera EmulationStation started reporting a quota limit (2021-12-22 17:06:02 ERROR HttpReq::onError (430) : Votre quota de scrape est dépassé pour aujourd'hui ! — "Your scrape quota has been exceeded for today!").

Right now, with every 430 response, I can see the number of scrapes today increase by one.

I see the source code has a default limit of zero and then seems to fetch this information from the API. I tried enabling the --verbose option, but I don't see any additional information.
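For reference, Screenscraper's user-info reply carries the daily quota numbers. Below is a minimal sketch of reading them, not EmulationStation's actual code: the field names (`maxrequestsperday`, `requeststoday`) are my reading of the public API docs, and the parsing helper is hypothetical.

```cpp
// Sketch only: pull the daily quota fields out of a Screenscraper
// ssuserInfos JSON reply. Field names are assumptions from the API docs.
#include <iostream>
#include <regex>
#include <string>

// Extract an integer value for "key" from a JSON string; -1 if absent.
static int jsonInt(const std::string& body, const std::string& key)
{
    std::smatch m;
    std::regex re("\"" + key + "\"\\s*:\\s*\"?(\\d+)");
    return std::regex_search(body, m, re) ? std::stoi(m[1].str()) : -1;
}

int main()
{
    // In the real scraper this body would come back from ssuserInfos.php.
    std::string body = R"({"response":{"ssuser":{"maxrequestsperday":"50000","requeststoday":"20005"}}})";

    int max  = jsonInt(body, "maxrequestsperday");
    int used = jsonInt(body, "requeststoday");
    std::cout << "quota: " << used << " / " << max
              << " (remaining " << (max - used) << ")\n";
}
```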

cyberluke commented 2 years ago

Ok, I will work this out with their admin. But there are three tasks I would like to implement for you:

1) The current algorithm of checking and downloading only the missing items is not good: it is slow and eats requests. It should skip those entries, or there should be an additional option for it. It should remember the last position and continue from there. Every day I scrape one folder, and it always begins from A and then gets stuck.

2) Be able to reverse the order and start fetching from Z instead of the letter A.

3) Mark items that were not found and do not retry them for at least a few days (an expiry date for scraping that specific item).

One of these, or a combination of them, would definitely help me. Let me know which is best for you, so I can create a pull request.

fabricecaruso commented 2 years ago

This happens because, with Screenscraper, you get 1 request per game plus 1 request per media file. As a level 1 member, you have 20000 scrapes a day. If you want to scrape 5 media types (image, video, box art...), you can only scrape about 3300 games.
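Spelling out that arithmetic (constants taken from the paragraph above):

```cpp
// Back-of-the-envelope check: one request per game plus one per media file.
#include <iostream>

int main()
{
    const int dailyQuota   = 20000; // level 1 member
    const int mediaPerGame = 5;     // image, video, box art, ...
    std::cout << dailyQuota / (1 + mediaPerGame) << " games/day\n"; // ~3333
}
```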

  1. It happens because you want to scrape many media files, and some of them are not found because they were not available the first time. So the engine tries again to see if the missing media have become available; there's no reason to change that, or we would never be able to scrape media that were previously missing. So we can't really change the way we detect games with missing media. And I don't like the idea of storing the last position; it's not a good one. The game list can evolve: what about games that were added before that position?

  2. This is a bad idea: if you have 10000 games, scraped the first 3000, then scraped the last 3000... what about the 4000 in the middle?

  3. It's an interesting solution. It would require per-scraper date storage in the gamelists, something like `<scrap scraper="screenscraper">XX/XX/XXXX</scrap>`. When running a scraper we could add an option to ignore games that were scraped in the last few days (some kind of option like "ignore / 1 day / 1 week / 15 days / 1 month / 1 year..."). It's the best solution.
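A rough illustration of that skip test, purely as a sketch: the `DD/MM/YYYY` date format (matching the `XX/XX/XXXX` placeholder above) and the helper names are assumptions, not the final gamelist schema or ES code.

```cpp
// Illustrative only: decide whether a game was scraped recently enough to skip.
#include <ctime>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

// Parse a DD/MM/YYYY date (assumed format) into a time_t.
static std::time_t parseDate(const std::string& s)
{
    std::tm tm{};
    std::istringstream(s) >> std::get_time(&tm, "%d/%m/%Y");
    return std::mktime(&tm);
}

// True if the stored per-scraper date is within the user-selected cutoff.
static bool recentlyScraped(const std::string& lastScrap, int days)
{
    double ageSeconds = std::difftime(std::time(nullptr), parseDate(lastScrap));
    return ageSeconds < days * 24.0 * 3600.0;
}

int main()
{
    // e.g. the "ignore games scraped in the last 15 days" option
    std::cout << std::boolalpha << recentlyScraped("22/12/2021", 15) << "\n";
}
```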

cyberluke commented 2 years ago

I'm not level 1: I have 50 000 scrapes per day, yet it stops at 20 000 as if I were level 1. This is the first bug.

The next bug is that you can choose to scrape only missing files, but it always starts from the very beginning. So the whole quota is exhausted again and you never get past the first half. I have been scraping for many days and it always stops at the same place after 10 hours.

I will implement solution 1) for myself. First I scrape everything I have; then the counter can reset and I can add more titles. But nobody will add new titles until everything is scraped.
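A minimal sketch of that resume idea, assuming a hypothetical plain-text cursor file; a real version would hook into the scraper's game loop:

```cpp
// Sketch: persist the index of the last game processed so the next run can
// continue from there instead of restarting at 'A'. File name is made up.
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

static const char* kCursorFile = "scrape_cursor.txt"; // hypothetical location

static size_t loadCursor()
{
    size_t idx = 0;
    std::ifstream in(kCursorFile);
    in >> idx; // stays 0 if the file is missing
    return idx;
}

static void saveCursor(size_t idx)
{
    std::ofstream(kCursorFile) << idx;
}

int main()
{
    std::vector<std::string> games = { "Alien", "Bomberman", "Zelda" };
    for (size_t i = loadCursor(); i < games.size(); ++i)
    {
        std::cout << "scraping " << games[i] << "\n";
        saveCursor(i + 1); // remember progress after each game
    }
}
```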

Solution 2) is just a workaround for the bugs mentioned above.

So for me there will first be a "complete scrape" mode (solution 1), which will also add a Unix timestamp to the gamelist XML. Once the first complete scrape is done, new games can be added in between, and solution 3 will come into effect.

Solution 2, reversing the scrape order, can be handy if you add a lot of games starting with the letters S, T, U, V, W, X, Y or Z and you want to browse those first.

In the end, the user has control over the decision. The user knows best what they want in their specific case and can optimize the time and the cost (quota). The current algorithm is not smart enough to work for the user, and it is even malfunctioning: it cannot process one directory to the end in a whole week (10 scrape cycles).

Therefore I propose a more complete model for efficient "data mining", where the user has better control and can lower the complexity of the operation (something like lowering the algorithm's O(n)).

cyberluke commented 2 years ago

So right now, from the quota point of view, there is no difference between scraping everything and scraping only what is missing. Yes, existing media won't be downloaded or saved to the drive again, but the same number of requests is made, so on a fast connection the behavior is almost identical.

fabricecaruso commented 2 years ago

All problems are now addressed in this PR: https://github.com/batocera-linux/batocera-emulationstation/pull/1097

About the "I have 50 000 scrapes per day and it stops on 20 000 [...] This is the first bug." point: it's not on the ES side; ES is not responsible for Screenscraper's HTTP responses.

cyberluke commented 2 years ago

Ok, thank you. I can use my time to help on another OSS project (a Raspberry Pi chess engine).
