Badaro / MTGODecklistCache.Tools

Tools used to update MTGODecklistCache.

Review Manatraders Scraper #16

Open Badaro opened 2 months ago

Badaro commented 2 months ago

Manatraders has changed their website to hide the names of the players aside from the Top 8. https://www.manatraders.com/tournaments/53

That by itself is not a problem... except you can no longer find the Top 8 players in the decklists CSV. My guess is that the anonymization logic was done in a hurry and they're hiding the player names in the CSV but not the standings, but this makes the CSV fairly useless as you can't match the Top 8 players to their deck.

The fact that you can get the player names by simply opening the decklists is strong evidence that this implementation was not very well thought out. As an example, just by clicking on the decklist page you can find out that "L**g" is "Lordegg".

For now I disabled the scraper, but there are a few options going forward:

  1. Wait and see if Manatraders will fix/improve this.
  2. The anonymization logic seems incredibly trivial - looks like it's just [first_letter][bunch_of_asterisks][last_letter], and if this theory is correct I could "anonymize" the Top 8 players myself and match them to the decklists. There's a risk of duplicates but we can wait to see if that'll become a problem or drop the duplicates to preserve the rest of the decklists.
  3. Ignore the CSVs and scrape directly from the website. Not hard to do, but a lot of work for 1 tournament/month. It would also have the issue that the rounds data wouldn't fully match the standings, as that is only available as a CSV.

Option 2 is likely the best solution for now: it's easy to implement and ensures compatibility. I'll probably wait a few weeks, though, to see if there are more changes to the website. Considering how trivial it is to bypass the current anonymization logic, I'm assuming further changes will be necessary.
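To make option 2 concrete, here is a minimal Python sketch of the idea, not code from this repository. It assumes the mask really is first character, a fixed run of asterisks, last character (the "L**g" / "Lordegg" example suggests the number of asterisks does not track the name length, hence the `stars` parameter). The names `mask_name`, `match_top8` and `csv_masks` are hypothetical.

```python
from collections import defaultdict


def mask_name(name: str, stars: int = 2) -> str:
    """Assumed mask: first character + fixed run of asterisks + last character."""
    if not name:
        return name
    return f"{name[0]}{'*' * stars}{name[-1]}"


def match_top8(top8_names, csv_masks, stars: int = 2):
    """Map each Top 8 name to a masked name found in the CSV.

    Top 8 players whose masks collide with each other, or whose mask is
    missing from the CSV, are returned in `dropped` instead of being guessed.
    """
    by_mask = defaultdict(list)
    for name in top8_names:
        by_mask[mask_name(name, stars)].append(name)

    matched, dropped = {}, []
    for masked, names in by_mask.items():
        if len(names) == 1 and masked in csv_masks:
            matched[names[0]] = masked
        else:
            dropped.extend(names)
    return matched, dropped
```

For example, `match_top8(["Lordegg"], {"L**g"})` would pair "Lordegg" with "L**g". A collision between a Top 8 player and a non-Top 8 player cannot be detected from the names alone, so this only covers the duplicates that are visible from the standings.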

Badaro commented 2 months ago

@aliquanto3 mentioned to me on Discord that there are some players with more than 200 cards listed in the CSV, and given this is a Duel Commander tournament this should've been impossible for a single player.

This confirms that the anonymization logic in place is trivial and causes duplicates, and means that besides the issues mapping the Top 8 players we also have no way to match those duplicates correctly if we stick to the CSV.
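A quick way to spot these merged entries is to total the cards per masked name and flag anything over the deck size. The sketch below is only an illustration: the column names `Player` and `Quantity` are assumptions about the CSV layout, and `deck_limit=100` assumes a 100-card Duel Commander deck, so adjust both to the actual file and format.

```python
import csv
import io
from collections import defaultdict


def find_merged_players(csv_text, deck_limit=100,
                        player_col="Player", qty_col="Quantity"):
    """Return {masked_name: total_cards} for names that exceed deck_limit.

    A total above the limit suggests several players share the same mask.
    Column names and the limit are assumptions, not the real CSV schema.
    """
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row[player_col]] += int(row[qty_col])
    return {player: total for player, total in totals.items()
            if total > deck_limit}
```

Names flagged this way can't be split back into individual decklists from the CSV alone, but they can at least be excluded so the rest of the tournament stays usable.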

Spigushe commented 2 months ago

As @Aliquanto3 said on X, I'd be able to publish a Jupyter Notebook for scraping decklists with players' aliases. Rounds won't be usable anyway, except for the Top 8, where a bit of prediction could help rebuild their path through the rounds.

Badaro commented 1 month ago

Looks like Manatraders eventually fixed the Standings, but not the Pairings. For now I'll remove those and scrape the remaining information.