ScrappyCocco / HowLongToBeat-PythonAPI

A simple Python API to read data from howlongtobeat
https://pypi.org/project/howlongtobeatpy/
MIT License

Add game year into HowLongToBeatEntry #8

Closed kparal closed 3 years ago

kparal commented 3 years ago

Hi, HLTB changed recently and the game year is no longer a part of the game title. For example, search for God of War, you'll now see two games with the same title, and the year is in a different color on the website. Can you please expose the year in HowLongToBeatEntry, so that we can access this information? Thanks a lot!

ScrappyCocco commented 3 years ago

I'll see whether I'm able to do that!

ScrappyCocco commented 3 years ago

@kparal so basically that's not currently possible, because the release year is not included in the HTML I parse to get the title, image and times.

One solution I see is to add another function, such as:

async_get_release_date_from_id(self, game_id: int)
get_release_date_from_id(self, game_id: int)

This would need to make another request to HLTB to get the full game page, and parse the HTML to find the release date.

Would this idea work for you?

kparal commented 3 years ago

> This would need to make another request to HLTB to get the full game page, and parse the HTML to find the release date

Well it's certainly better than nothing :slightly_smiling_face: But I wonder if you really can't do it during the search request.

When I search for "God of War" I see this:

[Screenshot: HLTB search results for "God of War"]

The title and year are in different colors. The HTML contains this:

<div class="search_list_details">
  <h3 class="shadow_text">
  <a class="text_green" title="God of War" href="game?id=38050">God of War</a>
  <strong class="text_grey">(2018)</strong>
  </h3>

It should be possible to retrieve it from this. Or are you parsing a different page? (The same content seems to be retrieved by the browser from https://howlongtobeat.com/search_results via JavaScript)

ScrappyCocco commented 3 years ago

@kparal as I was saying, that's not currently possible, because the API uses another URL to get a simpler result. If you look into the code, it does a POST request to search_results.php.

The simplified result doesn't contain all that information, just the bare minimum. For every game in the search result there is something like this:

<li class="back_darkish"
   style="background-image:linear-gradient(rgb(31, 31, 31), rgba(31, 31, 31, 0.9)), url('/games/34553_Grip.jpg')">
   <div class="search_list_image">
      <a aria-label="GRIP Combat Racing" title="GRIP Combat Racing" href="game?id=34553">
      <img alt="Box Art" src="/games/34553_Grip.jpg" />
      </a>
   </div>
   <div class="search_list_details">
      <h3 class="shadow_text">
         <a class="text_white" title="GRIP Combat Racing" href="game?id=34553">GRIP: Combat Racing</a>
      </h3>
      <div class="search_list_details_block">
         <div>
            <div class="search_list_tidbit text_white shadow_text">Main Story</div>
            <div class="search_list_tidbit center time_50">13 Hours </div>
            <div class="search_list_tidbit text_white shadow_text">Main + Extra</div>
            <div class="search_list_tidbit center time_40">14 Hours </div>
            <div class="search_list_tidbit text_white shadow_text">Completionist</div>
            <div class="search_list_tidbit center time_00">--</div>
         </div>
      </div>
   </div>
</li>

As you can see, there is no release date information in there. So if you need the release date, the only solution is to make a request to the full game page (such as https://howlongtobeat.com/game?id=38050 ) and write a parser for that page to get the extra information: the release date for now, and maybe something more in the future (such as platforms).

I'll await your opinion on this ^^

kparal commented 3 years ago

I'm sorry for being particularly stubborn today :smile: I read the web request code and sent the same request using curl:

curl -d 'queryString=God of War' -d t=games -d shorthead=popular -d 'sortd=Normal Order' -d plat= -d length_type=main -d length_min= -d length_max= -d detail= 'https://howlongtobeat.com/search_results.php' --user-agent 'Mozilla'

And I received the following html:

<li class="back_darkish" style="background-image:linear-gradient(rgb(31, 31, 31), rgba(31, 31, 31, 0.9)), url('/games/252px-Gowbox.jpg')">
  <div class="search_list_image">
    <a aria-label="God of War" title="God of War" href="game?id=3974">
      <img alt="Box Art" src="/games/252px-Gowbox.jpg" />
    </a>
  </div>
  <div class="search_list_details">
    <h3 class="shadow_text">
      <a class="text_green" title="God of War" href="game?id=3974">God of War</a>
      <strong class='text_grey'>(2005)</strong> </h3>
    <div class="search_list_details_block">
      <div>
        <div class="search_list_tidbit text_white shadow_text">Main Story</div>
        <div class="search_list_tidbit center time_100">9 Hours </div>
        <div class="search_list_tidbit text_white shadow_text">Main + Extra</div>
        <div class="search_list_tidbit center time_100">9&#189; Hours </div>
        <div class="search_list_tidbit text_white shadow_text">Completionist</div>
        <div class="search_list_tidbit center time_100">12&#189; Hours </div>

I can see the year there:

      <strong class='text_grey'>(2005)</strong> </h3>

Of course it is there just for some titles, as shown on the screenshot. GRIP doesn't contain it, because there aren't multiple games with the same title, unlike God of War.

Perhaps I'm still misunderstanding something, but it seems to me that that year can be parsed out during the initial search request, for titles that contain it.
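
A minimal sketch of that idea, using only the standard-library html.parser. The class name SearchSuffixParser is hypothetical and this is not the library's HTMLResultParser; it assumes the markup looks like the snippets quoted above (an h3 with class shadow_text containing the title link and, for some titles, a strong with class text_grey).

```python
from html.parser import HTMLParser


class SearchSuffixParser(HTMLParser):
    """Collects [game_id, title, suffix] from search-result HTML.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.results = []      # one [game_id, title, suffix] list per game
        self._in_h3 = False    # True while inside <h3 class="shadow_text">
        self._capture = None   # "title" or "suffix" while reading text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h3" and attrs.get("class") == "shadow_text":
            self._in_h3 = True
        elif self._in_h3 and tag == "a":
            # href looks like "game?id=3974" in the quoted snippets
            id_text = (attrs.get("href") or "").rpartition("id=")[2]
            game_id = int(id_text) if id_text.isdigit() else None
            self.results.append([game_id, "", None])
            self._capture = "title"
        elif self._in_h3 and tag == "strong" and attrs.get("class") == "text_grey":
            self._capture = "suffix"

    def handle_data(self, data):
        if not self.results:
            return
        if self._capture == "title":
            self.results[-1][1] += data
        elif self._capture == "suffix":
            # Literal copy of the suffix text, parentheses included
            self.results[-1][2] = (self.results[-1][2] or "") + data

    def handle_endtag(self, tag):
        if tag in ("a", "strong"):
            self._capture = None
        elif tag == "h3":
            self._in_h3 = False


sample = """
<h3 class="shadow_text">
  <a class="text_green" title="God of War" href="game?id=3974">God of War</a>
  <strong class='text_grey'>(2005)</strong> </h3>
<h3 class="shadow_text">
  <a class="text_white" title="GRIP Combat Racing" href="game?id=34553">GRIP: Combat Racing</a>
</h3>
"""

parser = SearchSuffixParser()
parser.feed(sample)
for game_id, title, suffix in parser.results:
    print(game_id, title, suffix)
```

Titles without a suffix, such as GRIP, simply come back with None in the third position.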

ScrappyCocco commented 3 years ago

@kparal

> Of course it is there just for some titles, as shown on the screenshot. GRIP doesn't contain it, because there aren't multiple games with the same title, unlike God of War.

Oh well, yes, it could theoretically be read like that, but it's not a long-term solution, which is why I don't like it that much.

First of all, that contains only the year, while others might need the full release date.

Also, it's not in a "stable" place in my opinion, as we saw they just added it. Maybe they'll change the format again and the whole parser will need to be changed. Making a new one to read specific properties from the game page seems a bit more stable and long-term, in case we want things such as Developer/Genres or other data.

But, again, I'm open to hear your opinion/suggestion

kparal commented 3 years ago

I understand your view. If you want to read the game details, you want to do it properly - full metadata, with likely stable field names, etc.

But honestly, that wasn't what I was after when requesting this. In my small app using your library, the game titles used to be unique/unambiguous, because when it was needed, the year was hardcoded into the title, to distinguish "God of War"s etc. When I used my app to search for "God of War", I knew what I was looking for (whether 2005 or 2018) and I knew which option to select. Now they changed it, and these games have the same title. And suddenly I don't have a way to distinguish them - the search results contain "God of War" and "God of War"... :disappointed: . I need to open up the web browser, search for it manually, then look at the game ID of the desired game, and insert it into my app. Before, it was easy, and now it is tiresome. And I'm trying to make it easy again :slightly_smiling_face:

I don't really want to know the full release date, especially by region, as it is provided by the game detail page. For example for Demon's Souls (2009) it lists:

NA: October 06, 2009 EU: June 25, 2010 JP: February 05, 2009

That's too much unnecessary information, and the years even differ, which requires some additional thinking. I'm really just after the super-simple distinguisher HLTB displays right on the search page (so for Demon's Souls that's 2009 or 2020 - clear and simple).
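
To illustrate why the regional dates need that extra thinking, here is a small sketch that parses the line quoted above and collects the distinct years. The function name parse_regional_dates is hypothetical, and the "XX: Month DD, YYYY" format is assumed from that single example, so it may not hold for every game page.

```python
import re
from datetime import datetime

# Matches entries like "NA: October 06, 2009" (format assumed from the
# Demon's Souls line quoted above).
REGION_DATE = re.compile(r"([A-Z]{2}):\s+([A-Za-z]+ \d{1,2}, \d{4})")


def parse_regional_dates(text):
    """Return {region: date} for each 'XX: Month DD, YYYY' entry in text."""
    return {
        region: datetime.strptime(raw, "%B %d, %Y").date()
        for region, raw in REGION_DATE.findall(text)
    }


dates = parse_regional_dates(
    "NA: October 06, 2009 EU: June 25, 2010 JP: February 05, 2009"
)
# More than one distinct year means there is no single obvious "release year"
years = sorted({d.year for d in dates.values()})
print(years)
```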

So here's an idea. HowLongToBeatEntry now contains game_id and game_name. What if you extended it with game_suffix? That would be the suffix as displayed on the search page (if provided, otherwise empty/None). It can be documented this way; you don't need to claim it's the release year or anything. It's just the suffix identifier they add to game titles to make the search results more relevant, and that's what you reprint into that variable. It can be a literal copy, even including the parentheses. If they change the suffix contents in the future, it will either automatically still work in your library (if they don't change the CSS), or perhaps a small CSS patch will be needed, but the semantics won't change - you won't need to rename the variable, it will keep its meaning. Of course it doesn't need to be named game_suffix, it can be game_distinguisher or game_discriminator or search_distinguisher or similar (English is not my native language, perhaps there's a better word). What do you think?
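
A rough sketch of how such a field could be used on the caller's side. The Entry class below is a hypothetical stand-in with only the three fields discussed here, not the library's actual HowLongToBeatEntry, and display_label is an illustrative helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """Hypothetical stand-in for HowLongToBeatEntry (three fields only)."""
    game_id: int
    game_name: str
    game_suffix: Optional[str] = None  # literal copy, e.g. "(2005)"; None if absent


def display_label(entry):
    """Join name and suffix so same-titled results stay distinguishable."""
    if entry.game_suffix:
        return f"{entry.game_name} {entry.game_suffix}"
    return entry.game_name


# IDs and suffixes taken from the HTML snippets quoted in this thread
results = [
    Entry(3974, "God of War", "(2005)"),
    Entry(38050, "God of War", "(2018)"),
    Entry(34553, "GRIP: Combat Racing"),
]
labels = [display_label(e) for e in results]
print(labels)
```

Because the suffix is stored as an opaque string, the caller never has to interpret it as a year.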

ScrappyCocco commented 3 years ago

Oh, I like this game_suffix idea. I'll try to implement it and I'll keep you posted! I'm a bit busy with work, so it might take a few days though.

ScrappyCocco commented 3 years ago

The edit seems to work. I'll just check the code, and I should be able to push the new version tomorrow.

ScrappyCocco commented 3 years ago

Done with release 0.1.18

I hope it works as expected (and let's hope they're not gonna change the HTML/style again ahah)

Thank you for using my library

kparal commented 3 years ago

Thanks! I'll try to test it soon.

Btw, I looked at HTMLResultParser.py. You might want to consider using BeautifulSoup, it will make your life much easier ;-) Cheers.

ScrappyCocco commented 3 years ago

> Thanks! I'll try to test it soon.
>
> Btw, I looked at HTMLResultParser.py. You might want to consider using BeautifulSoup, it will make your life much easier ;-) Cheers.

Maybe, yes ahah, but I never expected this library to come this far. Also, at first it was basically a Python clone of ckatzorke/howlongtobeat, so I used the base Python HTML parser to keep it similar.

It could indeed be reworked, but I have no plans to do it for now