Add support for AKAs scrap through GraphQL

mhdzumair commented 4 months ago

@uyar I tried to add support for GraphQL API parser endpoint. Not sure about your plan for this repo. However, there are several GraphQL endpoints that can be added to scrap & parse the data.

uyar commented 4 months ago

Thanks a lot for your work! It looks very good. I'll write my comments on the individual commits.

I sure plan to continue working on this repo. There are a few basic issues I haven't sorted out yet, like whether to use jmespath or jq for parsing dict data and how to decide which pages to load based on which keys the user wants. Since I've started a new job recently I didn't have the time to focus on it.

uyar commented 4 months ago

In general, I am in favor of merging this after we sort out the questions.

mhdzumair commented 4 months ago

whether to use jmespath or jq for parsing dict data

I heard jmespath for the first time and looked into their docs, It's pretty impressive and work natively with python. I haven't used this so far. On the other hand, the python binding for jq also there. with quick look, IMO jmespath: works well with python syntax, readable & maintainable.

uyar commented 4 months ago

An earlier version of cinemagoerng used jmespath and I'm thinking of going back to it. The problem I had with it was that I could not reference the current node, something I needed in some cases. I also wanted to remove the extra query syntax and handle eveything with lxml but it's probably not worth it. jq looks better to me but I haven't checked whether it works on Windows.

uyar commented 4 months ago

OK, I've gone through this. Your code is very much in line with the project and there doesn't seem to be any better way to get the AKAs. So I'll merge it soon, probably after making a few small changes. I have to check why UpdatePage was needed; I would have just added "akas" to TitlePage. And I also have to check what is_alternative is for. It's already an alternative title by definition.

Conceptually, I don't want to use the term "AKA". I think there is an "original_title" (a string) and a list of "titles" (records with countries and notes).

uyar commented 4 months ago

I also have to check what is_alternative is for. It's already an alternative title by definition.

I guess this is for the case where a country has multiple AKAs for the movie. This seems to be unnecessarily complicated. My hunch is, instead of:

akas: list[Aka]

going with something like:

akas: dict[str, list[Aka]]

where the key is the country, and the first one in the list is the default for that country.

mhdzumair commented 4 months ago

I have to check why UpdatePage was needed; I would have just added "akas" to TitlePage.

This is needed because GraphQL doesn't return the IMDb ID or title of the movie. As those are set as required values on the model, therefore, I have to use them in the update method.

what is_alternative is for. It's already an alternative title by definition.

This is used for countries that contain two or more "aka" titles. Take a look at the matrix "aka" page.

mhdzumair commented 4 months ago

where the key is the country, and the first one in the list is the default for that country.

Actually, I need to get both values for "aka" titles, specifically the is_alternative that contains the language-specific title. Therefore, I would think we don't need to limit the data too much; rather, it can be utilized from the developers' code as per their requirements.

uyar commented 4 months ago

I cherry-picked your changes. I've merged the test case value fix and the GraphQL/AKA commits (with minor reorganizations).

I've excluded the commit about URL hashes for cache filenames. Instead I've implemented a different cache file naming approach.

If all is fine, I'm going to close this PR.

Thanks!

mhdzumair commented 4 months ago

Nice. You can close this PR. I was worked on the parental guide page parsing. need to do minor changes to parsing the data. https://github.com/cinemagoer/cinemagoerng/blob/main/cinemagoerng/web.py#L85-L93

I was thinking the best way to parse it automatically for episode, aka, etc otherwise the current if else condition will be increase for each parser.

I have two idea.

~~Make two or more post_process functions defined in specs, this way we could have a post process method to parse the data into deserialize function~~ This will be an issue for get_title function since it's trying to deserialize from root.
Make deserialize class with function mapping, and parse it in update function.

LMK your suggestion

uyar commented 4 months ago

Not really sure. Maybe see a few more special cases first?

Doesn't typeload already deserialize them into Aka objects? Do you need to deserialize afterwards?

uyar commented 4 months ago

An earlier version of cinemagoerng used jmespath and I'm thinking of going back to it

I've implemented this. I think the spec files are nicer now. Let's hope it turns out to be worth stabilizing parsers on it.

uyar commented 4 months ago

Doesn't typeload already deserialize them into Aka objects? Do you need to deserialize afterwards?

Never mind, I see why. It's been a long time and I forget the code.

cinemagoer / cinemagoerng

Add support for AKAs scrap through GraphQL #4