AWeirdDev / flights

Fast, robust Google Flights scraper (API) for Python. (Probably)
https://pypi.org/project/fast-flights

No results in sample code #1

Closed alxcnwy closed 2 weeks ago

alxcnwy commented 6 months ago

Hi,

First of all well done - awesome project!

I get the following error when I try to run the sample code:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[2], line 31
     28 print("The price is currently", result.current_price)
     30 # Display the first flight
---> 31 print(result.flights[0])

IndexError: list index out of range

result is Result(current_price='', flights=[])

plz halp :)

P.S. tried adding you on discord but your id isn't working. DM me on twitter (just followed you) - I'm looking to use this API for something fun :)

AWeirdDev commented 6 months ago

Hi there,

Were you trying to run the example code? If so, I've run it on several platforms and everything works as intended.

May I ask for additional details about which airport filters you added? They're inside FlightData and are named from_airport and to_airport.

Given the result dataclass provided (Result(current_price='', flights=[])), it is possible that no flights were found based on the current filter. You can visualize the search on Google Flights to see if there's also no results.
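
As a defensive pattern, you can guard against the empty list before indexing. The Result stand-in below is a minimal sketch mirroring the shape shown above, not the library's actual class:

```python
from dataclasses import dataclass, field

# Minimal stand-in mirroring the Result shape shown above (sketch only).
@dataclass
class Result:
    current_price: str = ""
    flights: list = field(default_factory=list)

def first_flight_or_none(result: Result):
    # Avoid IndexError by checking for an empty flights list first.
    return result.flights[0] if result.flights else None
```

With an empty result this returns None instead of raising IndexError.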

P.S. Are you making an AI integration? This project was made because I wanted to see if AI can search for flights, so here we are ;)

Cheers, AWeirdDev

alxcnwy commented 6 months ago

Hey I just ran the sample code from the readme with no changes which is what made me raise the issue 😅


AWeirdDev commented 6 months ago

Hi again,

That's weird, since on most clients the code works perfectly fine. We can do a little troubleshooting to better understand what's going on here.

  1. Re-install dependencies. Sometimes this kind of issue is caused by selectolax. We can try re-installing it and see if the error persists.

    $ pip uninstall selectolax -y
    $ pip install -U selectolax

  2. Generate a search visualization URL. We can use TFSData (from create_filter()) to get the Base64 string and print the URL.

    filter = create_filter(...)  # Your filter
    b64 = filter.as_b64().decode("utf-8")
    print("https://www.google.com/travel/flights?tfs=%s" % b64)

Copy the URL and open the page in a web browser. If there are no flight results there either, that points to a "region," "date," or "airport" issue; try changing the parameters.

  3. API. I can host a dedicated API if none of the above works. That way, no errors should persist on your side (we'll keep it on Vercel, though).

Best, AWeirdDev

alxcnwy commented 6 months ago

Working now, thanks!

IHannes commented 6 months ago

Hi,

I have the same issue, but unfortunately none of the above-mentioned fixes work. The link generated from the filter is correct and works in a web browser, but I don't get any results when trying to scrape.

Any help would be much appreciated!

Greetings from Germany

AWeirdDev commented 6 months ago

Hi there,

During the development of this project, I did get some uncaught errors when using selectolax to parse the HTML contents even though the selectors and responses were all functional. I'll inspect the code now and keep you updated.

Best, AWeirdDev

IHannes commented 6 months ago

Hey AWeirdDev,

Thank you so much for looking into it, I really appreciate all your effort! This is the output of the request_flights function; maybe it will help you.

Best regards

Hannes

output.txt

AWeirdDev commented 6 months ago

Hey there,

Thanks for providing the HTML output! I did some digging based on it, and it seems this line (from source):

https://github.com/AWeirdDev/flights/blob/978c70b2ec03307aef459ccd90b4e092510e4b43/fast_flights/core.py#L45

is not working. The main issue is that the parser cannot select div[jsname="IWWDBc"] (which contains "best flights") or div[jsname="YdtKid"] (which contains "other flights").
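
A quick way to check your own saved response for those selectors is a plain substring search (a sketch; "output.html" is a hypothetical filename, not one the library creates):

```python
# Sketch: check whether the attribute markers fast_flights relies on
# appear anywhere in a saved Google Flights response.
NEEDLES = ('jsname="IWWDBc"', 'jsname="YdtKid"')

def selectors_present(html: str) -> dict:
    """Map each marker to whether it occurs in the HTML string."""
    return {needle: needle in html for needle in NEEDLES}

# Example usage:
# html = open("output.html", encoding="utf-8").read()
# print(selectors_present(html))
```

If both come back False, the parser has nothing to select and the flights list will be empty.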

I copied the HTML output to an online HTML viewer, and this was what I got.

[screenshot: Google-Flights-Very-Googlish]

And to prove that it's not injected by Google in runtime:

[screenshot: IMG_2771]

This is indeed the main reason!


A quick recap on what happened:

Cheers, AWeirdDev

IHannes commented 6 months ago

Thank you so much for your Help!

I added:

    cookies = {
        "CONSENT": "PENDING+987",
        "SOCS": "CAESHAgBEhJnd3NfMjAyMzA4MTAtMF9SQzIaAmRlIAEaBgiAo_CmBg"
    }

and

    def request_flights(tfs: TFSData) -> requests.Response:
        r = requests.get(
            "https://www.google.com/travel/flights",
            params={
                "tfs": tfs.as_b64(),
                "hl": "en",
                "tfu": "EgQIABABIgA",  # show all flights and prices condition
            },
            headers={"user-agent": ua, "accept-language": "en"},
            cookies=cookies,
        )

and it works now!!

AWeirdDev commented 6 months ago

Hi again,

That's great news. I've updated the project (v0.3) and now you can add custom **kwargs such as cookies to requests.get so there's no need to clone the source.

# tag: v0.3
get_flights(filter, cookies={…}, proxies={…}, ...)
Details:

Commit: https://github.com/AWeirdDev/flights/commit/696885c9364d68f38b6563b1e6d61a1f4fd33705
Install v0.3: `pip install fast-flights==0.3`

Additionally, the cookie provided (CAESHAgBEhJnd3NfMjAyMzA4MTAtMF9SQzIaAmRlIAEaBgiAo_CmBg) is also a Protobuf string, so I plan to support bypassing this (ToS) screen when I have time.
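
For the curious, that SOCS value is URL-safe Base64; a minimal sketch decoding it back to raw protobuf bytes (the gws build string and the "de" locale are visible in the output):

```python
import base64

socs = "CAESHAgBEhJnd3NfMjAyMzA4MTAtMF9SQzIaAmRlIAEaBgiAo_CmBg"
# URL-safe alphabet; pad to a multiple of 4 before decoding.
raw = base64.urlsafe_b64decode(socs + "=" * (-len(socs) % 4))
print(raw)  # raw protobuf bytes with b"gws_20230810-0_RC2" and b"de" embedded
```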

Cheers, mate!

witoldprzygoda commented 6 months ago

First of all - thanks for sharing your work! Second - I tried all of the above and I get exactly the same error.

  1. Tried both with installed fast-flights and a cloned repo
  2. Reinstalled selectolax
  3. cookies = { "CONSENT": "YES+" } and result = get_flights(filter, cookies=cookies)
  4. The example I try is present on GF: https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDYtMDNqBRIDS1JLcgUSA1NaWUIBAUgBmAEC
  5. The same with your genuine example: https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDctMDJqBRIDVFBFcgUSA01ZSkIDAQECSAGYAQI=

No prices; result.flights is just an empty list.

A bit of debug:

    def get_flights(tfs: TFSData, **kwargs: Any) -> Result:
        print(tfs, kwargs)
        rs = request_flights(tfs, **kwargs)
        results = parse_response(rs)
        return rs
        # return results

I get TFSData('hello', flight_data=[FlightData(date='2024-06-03', from_airport=KRK, to_airport=SZY)]) {'cookies': {'CONSENT': 'YES+'}}, and then rs is just <Response [200]>.

AWeirdDev commented 6 months ago

Hey there,

Could you provide me with what requests returned? I'll need to inspect the HTML on your end:

from fast_flights.core import request_flights

r = request_flights(tfs, cookies=cookies)
with open(".html", "wb") as f:
  f.write(r.content)

...and please provide me with the created .html file.

Best, AWeirdDev

witoldprzygoda commented 6 months ago

Interesting... is it again about consent to Google's Terms, despite the fact that the cookies are approved? I see "Before you continue" inside the HTML.

I added in the code

    cookies = { "CONSENT": "YES+" }
    result = get_flights(filter, cookies=cookies)

    r = request_flights(filter, cookies=cookies)
    with open("test.html", "wb") as f:
        f.write(r.content)

Thanks!

AWeirdDev commented 6 months ago

Hey there,

I searched on Stack Overflow again and this may work:

cookies = {
    "CONSENT": "PENDING+987",
    "SOCS": "CAESHAgBEhJnd3NfMjAyMzA4MTAtMF9SQzIaAmRlIAEaBgiAo_CmBg"  # protobuf
}
Additional details:

As for `SOCS`, this is my approach to parsing it as Protobuf (the original "uint" is not a valid proto3 type, so uint64 is used here):

    syntax = "proto3";

    message Information {
      uint64 seed_a = 1;   // unsolved
      string gws = 2;      // gws_YYYYMMDD-0_RC2
      string locale = 3;   // e.g., de, en...
      uint64 seed_b = 4;   // unsolved
    }

    message Datetime {
      uint64 timestamp = 1;  // e.g., 1692144000
    }

    message SOCS {
      uint64 seed = 1;
      Information info = 2;
      Datetime datetime = 3;
    }

Please let me know if adding SOCS fixes the issue.

Regards, AWeirdDev

witoldprzygoda commented 6 months ago

Nope :-(

The list (if result is uncommented) is still empty. I attach the (now pretty large) test.html file: test2.zip. And the test.py script: test.py.zip

AWeirdDev commented 6 months ago

test.txt: https://github.com/AWeirdDev/flights/files/15445695/test.txt

I adjusted the source code a little bit, and here's what I got:

[screenshot: IMG_2780]

And in this line from source:

https://github.com/AWeirdDev/flights/blob/c6e4d964e7eee673ae85f1f2ce3a18ff63147ef6/fast_flights/core.py#L49

It removes the last item from the list, causing the ONLY result to be dropped. Originally, during development, I added this as a safety measure since there was a huge chance of exiting/crashing for no reason on the last found result.

If you cloned from source, just remove the [:-1]. However, I cannot guarantee that the result will look pretty (as seen in my Replit Mobile screenshot, some random characters pop out).
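
To see why the slice hid a lone result: `[:-1]` drops the last element, so a one-element list becomes empty:

```python
# A one-element result list is emptied by the trailing safety slice.
results = ["only_flight"]
print(results[:-1])  # → []
```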

In the next version, I'll add the dangerously_allow_looping_last_item to the options as well as the currency options and cookie customization/improvement.

Thanks for pointing this out!

Cheers, AWeirdDev

witoldprzygoda commented 6 months ago

Thanks for the debugging! Usually it is much easier for the developer to find the problem :-) However, am I right that I do not see the price in the output now? (At the moment it is 210 PLN.)

Result(current_price='high', flights=[Flight(is_best=True, name='LOT', departure='4:55 PM on Mon, Jun 3', arrival='6:05 PM on Mon, Jun 3', arrival_time_ahead='', duration='1 hr 10 min', stops=0, delay=None)])

I would expect a price: float field in the Flight class, but I have no idea which selector in core.py would be good, e.g. price = safe(item.css_first("HERE_SOMETHING")).text(). This HTML output has multiple occurrences of the value...

OK, I tried something like this:

            # Get price
            price_text = safe(item.css_first("div.YMlIz.FpEdX.jLMuyc > span")).text()
            try:
                current_price = float(price_text.replace('PLN', '').strip())
            except ValueError:
                current_price = None

plus

from dataclasses import dataclass
from typing import Optional

@dataclass
class Flight:
    is_best: bool
    name: str
    departure: str
    arrival: str
    arrival_time_ahead: str
    duration: str
    stops: int
    delay: Optional[str]
    price: float

but to be honest, this hard-wires PLN, and I have no idea whether the selector "div.YMlIz.FpEdX.jLMuyc > span" is generic. Perhaps one should somehow force a given currency (e.g., USD or EUR).
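
A currency-agnostic parsing sketch (my own illustration, not part of the library) that avoids hard-wiring PLN. It assumes ',' is a thousands separator; locales that use a decimal comma would need extra handling:

```python
import re

def parse_price(text: str):
    """Pull the first number out of strings like 'PLN 210' or '$1,234'.
    Assumes ',' is a thousands separator; returns None if nothing parses."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if not match:
        return None
    return float(match.group().replace(",", ""))
```

This drops the currency symbol entirely, so if you scan mixed-currency results you would still want to record which currency each page used.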

Another thing that might be missing is the information returned for round trips etc. I am able to define a correct filter = create_filter(...) with flight_data=[...] holding a list of dates. But the output is e.g.

Flight(is_best=True, name='LOT', departure='4:55 PM on Mon, Jun 10', arrival='6:05 PM on Mon, Jun 10', arrival_time_ahead='', duration='1 hr 10 min', stops=0, delay=None, price=288.0)

so I can see only the departure date (Jun 10) and not the return flight (e.g., Jun 17). This is an issue for a real scanner :-) and even more so for multi-leg trips: without information on every leg and date, the tool is not usable.

One more remark. I can write a script with nested loops and then print the departure, arrival, date, and price. However, I immediately see that many queries are lost because the result data did not load. E.g., a mixture of flights from Europe to Japan (see the URL on the second line):

https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDFqBRIDQVJOcgUSA1RZTxoaEgoyMDI0LTA5LTA4agUSA1RZT3IFEgNBUk5CAQFIAZgBAQ==
Departure: ARN, Arrival: TYO, Departure Date: 2024-09-01, Return Date: 2024-09-08, Price: 4014.0
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDFqBRIDQVJOcgUSA1RZTxoaEgoyMDI0LTA5LTA5agUSA1RZT3IFEgNBUk5CAQFIAZgBAQ==
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDFqBRIDQVJOcgUSA1RZTxoaEgoyMDI0LTA5LTEwagUSA1RZT3IFEgNBUk5CAQFIAZgBAQ==
Departure: ARN, Arrival: TYO, Departure Date: 2024-09-01, Return Date: 2024-09-10, Price: 2940.0
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDFqBRIDQVJOcgUSA1RZTxoaEgoyMDI0LTA5LTExagUSA1RZT3IFEgNBUk5CAQFIAZgBAQ==
Departure: ARN, Arrival: TYO, Departure Date: 2024-09-01, Return Date: 2024-09-11, Price: 2940.0
......
......
(after a while it gets even worse) - no parsing at all

Departure: ARN, Arrival: TYO, Departure Date: 2024-09-02, Return Date: 2024-09-10, Price: 3857.0
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDJqBRIDQVJOcgUSA1RZTxoaEgoyMDI0LTA5LTExagUSA1RZT3IFEgNBUk5CAQFIAZgBAQ==
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDJqBRIDQVJOcgUSA1RZTxoaEgoyMDI0LTA5LTEyagUSA1RZT3IFEgNBUk5CAQFIAZgBAQ==
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDJqBRIDQVJOcgUSA1RZTxoaEgoyMDI0LTA5LTEzagUSA1RZT3IFEgNBUk5CAQFIAZgBAQ==
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDJqBRIDQVJOcgUSA1RZTxoaEgoyMDI0LTA5LTE0agUSA1RZT3IFEgNBUk5CAQFIAZgBAQ==
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDNqBRIDQVJOcgUSA1RZTxoaEgoyMDI0LTA5LTEwagUSA1RZT3IFEgNBUk5CAQFIAZgBAQ==
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDNqBRIDQVJOcgUSA1RZTxoaEgoyMDI0LTA5LTExagUSA1RZT3IFEgNBUk5CAQFIAZgBAQ==
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDNqBRIDQVJOcgUSA1RZTxoaEgoyMDI0LTA5LTEyagUSA1RZT3IFEgNBUk5CAQFIAZgBAQ==
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDNqBRIDQVJOcgUSA1RZTxoaEgoyMDI0LTA5LTEzagUSA1RZT3IFEgNBUk5CAQFIAZgBAQ==
......
......
then it shows up sometimes

https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDFqBRIDQVJOcgUSA09TQRoaEgoyMDI0LTA5LTA5agUSA09TQXIFEgNBUk5CAQFIAZgBAQ==
Departure: ARN, Arrival: OSA, Departure Date: 2024-09-01, Return Date: 2024-09-09, Price: 2789.0
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDFqBRIDQVJOcgUSA09TQRoaEgoyMDI0LTA5LTEwagUSA09TQXIFEgNBUk5CAQFIAZgBAQ==
Departure: ARN, Arrival: OSA, Departure Date: 2024-09-01, Return Date: 2024-09-10, Price: 3918.0
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDFqBRIDQVJOcgUSA09TQRoaEgoyMDI0LTA5LTExagUSA09TQXIFEgNBUk5CAQFIAZgBAQ==

My guess is that something should be done within def request_flights(tfs: TFSData, **kwargs: Any) -> requests.Response:, like checking the content of the response:

       try:
            r.raise_for_status()
            if "Best departing flights" in r.text:
                return r

plus some retries or pauses (sleep). Obviously, this will terribly slow down the scan, so it should probably be done... in parallel. The script (if you wish to test): japan_scan.py.zip

This seems to help in my testing:

import time

import requests
from requests.exceptions import HTTPError

def request_flights(tfs: TFSData, **kwargs: Any) -> requests.Response:
    max_retries = 4   # Maximum number of retries
    wait_time = 10    # Time to wait between retries, in seconds

    for attempt in range(max_retries):
        r = requests.get(
            "https://www.google.com/travel/flights",
            params={
                "tfs": tfs.as_b64(),
                "hl": "en",
                "tfu": "EgQIABABIgA",  # show all flights and prices condition
            },
            headers={"user-agent": ua, "accept-language": "en"},
            **kwargs
        )

        try:
            r.raise_for_status()
            # Check if the expected results are ready
            # (adjust this marker based on the actual response)
            if "Best departing flights" in r.text:
                return r
        except HTTPError as http_err:
            print(f"HTTP error occurred: {http_err}")
        except Exception as err:
            print(f"Other error occurred: {err}")

        # Wait before retrying
        time.sleep(wait_time)

    # Raise an error if the maximum number of retries is reached
    r.raise_for_status()
    return r

but this is not the golden solution. There are still responses which do not come back with the result in time... even when retrying them (perhaps one should increase the timeout with every attempt).
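
On parallelizing the scan: a sketch using the standard library's thread pool. The fetch callable here is a placeholder standing in for a wrapper around request_flights/parse_response:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scan(urls, fetch, max_workers=4):
    """Run fetch(url) concurrently and collect results (or exceptions) per URL."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep the error instead of aborting the scan
                results[url] = exc
    return results
```

Keep max_workers modest; hammering Google Flights concurrently likely makes the missing-result problem worse, not better.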

AWeirdDev commented 5 months ago

Hi there,

Thanks for the huge dedication! I'll look into this tomorrow.

P.S. I sincerely apologize for the huge delay. I've been busy this week.

Regards, AWeirdDev

witoldprzygoda commented 5 months ago

Hi,

First of all, thank you for dedicating a significant amount of time to writing code that works and can serve as a starting point for further development. Many scrapers don't work at all, as you have likely noticed. I believe your code has the potential to become a cool tool, operating differently from all those Selenium simulators. On another note, I'm working on something similar with the ITA Matrix service, which is an excellent source for both prices and airfares. In its new version, the query URL is built by encoding JSON to Base64, and retrieving the result requires browser emulation.

(side example for ITA Matrix)

import requests
import base64
import json

# Your Base64 encoded query string
encoded_query = "eyJ0eXBlIjoib25lLXdheSIsInNsaWNlcyI6W3sib3JpZ2luIjpbIkpGSyJdLCJkZXN0IjpbIkxBWCJdLCJyb3V0aW5nIjoiIiwiZXh0IjoiIiwicm91dGluZ1JldCI6IiIsImV4dFJldCI6IiIsImRhdGVzIjp7InNlYXJjaERhdGVUeXBlIjoiY2FsZW5kYXIiLCJkZXBhcnR1cmVEYXRlIjoiMjAyNC0wOS0xNSIsImRlcGFydHVyZURhdGVUeXBlIjoiZGVwYXJ0IiwiZGVwYXJ0dXJlRGF0ZU1vZGlmaWVyIjoiMCIsImRlcGFydHVyZURhdGVQcmVmZXJyZWRUaW1lcyI6W10sInJldHVybkRhdGVUeXBlIjoiZGVwYXJ0IiwicmV0dXJuRGF0ZU1vZGlmaWVyIjoiMCIsInJldHVybkRhdGVQcmVmZXJyZWRUaW1lcyI6W119fV0sIm9wdGlvbnMiOnsiY2FiaW4iOiJDT0FDSCIsInN0b3BzIjoiLTEiLCJleHRyYVN0b3BzIjoiMSIsImFsbG93QWlycG9ydENoYW5nZXMiOiJ0cnVlIiwic2hvd09ubHlBdmFpbGFibGUiOiJ0cnVlIiwiY3VycmVuY3kiOnsiZGlzcGxheU5hbWUiOiJVbml0ZWQgU3RhdGVzIERvbGxhciAoVVNEKSIsImNvZGUiOiJVU0QifSwic2FsZXNDaXR5Ijp7ImNvZGUiOiJOWUMiLCJuYW1lIjoiTmV3IFlvcmsifX0sInBheCI6eyJhZHVsdHMiOiIxIn19"

# Form the URL
base_url = "https://matrix.itasoftware.com/calendar?search="
full_url = f"{base_url}{encoded_query}"

# Custom headers (replace these with actual headers from your browser)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://matrix.itasoftware.com/",
    "Cookie": "YOUR_SESSION_COOKIES_HERE"
}

# Send the GET request with custom headers
response = requests.get(full_url, headers=headers)

# Check the response
if response.status_code == 200:
    print(response.url)
    print(response.text)  # You can parse this HTML using BeautifulSoup as needed
else:
    print(f"Failed to retrieve the data. Status code: {response.status_code}")

AWeirdDev commented 5 months ago

Hey,

Thanks. I've added your recommendation (see Project). This could be an alternative when Google Flights isn't working properly.

As a side note, we won't be using BeautifulSoup since it would prolong the flow and ultimately block our script.

Thanks for your contribution!

Regards, AWeirdDev