Closed alxcnwy closed 2 weeks ago
Hi there,
Were you trying to run the example code? If so, I've run the code on different platforms and everything seems to be working as intended.
May I ask which airport filters you added? They're inside FlightData and are named from_airport and to_airport.
Given the result dataclass provided (Result(current_price='', flights=[])), it is possible that no flights were found based on the current filter. You can visualize the search on Google Flights to see if there are also no results there.
P.S. are you making an ai integration? this project was made because i wanted to see if ai can search for flights, so here we are ;)
Cheers, AWeirdDev
Hey I just ran the sample code from the readme with no changes which is what made me raise the issue 😅
Hi again,
That's weird, since on most clients the code works perfectly fine. We can do a little troubleshooting to better understand what's going on here.
Re-install dependencies. Sometimes this kind of issue is caused by selectolax. We can try re-installing and see if the error still persists.
$ pip uninstall selectolax -y
$ pip install -U selectolax
Generate a search visualization URL. We can use TFSData (from create_filter()) to get the Base64 string and print the URL.
filter = create_filter(...) # Your filter
b64 = filter.as_b64().decode('utf-8')
print(
"https://www.google.com/travel/flights?tfs=%s" % b64
)
Copy the URL and open it in a web browser. If there's no flight data there either, the cause is probably the region, date, or airport parameters. Try changing them.
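To double-check what a generated URL encodes before opening it in a browser, the tfs value can also be decoded locally. The sketch below is only a rough sanity check, not a Protobuf parser: decode_tfs and summarize_tfs are hypothetical helper names, and the byte pattern used to find airport codes is an assumption based on how one sample URL posted in this thread happens to decode.

```python
import base64
import re
from urllib.parse import parse_qs, urlparse


def decode_tfs(url: str) -> bytes:
    """Return the raw bytes behind the `tfs` query parameter of a search URL."""
    value = parse_qs(urlparse(url).query)["tfs"][0]
    # parse_qs turns '+' into a space; undo that, and accept the URL-safe alphabet.
    value = value.replace(" ", "+").replace("-", "+").replace("_", "/")
    value += "=" * (-len(value) % 4)  # restore any stripped padding
    return base64.b64decode(value)


def summarize_tfs(raw: bytes) -> dict:
    """Pull out ISO dates and 3-letter codes (assumed layout, not a documented format)."""
    dates = [d.decode() for d in re.findall(rb"\d{4}-\d{2}-\d{2}", raw)]
    # Assumption: each airport code is a 3-byte string preceded by the bytes 0x12 0x03.
    airports = [a.decode() for a in re.findall(rb"\x12\x03([A-Z]{3})", raw)]
    return {"dates": dates, "airports": airports}


url = "https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDYtMDNqBRIDS1JLcgUSA1NaWUIBAUgBmAEC"
summary = summarize_tfs(decode_tfs(url))
print(summary)
```

For this URL the summary contains the date 2024-06-03 and the codes KRK and SZY. If the decoded values match the intended search but the scraper still returns nothing, the filter itself is probably not the problem.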
Best, AWeirdDev
Working now, thanks!
Hi,
i have the same issue but unfortunately none of the above mentioned fixes works. The link that is generated based on the filter is correct and works in a web browser but i don't get any results when trying to scrape.
Any help would be much appreciated!
Greetings from Germany
Hi there,
During the development of this project, I did get some uncaught errors when using selectolax to parse the HTML contents, even though the selectors and responses were all functional. I'll inspect the code now and keep you updated.
Best, AWeirdDev
Hey AWeirdDev,
thank you so much for looking into it, i really appreciate all your effort! This is the output of the request_flights function, maybe it will help you.
Best regards
Hannes output.txt
Hey there,
Thanks for providing the HTML output! I did some digging based on your output, and this line (from source):
does not seem to be working. The main issue is that the parser cannot select div[jsname="IWWDBc"] (contains "best flights") and div[jsname="YdtKid"] (contains "other flights").
I copied the HTML output to an online HTML viewer, and this was what I got.
And to prove that it's not injected by Google in runtime:
This is indeed the main reason!
A quick recap of what happened:
line 45 failed to select items
Cheers, AWeirdDev
Thank you so much for your Help!
I added
cookies = {
    "CONSENT": "PENDING+987",
    "SOCS": "CAESHAgBEhJnd3NfMjAyMzA4MTAtMF9SQzIaAmRlIAEaBgiAo_CmBg",
}
and
def request_flights(tfs: TFSData) -> requests.Response:
    r = requests.get(
        "https://www.google.com/travel/flights",
        params={
            "tfs": tfs.as_b64(),
            "hl": "en",
            "tfu": "EgQIABABIgA",  # show all flights and prices condition
        },
        headers={"user-agent": ua, "accept-language": "en"},
        cookies=cookies,
    )
and it works now!!
Hi again,
That's great news. I've updated the project (v0.3) and now you can pass custom **kwargs such as cookies to requests.get, so there's no need to clone the source.
# tag: v0.3
get_flights(filter, cookies={…}, proxies={…}, ...)
Commit: https://github.com/AWeirdDev/flights/commit/696885c9364d68f38b6563b1e6d61a1f4fd33705 Install v0.3: `pip install fast-flights==0.3`
Additionally, the cookie provided (CAESHAgBEhJnd3NfMjAyMzA4MTAtMF9SQzIaAmRlIAEaBgiAo_CmBg) is also a Protobuf string, so I plan to support bypassing this screen (ToS) when I have time.
Cheers, mate!
First of all - thanks for sharing your work! Second - I tried all of the above and I get exactly the same error.
1) tried both with installed fast-flights and cloned repo
2) reinstalled selectolax
3) passed cookies: cookies = { "CONSENT": "YES+" }; result = get_flights(filter, cookies=cookies)
4) the example I try is present on GF https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDYtMDNqBRIDS1JLcgUSA1NaWUIBAUgBmAEC
5) the same with your genuine example https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDctMDJqBRIDVFBFcgUSA01ZSkIDAQECSAGYAQI=
No prices; result.flights is just an empty list.
A bit of debug:
def get_flights(tfs: TFSData, **kwargs: Any) -> Result:
    print(tfs, kwargs)
    rs = request_flights(tfs, **kwargs)
    return rs
    # return results
I get TFSData('hello', flight_data=[FlightData(date='2024-06-03', from_airport=KRK, to_airport=SZY)]) {'cookies': {'CONSENT': 'YES+'}}, and then rs is just <Response [200]>.
Hey there,
Could you provide me with what requests returned? I'll need to inspect the HTML on your end:
from fast_flights.core import request_flights
r = request_flights(tfs, cookies=cookies)
with open(".html", "wb") as f:
    f.write(r.content)
...and please provide me with the created .html file.
Best, AWeirdDev
Interesting... is it again about consent to Google Terms despite the fact "cookies" are approved?
I see inside HTML
I added in the code
cookies = { "CONSENT": "YES+" }
result = get_flights(filter, cookies=cookies)
r = request_flights(filter, cookies=cookies)
with open("test.html", "wb") as f:
    f.write(r.content)
Thanks!
Hey there,
I searched Stack Overflow again and this may work:
cookies = {
"CONSENT": "PENDING+987",
"SOCS": "CAESHAgBEhJnd3NfMjAyMzA4MTAtMF9SQzIaAmRlIAEaBgiAo_CmBg" # protobuf
}
Please let me know if adding SOCS fixes the issue.
Regards, AWeirdDev
Nope :-(
The list (if result is uncommented) is still empty. I'm attaching the (now considerably larger) test.html file... test2.zip And the test.py script test.py.zip
[test.txt](https://github.com/AWeirdDev/flights/files/15445695/test.txt)
I adjusted the source code a little bit, and here's what I got:
And in this line from source:
It removes the last item from the list causing the ONLY result to be shadowed. Originally during the development process, I added this as a safety assurance since there's a huge chance of exiting/crashing for no reason on the last found result.
If you cloned from source, just remove the [:-1]. However, I cannot guarantee that the result would look pretty (as seen in my Replit Mobile screenshot, some random characters pop out).
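The effect of that slice can be reproduced without any scraping. A minimal sketch, using made-up strings in place of parsed HTML nodes and a hypothetical helper name:

```python
def keep_all_but_last(items):
    # Mirrors the defensive [:-1] slice: the final item is always dropped.
    return items[:-1]


many = ["flight A", "flight B", "flight C"]
single = ["flight A"]

print(keep_all_but_last(many))    # the last entry is lost, the rest survive
print(keep_all_but_last(single))  # the only entry is lost, leaving an empty list
```

With several results the loss is easy to miss, but a search that returns exactly one flight ends up looking like a search with no flights at all.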
In the next version, I'll add dangerously_allow_looping_last_item to the options, as well as currency options and cookie customization/improvements.
Thanks for pointing this out!
Cheers, AWeirdDev
Thanks for the debugging! - usually it is much easier for the developer to find the problem :-) However, am I right, that I do not see the price in the output now? (for the very moment 210 PLN)
Result(current_price='high', flights=[Flight(is_best=True, name='LOT', departure='4:55 PM on Mon, Jun 3', arrival='6:05 PM on Mon, Jun 3', arrival_time_ahead='', duration='1 hr 10 min', stops=0, delay=None)])
I would expect a price: float field in the Flight class, but I have no idea which selector in core.py, e.g. price = safe(item.css_first("HERE_SOMETHING")).text(), would be good. The HTML output has multiple occurrences of this value...
OK, I tried something like this:
# Get price
price_text = safe(item.css_first("div.YMlIz.FpEdX.jLMuyc > span")).text()
try:
    current_price = float(price_text.replace('PLN', '').strip())
except ValueError:
    current_price = None
plus
from dataclasses import dataclass
from typing import Optional

@dataclass
class Flight:
    is_best: bool
    name: str
    departure: str
    arrival: str
    arrival_time_ahead: str
    duration: str
    stops: int
    delay: Optional[str]
    price: float
but to be honest, this is hard-wired to PLN, and I have no idea whether the selector "div.YMlIz.FpEdX.jLMuyc > span" is something generic. Perhaps one should somehow force a given currency (e.g. USD or EUR).
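One way to avoid hard-wiring the currency is to ignore the currency label and extract only the number. This is a sketch with a hypothetical parse_price helper. It assumes English number formatting (comma as thousands separator, dot as decimal point), which seems plausible because the request sets hl=en, but that is an assumption and not something verified against the page.

```python
import re
from typing import Optional


def parse_price(text: str) -> Optional[float]:
    """Extract the first number from a price label such as 'PLN 210' or '$1,234'."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None  # no digits at all, e.g. an empty label
    return float(match.group().replace(",", ""))


for label in ["PLN 210", "$1,234", "PLN\u00a04,014.50", ""]:
    print(repr(label), parse_price(label))
```

Returning None for an unparseable label keeps a missing price distinguishable from a real one, so the price field would then need to be Optional[float].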
Another thing that might be missing is the information delivered for round trips etc. I am able to define a correct filter = create_filter( with flight_data=[ as a list of dates. But the output is e.g.
Flight(is_best=True, name='LOT', departure='4:55 PM on Mon, Jun 10', arrival='6:05 PM on Mon, Jun 10', arrival_time_ahead='', duration='1 hr 10 min', stops=0, delay=None, price=288.0)
so I can see only a departure date Jun, 10 (and not a return flight, e.g. Jun, 17). This is an issue for a real scanner :-) and even more, for the multi-leg, no information on every leg and date would make the tool not usable.
One more remark. Actually I can write a script which does a loop in the loop etc. then printing departure, arrival, date, and price (just from the loop). However, I immediately see that many queries are lost because of not loaded result data. E.g. some mixture of ex-European to Japan flights (e.g. the second line URL):
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDFqBRIDQVJOcgUSA1RZTxoaEgoyMDI0LTA5LTA4agUSA1RZT3IFEgNBUk5CAQFIAZgBAQ==
Departure: ARN, Arrival: TYO, Departure Date: 2024-09-01, Return Date: 2024-09-08, Price: 4014.0
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDFqBRIDQVJOcgUSA1RZTxoaEgoyMDI0LTA5LTA5agUSA1RZT3IFEgNBUk5CAQFIAZgBAQ==
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDFqBRIDQVJOcgUSA1RZTxoaEgoyMDI0LTA5LTEwagUSA1RZT3IFEgNBUk5CAQFIAZgBAQ==
Departure: ARN, Arrival: TYO, Departure Date: 2024-09-01, Return Date: 2024-09-10, Price: 2940.0
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDFqBRIDQVJOcgUSA1RZTxoaEgoyMDI0LTA5LTExagUSA1RZT3IFEgNBUk5CAQFIAZgBAQ==
Departure: ARN, Arrival: TYO, Departure Date: 2024-09-01, Return Date: 2024-09-11, Price: 2940.0
......
......
(after a while it gets even worse) - no parsing at all
Departure: ARN, Arrival: TYO, Departure Date: 2024-09-02, Return Date: 2024-09-10, Price: 3857.0
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDJqBRIDQVJOcgUSA1RZTxoaEgoyMDI0LTA5LTExagUSA1RZT3IFEgNBUk5CAQFIAZgBAQ==
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDJqBRIDQVJOcgUSA1RZTxoaEgoyMDI0LTA5LTEyagUSA1RZT3IFEgNBUk5CAQFIAZgBAQ==
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDJqBRIDQVJOcgUSA1RZTxoaEgoyMDI0LTA5LTEzagUSA1RZT3IFEgNBUk5CAQFIAZgBAQ==
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDJqBRIDQVJOcgUSA1RZTxoaEgoyMDI0LTA5LTE0agUSA1RZT3IFEgNBUk5CAQFIAZgBAQ==
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDNqBRIDQVJOcgUSA1RZTxoaEgoyMDI0LTA5LTEwagUSA1RZT3IFEgNBUk5CAQFIAZgBAQ==
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDNqBRIDQVJOcgUSA1RZTxoaEgoyMDI0LTA5LTExagUSA1RZT3IFEgNBUk5CAQFIAZgBAQ==
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDNqBRIDQVJOcgUSA1RZTxoaEgoyMDI0LTA5LTEyagUSA1RZT3IFEgNBUk5CAQFIAZgBAQ==
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDNqBRIDQVJOcgUSA1RZTxoaEgoyMDI0LTA5LTEzagUSA1RZT3IFEgNBUk5CAQFIAZgBAQ==
......
......
then it shows up sometimes
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDFqBRIDQVJOcgUSA09TQRoaEgoyMDI0LTA5LTA5agUSA09TQXIFEgNBUk5CAQFIAZgBAQ==
Departure: ARN, Arrival: OSA, Departure Date: 2024-09-01, Return Date: 2024-09-09, Price: 2789.0
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDFqBRIDQVJOcgUSA09TQRoaEgoyMDI0LTA5LTEwagUSA09TQXIFEgNBUk5CAQFIAZgBAQ==
Departure: ARN, Arrival: OSA, Departure Date: 2024-09-01, Return Date: 2024-09-10, Price: 3918.0
https://www.google.com/travel/flights?tfs=GhoSCjIwMjQtMDktMDFqBRIDQVJOcgUSA09TQRoaEgoyMDI0LTA5LTExagUSA09TQXIFEgNBUk5CAQFIAZgBAQ==
My guess is that something should be done within:
def request_flights(tfs: TFSData, **kwargs: Any) -> requests.Response:
like checking the content of
try:
    r.raise_for_status()
    if "Best departing flights" in r.text:
        return r
plus some retries or pauses (sleep). And obviously, it will terribly slow down the scan. Which probably should be done... in parallel. The script (if you wish to test): japan_scan.py.zip
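The nested-loop scan and the idea of running it in parallel could look roughly like this. It is only a sketch: scan_one is a hypothetical stand-in that performs no network request, where a real version would build a filter and call get_flights. Note that sending many requests at once may well make the incomplete-response problem worse, so the worker count is kept small.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def scan_one(pair):
    """Stand-in for one search; returns (departure, return date, price)."""
    departure, ret = pair
    return (departure, ret, None)  # None marks "no price parsed"


departures = ["2024-09-01", "2024-09-02"]
returns = ["2024-09-08", "2024-09-09", "2024-09-10"]
pairs = list(product(departures, returns))  # every departure/return combination

# A small worker count keeps the request rate modest; map() preserves input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(scan_one, pairs))

for departure, ret, price in results:
    print(departure, ret, price)
```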
This seems to help in my testing:
import time
from typing import Any

import requests
from requests.exceptions import HTTPError

# TFSData and ua (the user-agent string) come from the surrounding fast_flights module.


def request_flights(tfs: TFSData, **kwargs: Any) -> requests.Response:
    max_retries = 4  # Maximum number of retries
    wait_time = 10  # Time to wait between retries in seconds
    for attempt in range(max_retries):
        r = requests.get(
            "https://www.google.com/travel/flights",
            params={
                "tfs": tfs.as_b64(),
                "hl": "en",
                "tfu": "EgQIABABIgA",  # show all flights and prices condition
            },
            headers={"user-agent": ua, "accept-language": "en"},
            **kwargs
        )
        try:
            r.raise_for_status()
            # Check if the expected results are ready (you can adjust this part based on the actual response)
            if "Best departing flights" in r.text:
                return r
        except HTTPError as http_err:
            print(f"HTTP error occurred: {http_err}")
        except Exception as err:
            print(f"Other error occurred: {err}")
        # Wait before retrying
        time.sleep(wait_time)
    # Raise an error if the maximum number of retries is reached
    r.raise_for_status()
    return r
but this is not the golden solution. There are still responses which do not come back with results in time... even when repeating them (well, perhaps one should increase the wait time with every attempt).
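The idea of waiting longer on each attempt can be sketched as exponential backoff. The helper below is hypothetical and deliberately independent of requests: it takes any fetch callable and any readiness check, so it can be tried out without hitting Google. The doubling factor and the 10-second base are assumptions chosen for illustration, not tuned values.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fetch: Callable[[], T],
    is_ready: Callable[[T], bool],
    max_retries: int = 4,
    base_wait: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fetch() until is_ready(result) holds, doubling the wait each time."""
    for attempt in range(max_retries):
        result = fetch()
        if is_ready(result):
            return result
        if attempt < max_retries - 1:
            sleep(base_wait * 2 ** attempt)  # 10 s, 20 s, 40 s, ...
    return None  # caller decides how to treat "never became ready"


# Demo with a fake fetch that only succeeds on the third call; waits are recorded, not slept.
responses = iter(["loading", "loading", "Best departing flights"])
waits = []
page = retry_with_backoff(
    fetch=lambda: next(responses),
    is_ready=lambda text: "Best departing flights" in text,
    sleep=waits.append,
)
print(page, waits)
```

In request_flights, fetch would wrap the requests.get call and is_ready would be the "Best departing flights" check. Skipping the sleep after the final attempt also avoids a pointless wait before giving up.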
Hi there,
Thanks for the huge dedication! I'll look into this tomorrow.
P.S. I sincerely apologize for the huge delay. I've been busy this week.
Regards, AWeirdDev
Hi,
First of all, thank you for dedicating a significant amount of time to writing code that works and can serve as a starting point for further development. Many scrapers don't work at all, as you have likely noticed. However, I believe your code has the potential to become a cool tool, operating differently from all those Selenium simulators. On another note, I'm working on something similar with the ITA Matrix service, which is an excellent source for both prices and airfares. In its new version, the address is constructed from JSON to Base64, requiring browser emulation to retrieve the result.
(side example for ITA Matrix)
import requests
import base64
import json
# Your Base64 encoded query string
encoded_query = "eyJ0eXBlIjoib25lLXdheSIsInNsaWNlcyI6W3sib3JpZ2luIjpbIkpGSyJdLCJkZXN0IjpbIkxBWCJdLCJyb3V0aW5nIjoiIiwiZXh0IjoiIiwicm91dGluZ1JldCI6IiIsImV4dFJldCI6IiIsImRhdGVzIjp7InNlYXJjaERhdGVUeXBlIjoiY2FsZW5kYXIiLCJkZXBhcnR1cmVEYXRlIjoiMjAyNC0wOS0xNSIsImRlcGFydHVyZURhdGVUeXBlIjoiZGVwYXJ0IiwiZGVwYXJ0dXJlRGF0ZU1vZGlmaWVyIjoiMCIsImRlcGFydHVyZURhdGVQcmVmZXJyZWRUaW1lcyI6W10sInJldHVybkRhdGVUeXBlIjoiZGVwYXJ0IiwicmV0dXJuRGF0ZU1vZGlmaWVyIjoiMCIsInJldHVybkRhdGVQcmVmZXJyZWRUaW1lcyI6W119fV0sIm9wdGlvbnMiOnsiY2FiaW4iOiJDT0FDSCIsInN0b3BzIjoiLTEiLCJleHRyYVN0b3BzIjoiMSIsImFsbG93QWlycG9ydENoYW5nZXMiOiJ0cnVlIiwic2hvd09ubHlBdmFpbGFibGUiOiJ0cnVlIiwiY3VycmVuY3kiOnsiZGlzcGxheU5hbWUiOiJVbml0ZWQgU3RhdGVzIERvbGxhciAoVVNEKSIsImNvZGUiOiJVU0QifSwic2FsZXNDaXR5Ijp7ImNvZGUiOiJOWUMiLCJuYW1lIjoiTmV3IFlvcmsifX0sInBheCI6eyJhZHVsdHMiOiIxIn19"
# Form the URL
base_url = "https://matrix.itasoftware.com/calendar?search="
full_url = f"{base_url}{encoded_query}"
# Custom headers (replace these with actual headers from your browser)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://matrix.itasoftware.com/",
    "Cookie": "YOUR_SESSION_COOKIES_HERE"
}
# Send the GET request with custom headers
response = requests.get(full_url, headers=headers)
# Check the response
if response.status_code == 200:
    print(response.url)
    print(response.text)  # You can parse this HTML using BeautifulSoup as needed
else:
    print(f"Failed to retrieve the data. Status code: {response.status_code}")
Hey,
Thanks. I've added your recommendation. (See Project) This could be an alternative when Google Flights isn't working properly.
As a side note, we won't be using BeautifulSoup, since its slower parsing would prolong the flow and hold up our script.
Thanks for your contribution!
Regards, AWeirdDev
Hi,
First of all well done - awesome project!
I get the following error when I try to run the sample code:
result is
Result(current_price='', flights=[])
plz halp :)
P.S. tried adding you on discord but your id isn't working. DM me on twitter (just followed you) - I'm looking to use this API for something fun :)