Closed: @carlosrsabreu closed this issue 1 year ago
Shall be done.
@joaoofreitas @carlosrsabreu
This will be a good start, I guess 💁♂️:
```python
import datetime

import requests
from bs4 import BeautifulSoup

# Current year
current_year = datetime.datetime.now().year

# JORAM listing for the current year
url = f"https://joram.madeira.gov.pt/joram/2serie/Ano%20de%20{current_year}/"

response = requests.get(url)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Collect the PDF hrefs; some <a> tags have no href, so use .get()
pdf_links = [
    link["href"]
    for link in soup.find_all("a")
    if link.get("href", "").endswith(".pdf")
]

# Find the newest file by sorting the hrefs (the filenames embed the date,
# so the lexicographically largest one is the newest) and extract its name
newest_pdf_filename = sorted(pdf_links)[-1].split("/")[-1]

# Use the filename to scrape the content
print(newest_pdf_filename)
```
The current Issue is stale.
The branch I was working on is this one. This issue is currently stale due to inconsistency in JORAM paper publishing and my lack of time.
Following this, feel free to make this move @HarryVasanth. Always a pleasure to work with you :)
@joaoofreitas I'll get to it at some point. At least, I will try to. Likewise! 😊
Another approach: https://cloud.google.com/vision/docs/pdf. This makes it possible to extract structured text from a PDF file.
@Dntfreitas Even though I like the idea, and it's quite simple to implement, it would create a dependency on Google services, which is perfectly avoidable.
Is there a way to use the AI independently that I am missing?
Anyway, this is a matter for @carlosrsabreu to decide.
Feedback is appreciated too: @13dev @HarryVasanth
@joaoofreitas yes, we can avoid creating that dependency. Also, @Dntfreitas said that it is a paid service (the first 1,000 requests are free), but I believe we can do without it.
We can follow the initial idea of having the scraper written in Python only.
Let's get a solution for this issue and we'll have the first stable version of the app, hopefully! 🙏
@joaoofreitas @carlosrsabreu
I agree with keeping the original scraper. The fewer dependencies, the fewer things we need to worry about getting deprecated. Also, keeping it primitive and minimal means it's easier to fix than going on an API witch-hunt. 💁
True. But humans are inconsistent. This week they write in one way, and the other week in a completely different way.
I agree; we should try the original solution first. If it does not work, we move to the solution I proposed, or another one we find at the time 👌
Expected Behavior
At this moment, we have a scraper that reads a JORAM PDF file and returns a dictionary with the gas prices. You can check the scraper in this file.
- The PDF files can be found at this link;
- The weekly gas prices are issued on Thursdays or Fridays (most of them on Fridays);
- The 2.º Suplemento PDF files are the ones that contain gas prices (but not all of them).
You can check some examples here:
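To make the "returns a dictionary with the gas prices" part concrete, here is a hedged sketch of a parser that pulls euro prices out of already-extracted PDF text with a regex. The fuel names and line layout in the sample are assumptions for illustration, not the confirmed JORAM format:

```python
import re

def parse_gas_prices(text: str) -> dict:
    """Extract fuel names and euro prices from extracted PDF text.

    Assumes lines like 'Gasolina super sem chumbo IO 95 1,610' —
    a hypothetical layout, not the confirmed JORAM format.
    """
    prices = {}
    # Match a fuel description followed by a comma-decimal euro price
    # at the end of the line.
    pattern = re.compile(r"([A-Za-zÀ-ÿ0-9 ]+?)\s+(\d+,\d{2,3})\s*$", re.MULTILINE)
    for name, price in pattern.findall(text):
        prices[name.strip()] = float(price.replace(",", "."))
    return prices

sample = "Gasolina super sem chumbo IO 95 1,610\nGasóleo rodoviário 1,420"
print(parse_gas_prices(sample))
```

The real extraction would run this over the text of each PDF; the regex would of course need adjusting to whatever layout the PDFs actually use.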
Actual Behavior
The gas prices are retrieved from the Direção Regional de Economia e Transportes Terrestres website.
However, these prices are not updated every Friday (the JORAM document is issued with the gas prices for the following week), so sometimes we only get the gas prices after the week has already started and don't have the info in time.
To check how we retrieve the gas prices at this moment, check this file.
EDIT: Sometimes the 2.º Suplemento PDF files are not the ones that contain gas prices. Here is an example: https://joram.madeira.gov.pt/joram/2serie/Ano%20de%202023/IISerie-020-2023-01-27Supl.pdf For this reason, we should scrape all of them; we can't create a rule to scrape only a few.
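Since the example filename above embeds its publication date (`IISerie-020-2023-01-27Supl.pdf`), sorting the PDFs chronologically could be done by parsing that date out of the name. A small sketch, assuming all filenames follow the same `...YYYY-MM-DD...` pattern (the second filename below is made up for illustration):

```python
import re
from datetime import date

def joram_date(filename):
    """Extract the publication date from a JORAM filename.

    Assumes names like 'IISerie-020-2023-01-27Supl.pdf', as in the
    example above; returns None when no date is found.
    """
    match = re.search(r"(\d{4})-(\d{2})-(\d{2})", filename)
    if not match:
        return None
    year, month, day = map(int, match.groups())
    return date(year, month, day)

files = ["IISerie-020-2023-01-27Supl.pdf", "IISerie-025-2023-02-03Supl.pdf"]
# Newest first, so the scraper can try the most recent issue first.
files.sort(key=joram_date, reverse=True)
print(files[0])  # → IISerie-025-2023-02-03Supl.pdf
```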
@HarryVasanth @Dntfreitas @joaoofreitas @13dev
Let's brute-force all of them. It should not be a problem.
@carlosrsabreu
For this reason, we should scrape all of them; we can't create a rule to scrape only a few.
@joaoofreitas
Let's brute-force all of them. It should not be a problem.
or... we sort it as we do, and iterate one at a time until we get to a valid fuel price? 🤔
```mermaid
graph TD
    B[Sort PDFs by date]
    B --> C[Check first PDF for *fuel price*]
    C --> D{PDF contains *fuel price*?}
    D -- Yes --> E[Stop iterating and use this PDF]
    D -- No --> F[Check next PDF for *fuel price*]
    F --> D
```
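The flow above could be sketched like this; `fetch_text` and `contains_prices` are hypothetical stand-ins for the real PDF download and price-detection steps:

```python
def first_pdf_with_prices(pdf_urls, fetch_text, contains_prices):
    """Walk PDFs from newest to oldest and stop at the first one
    whose extracted text contains fuel prices.

    pdf_urls: list of URLs already sorted newest-first.
    fetch_text: callable that downloads a PDF and returns its text
                (hypothetical; e.g. requests plus a PDF text extractor).
    contains_prices: callable deciding whether the text has prices.
    """
    for url in pdf_urls:
        text = fetch_text(url)
        if contains_prices(text):
            return url, text
    return None, None

# Tiny dry run with stubbed-out fetching, just to show the control flow.
fake_docs = {"a.pdf": "no prices here", "b.pdf": "Gasolina 1,610"}
url, text = first_pdf_with_prices(
    ["a.pdf", "b.pdf"],
    fetch_text=fake_docs.get,
    contains_prices=lambda t: "1," in t,
)
print(url)  # → b.pdf
```

This keeps the brute-force fallback (eventually every PDF gets checked) while usually stopping at the first, most recent hit.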
That's exactly the solution I was thinking about.
Let's keep it on!
Done 😅 Updated: #23