carlosrsabreu / devo-abastecer

Twitter bot that publishes weekly fuel price updates for Madeira island.
https://twitter.com/devoabastecer
MIT License

[FEATURE]: Get PDF links that contain gas prices from JORAM #10

Closed carlosrsabreu closed 1 year ago

carlosrsabreu commented 2 years ago

Expected Behavior

At this moment, we have a scraper that reads a JORAM PDF file and returns a dictionary with the gas prices. You can check the scraper in this file.
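
For reference, a minimal sketch of what such a scraper can look like, assuming pdfplumber for the text extraction; the fuel labels, the price pattern, and the extract_prices name are illustrative assumptions, not the bot's actual code:

import re

import pdfplumber

# Hypothetical fuel labels; the real scraper may key the dictionary differently
FUELS = ['Gasolina', 'Gasóleo Rodoviário', 'Gasóleo Colorido e Marcado']
PRICE_RE = re.compile(r'(\d+,\d+)')  # Portuguese decimals use a comma

def extract_prices(pdf_path):
    """Return a {fuel name: price in EUR} dictionary scraped from a JORAM PDF."""
    prices = {}
    with pdfplumber.open(pdf_path) as pdf:
        text = '\n'.join(page.extract_text() or '' for page in pdf.pages)
    for line in text.splitlines():
        for fuel in FUELS:
            if fuel.lower() in line.lower():
                match = PRICE_RE.search(line)
                if match:
                    prices[fuel] = float(match.group(1).replace(',', '.'))
    return prices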

You can check some examples here:

[image: examples]

Actual Behavior

The gas prices are retrieved from the Direção Regional de Economia e Transportes Terrestres website.

However, these prices are not updated every Friday (the JORAM document is issued with the gas prices for the following week), so sometimes we only get the gas prices after the week has already started and don't have the information in time.

To see how we retrieve the gas prices at this moment, check this file.

joaoofreitas commented 2 years ago

Shall be done.

HarryVasanth commented 1 year ago

@joaoofreitas @carlosrsabreu

This will be a good start, I guess 💁‍♂️:

import datetime

import requests
from bs4 import BeautifulSoup

# Current year
current_year = datetime.datetime.now().year

# JORAM listing page for the current year
url = f'https://joram.madeira.gov.pt/joram/2serie/Ano%20de%20{current_year}/'

response = requests.get(url)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# Keep only anchors that point to PDF files (anchors without an href are skipped)
pdf_hrefs = [a['href'] for a in soup.find_all('a', href=True) if a['href'].endswith('.pdf')]

# The filenames embed the issue number and date (e.g. IISerie-020-2023-01-27Supl.pdf),
# so a reverse lexicographic sort puts the newest file first. Note that bs4 Tag
# objects are not orderable, so we sort the href strings instead of the tags.
newest_pdf_href = sorted(pdf_hrefs, reverse=True)[0]
newest_pdf_filename = newest_pdf_href.split('/')[-1]

# Use the filename to scrape the content
print(newest_pdf_filename)
joaoofreitas commented 1 year ago

The current issue is stale.

The branch I was working on is this one. It went stale due to inconsistency in JORAM paper publishing and my lack of time.

Following this, feel free to take this over, @HarryVasanth. Always a pleasure to work with you :)

HarryVasanth commented 1 year ago

@joaoofreitas I'll get to it at some point. At least, I will try to. Likewise! 😊

Dntfreitas commented 1 year ago

Another approach: https://cloud.google.com/vision/docs/pdf. This makes it possible to extract structured text from a PDF file.
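
For illustration, a minimal sketch of that flow, based on the documented async PDF API; the bucket URIs are hypothetical, and the service requires the files to be staged in Google Cloud Storage:

from google.cloud import vision_v1

client = vision_v1.ImageAnnotatorClient()

# Hypothetical bucket paths: Vision's PDF endpoint reads its input from
# and writes its JSON output to Google Cloud Storage
gcs_source = vision_v1.GcsSource(uri='gs://some-bucket/IISerie-020-2023-01-27Supl.pdf')
gcs_destination = vision_v1.GcsDestination(uri='gs://some-bucket/ocr-output/')

request = vision_v1.AsyncAnnotateFileRequest(
    features=[vision_v1.Feature(type_=vision_v1.Feature.Type.DOCUMENT_TEXT_DETECTION)],
    input_config=vision_v1.InputConfig(gcs_source=gcs_source, mime_type='application/pdf'),
    output_config=vision_v1.OutputConfig(gcs_destination=gcs_destination, batch_size=20),
)

# Start the asynchronous OCR job and wait for the results to land in the bucket
operation = client.async_batch_annotate_files(requests=[request])
operation.result(timeout=300)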

joaoofreitas commented 1 year ago

@Dntfreitas Even though I like the idea, and it's quite simple to implement, it would create a dependency on Google services, which is perfectly avoidable.

Is there a way to use the AI independently that I'm missing?

Anyway, this is a decision for @carlosrsabreu to make.

Feedback is appreciated too: @13dev @HarryVasanth

carlosrsabreu commented 1 year ago

@joaoofreitas yes, we can avoid creating that dependency. Also, @Dntfreitas said that it is a paid service (the first 1,000 requests are free), but I believe we can do without it.

We can follow the initial idea of having the scraper written in plain Python.

Let's find a solution for this issue and, hopefully, get the first stable version of the app! 🙏

HarryVasanth commented 1 year ago

@joaoofreitas @carlosrsabreu

I agree with keeping the original scraper. The fewer dependencies, the fewer things we need to worry about getting deprecated. Also, keeping it primitive and minimal means it's easier to fix than going on an API witch-hunt. 💁

Dntfreitas commented 1 year ago

True. But humans are inconsistent: one week they write things one way, and the next week in a completely different way.

I agree; we should try the original solution first. If it does not work, we can move to the solution I proposed, or another one we find at the time 👌

carlosrsabreu commented 1 year ago

Expected Behavior

At this moment, we have a scraper that reads a JORAM PDF file and returns a dictionary with the gas prices. You can check the scraper in this file.

  • The PDF files can be found in this link;
  • The weekly gas prices are issued on Thursdays or Fridays (most of them on Fridays);
  • The 2.º Suplemento PDF files are the ones that contain gas prices (but not all of them).

You can check some examples here:

[image: examples]

Actual Behavior

The gas prices are retrieved from the Direção Regional de Economia e Transportes Terrestres website.

However, these prices are not updated every Friday (the JORAM document is issued with the gas prices for the following week), so sometimes we only get the gas prices after the week has already started and don't have the information in time.

To see how we retrieve the gas prices at this moment, check this file.

EDIT: The 2.º Suplemento PDF files are sometimes not the ones that contain gas prices. Here is an example: https://joram.madeira.gov.pt/joram/2serie/Ano%20de%202023/IISerie-020-2023-01-27Supl.pdf For this reason, we should scrape all of them; we can't create a rule to scrape just a few.

@HarryVasanth @Dntfreitas @joaoofreitas @13dev

joaoofreitas commented 1 year ago

Let's brute-force all of them. It should not be a problem.

HarryVasanth commented 1 year ago

@carlosrsabreu

For this reason, we should scrape all of them; we can't create a rule to scrape just a few.

@joaoofreitas

Let's brute-force all of them. It should not be a problem.

Or... we sort them as we do now, and iterate one at a time until we get a valid fuel price? 🤔

graph TD
B[Sort PDFs by date]
B --> C[Check first PDF for *fuel price*]
C --> D{PDF contains *fuel price*?}
D -- Yes --> E[Stop iterating and use this PDF]
D -- No --> F[Check next PDF for *fuel price*]
F --> D
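
A minimal sketch of that loop, where pdf_urls would come from the listing-page scrape above and extract_prices is whatever PDF scraper we settle on (an empty dictionary meaning "no fuel prices in this PDF"); the download helper and the date regex are illustrative assumptions:

import re
import tempfile

import requests

# Filenames embed the issue date, e.g. IISerie-020-2023-01-27Supl.pdf
DATE_RE = re.compile(r'(\d{4}-\d{2}-\d{2})')

def download(url):
    """Download a PDF to a temporary file and return its path."""
    response = requests.get(url)
    response.raise_for_status()
    with tempfile.NamedTemporaryFile(suffix='.pdf', delete=False) as f:
        f.write(response.content)
        return f.name

def newest_pdf_with_prices(pdf_urls, extract_prices):
    """Walk the PDFs from newest to oldest and return the first one
    that actually contains fuel prices, together with those prices."""
    # Sort by the date embedded in the filename, newest first
    by_date = sorted(
        (url for url in pdf_urls if DATE_RE.search(url)),
        key=lambda url: DATE_RE.search(url).group(1),
        reverse=True,
    )
    for url in by_date:
        prices = extract_prices(download(url))
        if prices:  # this PDF contains fuel prices: stop iterating
            return url, prices
    return None, {}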
joaoofreitas commented 1 year ago

That's exactly the solution I was thinking about.

Let's keep it on!

HarryVasanth commented 1 year ago

That's exactly the solution I was thinking about.

Let's keep it on!

Done 😅 Updated: #23