dgtlmoon / changedetection.io

The best and simplest free open source web page change detection, website watcher, restock monitor and notification service. Restock Monitor, change detection. Designed for simplicity - Simply monitor which websites had a text change for free. Free Open source web page change detection, Website defacement monitoring, Price change notification
https://changedetection.io
Apache License 2.0
18.92k stars 1.03k forks

[feature] Handle PDF files opened in new tab in playwright. #2019

Open jjmoffitt opened 11 months ago

jjmoffitt commented 11 months ago

Version and OS: latest docker

Is your feature request related to a problem? Please describe. I'm trying to create a playwright workflow for a site I want a statement from, for example my gas utility bill. When you click "download statement", playwright opens the PDF in a new tab, and the steps options have no way of handling that.

Describe the solution you'd like: I could see multiple paths for this: either add logic to handle the newly opened tab, which I've seen playwright is able to do, or switch the behavior to download the file instead of opening it in a new tab. Then either give us the ability to send the PDF as a notification to something, or send an image of the page, which, if I remember correctly, is an option now.
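For reference, here is a rough sketch of the "handle the new tab" path using Playwright's Python sync API. It is only an illustration under assumptions, not how changedetection.io works today: the function name `save_pdf_from_new_tab` and the selector are made up, and it assumes the new tab ends up on a plain URL that can be requested again with the same session cookies. Headless Chromium may treat the PDF as a download instead, in which case this approach would not apply.

```python
from pathlib import Path


def save_pdf_from_new_tab(page, selector: str, dest: Path) -> Path:
    """Click `selector`, capture the tab it opens, and save what that tab points at.

    `page` is expected to be a playwright.sync_api.Page (hypothetical helper).
    """
    # expect_popup() waits for the new tab opened by the action inside the block.
    with page.expect_popup() as popup_info:
        page.click(selector)
    popup = popup_info.value
    popup.wait_for_load_state()

    # Request the same URL through the browser context so the logged-in
    # session (cookies) is reused. Assumes the URL can be fetched twice.
    response = page.context.request.get(popup.url)
    if not response.ok:
        raise RuntimeError(f"Fetching {popup.url} failed with HTTP {response.status}")

    dest.write_bytes(response.body())
    popup.close()
    return dest
```

Usage would look something like `save_pdf_from_new_tab(page, "text=Download statement", Path("statement.pdf"))` after the login steps have run on `page`.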

My ideal workflow is that changedetection looks at the statement, checks whether it differs from the previous one, and sends it in some way to a system where I can have paperless consume it automatically. There are many ways to handle that, like saving straight to a folder that could be mounted in the container, email, RSS, etc. I'd be happy to work around whatever you find easiest to develop.
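To make the "see if it's different, then drop it in a folder" idea concrete, here is a minimal sketch using only the Python standard library. All names (`store_if_changed`, `consume_dir`, `state_dir`) are hypothetical and nothing here is existing changedetection.io code. It compares raw bytes via SHA-256, so a PDF that embeds a fresh generation timestamp on every download may be reported as changed even when the visible content is the same.

```python
import hashlib
from pathlib import Path


def store_if_changed(pdf_bytes: bytes, consume_dir: Path, state_dir: Path, name: str) -> bool:
    """Write the PDF into `consume_dir` only if it differs from the last one seen.

    Returns True when a new file was written, False when the content was unchanged.
    """
    digest = hashlib.sha256(pdf_bytes).hexdigest()

    # The last seen hash lives outside the consume folder so that the
    # consumer (e.g. paperless) never picks it up as a document.
    state_dir.mkdir(parents=True, exist_ok=True)
    marker = state_dir / f"{name}.sha256"
    if marker.exists() and marker.read_text().strip() == digest:
        return False

    consume_dir.mkdir(parents=True, exist_ok=True)
    (consume_dir / f"{name}-{digest[:12]}.pdf").write_bytes(pdf_bytes)
    marker.write_text(digest)
    return True
```

With the consume folder mounted into the container as a volume, the other system would only ever see statements whose bytes actually changed.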

Describe the use-case and give concrete real-world examples: For a long time I've seen people trying to find ways to automate downloading things like bank statements, credit card statements, utility bills, medical records, you name it. It's very difficult for one piece of software to handle all of those with something built for each system, but I think changedetection is incredibly close to letting us do it generically.

dgtlmoon commented 11 months ago

A bit of random info: at the moment, if you force it to use playwright (i.e. you try to access a PDF directly), you get

<!DOCTYPE html><html><head></head><body style="height: 100%; width: 100%; overflow: hidden; margin:0px; background-color: rgb(38, 38, 38);"><embed name="90C8FDC527DA83A5BC935C860D0AD25B" style="position:absolute; left: 0; top: 0;" width="100%" height="100%" src="about:blank" type="application/pdf" internalid="90C8FDC527DA83A5BC935C860D0AD25B"></body></html>

as the content

Also, this bug just came back: https://github.com/dgtlmoon/changedetection.io/pull/2020 , rolling a new subrelease now
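As a side note, the wrapper markup quoted above is easy to recognise, so a fetcher could at least detect that it received the browser's PDF viewer shell rather than real page content. A small sketch with the standard library's `html.parser`; the function name `is_pdf_viewer_wrapper` is made up for illustration and is not part of changedetection.io:

```python
from html.parser import HTMLParser


class _EmbedFinder(HTMLParser):
    """Records whether the document contains <embed type="application/pdf">."""

    def __init__(self):
        super().__init__()
        self.found_pdf_embed = False

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        if tag == "embed" and dict(attrs).get("type") == "application/pdf":
            self.found_pdf_embed = True


def is_pdf_viewer_wrapper(html: str) -> bool:
    """Return True if `html` looks like the PDF viewer shell, not real content."""
    finder = _EmbedFinder()
    finder.feed(html)
    return finder.found_pdf_embed
```

A check like this could be a cue to fall back to a plain HTTP fetch of the same URL instead of diffing the empty viewer page.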

dgtlmoon commented 11 months ago

ghost of me previous https://github.com/microsoft/playwright/issues/6091

jjmoffitt commented 11 months ago

I'm going down a hole of trying to learn some of this playwright stuff to see if I can bash it along with you. One thing I noticed is that browserless started supporting firefox in July. And from the playwright ticket it sounds like the pdf handling might work fine on firefox? Would it be an option to try that for the handling instead of chrome? I'll see if I can make it work on my own side just with the container first.