ail-project / lacus

Lacus is a capturing system using playwright, as a web service.
BSD 3-Clause "New" or "Revised" License

Documentation Request regarding cookies usage #18

Closed CarlosLannister closed 2 months ago

CarlosLannister commented 2 months ago

Hello,

As mentioned in the documentation there is a possibility of providing cookies whenever a new URL is enqueued.

Could you provide an example of this functionality? I tried passing cookie values in some test requests, such as:

"cookies": "[{ \"captcha_uid\":\"c6b21d73-a9fb-42c4-afa8-11f5774fe23f\"}]",

But playwright seems to expect some name and value keys:

"cookies": "[{ \"name\": \"captcha_uid\", \"value\": \"c6b21d73-a9fb-42c4-afa8-11f5774fe23f\"},]",

Neither of my attempts seemed to work.

Thanks in advance

Rafiot commented 2 months ago

This is a very good point: both requests should work, and they don't right now. I'll work on that.

In the meantime, the confirmed working cookies are the two following options:

In theory, [{ "name": "captcha_uid", "value": "c6b21d73-a9fb-42c4-afa8-11f5774fe23f"}] should work, but I guess the missing settings cause it to be ignored by Playwright.

Regardless, I'll investigate asap and at least the second approach will work. The first one may, but I need to give it a shot.

CarlosLannister commented 2 months ago

Hi,

First of all, thank you very much for the prompt response. I have been trying the Cookie Quick Manager add-on to export the cookies as a JSON file.

Such as:

[{
    "Host raw": "http://domain.onion/",
    "Name raw": "courier_market_session",
    "Path raw": "/",
    "Content raw": "ADLfJeZMFFOCh54HMqeFZ9DsByZzOuxX8ij02Opd",
    "Expires": "At the end of the session",
    "Expires raw": "0",
    "Send for": "Any type of connection",
    "Send for raw": "false",
    "HTTP only raw": "true",
    "SameSite raw": "lax",
    "This domain only": "Valid for host only",
    "This domain only raw": "true",
    "Store raw": "firefox-private",
    "First Party Domain": "domain.onion"
},
{... 

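Then I loaded it and enqueued the capture using PyLacus:

import json

# Load the cookies exported by Cookie Quick Manager
with open(args.cookies, 'r') as f:
    cookies = json.load(f)

# Pass them along with the capture request
uuid = self.lacus.enqueue(url=url, cookies=cookies)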

But got the following error in Lacus:

2024-06-18 11:20:09,595 LacusCore DEBUG:[293a5000-b712-4baa-9372-3253c9809131] Capturing http://domain.onion
2024-06-18 11:20:09,987 LacusCore ERROR:[293a5000-b712-4baa-9372-3253c9809131] Something went poorly http://domain.onion - 'name'
Traceback (most recent call last):
  File "/home/carlos/.cache/pypoetry/virtualenvs/lacus-gI8B2ZCa-py3.10/lib/python3.10/site-packages/lacuscore/lacuscore.py", line 622, in _capture
    capture.cookies = to_capture.get('cookies')  # type: ignore[assignment]
  File "/home/carlos/.cache/pypoetry/virtualenvs/lacus-gI8B2ZCa-py3.10/lib/python3.10/site-packages/playwrightcapture/capture.py", line 244, in cookies
    'name': cookie['name'],
KeyError: 'name'
2024-06-18 11:20:09,987 LacusCore WARNING:[293a5000-b712-4baa-9372-3253c9809131] Unable to capture http://domain.onion: Something went poorly http://domain.onion - 'name'
2024-06-18 11

I will try to debug it in further detail and come back with more info.

Rafiot commented 2 months ago

Just a note: the core reason neither of your attempts works is that Playwright requires either a url or a domain/path pair for each cookie, so name/value alone won't work. The (slightly unclear) documentation: https://playwright.dev/python/docs/api/class-browsercontext#browser-context-add-cookies

I'll find a solution for that to allow a generic domain and path.
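
As a rough illustration of that requirement, here is a minimal sketch of a pre-flight check that flags cookies Playwright would not be able to scope. The helper name `has_scope` is hypothetical and not part of Lacus, LacusCore or Playwright; the rule it encodes (a `url`, or both `domain` and `path`) is the one described above.

```python
from typing import Any


def has_scope(cookie: dict[str, Any]) -> bool:
    """Return True if the cookie has a url, or both a domain and a path.

    Hypothetical helper: it only mirrors the rule described in the thread.
    """
    if cookie.get("url"):
        return True
    return bool(cookie.get("domain")) and bool(cookie.get("path"))


cookies = [
    # name/value only: not enough for Playwright to know where to send it
    {"name": "captcha_uid", "value": "c6b21d73-a9fb-42c4-afa8-11f5774fe23f"},
    # name/value plus domain and path: scoped
    {"name": "captcha_uid", "value": "c6b21d73-a9fb-42c4-afa8-11f5774fe23f",
     "domain": "domain.onion", "path": "/"},
]

# Indexes of the cookies that still need a url or a domain/path pair
unscoped = [i for i, c in enumerate(cookies) if not has_scope(c)]
```

Running a check like this before enqueueing should make it easier to spot why a cookie is silently ignored.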

--

The export from Cookie Quick Manager fails because the converter I have for that format lives on the Lookyloo side: https://github.com/Lookyloo/lookyloo/blob/main/lookyloo/helpers.py#L238

So that's not great when you're using lacus only. I'll see if it makes sense to move that converter to LacusCore.
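
Until such a converter is available on the Lacus side, a conversion along these lines could be done client-side before enqueueing. This is only a sketch: the function name `convert_cqm_cookie` is hypothetical, and the mapping from the export's fields ("Name raw", "Content raw", "Host raw", ...) to Playwright's cookie fields is an assumption based on the sample export above, not a copy of the Lookyloo converter, which may behave differently.

```python
from typing import Any
from urllib.parse import urlparse

# Playwright expects sameSite to be one of "Strict", "Lax" or "None"
SAMESITE = {"strict": "Strict", "lax": "Lax", "none": "None"}


def convert_cqm_cookie(entry: dict[str, str]) -> dict[str, Any]:
    """Convert one Cookie Quick Manager entry (assumed field mapping)."""
    # "Host raw" looks like a URL in the export; keep only the hostname
    host = urlparse(entry["Host raw"]).hostname or entry["Host raw"]
    cookie: dict[str, Any] = {
        "name": entry["Name raw"],
        "value": entry["Content raw"],
        "domain": host,
        "path": entry.get("Path raw", "/"),
        "httpOnly": entry.get("HTTP only raw") == "true",
        "secure": entry.get("Send for raw") == "true",
    }
    # "Expires raw" of 0 is treated as a session cookie: leave expires unset
    expires = float(entry.get("Expires raw", "0"))
    if expires > 0:
        cookie["expires"] = expires
    same_site = SAMESITE.get(entry.get("SameSite raw", "").lower())
    if same_site:
        cookie["sameSite"] = same_site
    return cookie


sample = {
    "Host raw": "http://domain.onion/",
    "Name raw": "courier_market_session",
    "Path raw": "/",
    "Content raw": "ADLfJeZMFFOCh54HMqeFZ9DsByZzOuxX8ij02Opd",
    "Expires raw": "0",
    "Send for raw": "false",
    "HTTP only raw": "true",
    "SameSite raw": "lax",
}
converted = convert_cqm_cookie(sample)
```

The resulting dictionaries carry name, value, domain and path, so a list of them should be usable as the cookies parameter of enqueue.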

--

And finally the exception is because PlaywrightCapture expects a "name" key: https://github.com/Lookyloo/PlaywrightCapture/blob/main/playwrightcapture/capture.py#L267

--

My planned fix is:

  1. See what kind of generic domain/path I can pass when they're not present
  2. If the dict has only one key/value assume it is a name/value pair and process the dict accordingly
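
The two steps above could be sketched roughly as follows. This is not the code that ended up in LacusCore: `normalize_cookie` is a hypothetical name, and falling back to "/" for the path and to the hostname of the capture URL for the domain are assumptions made for the example.

```python
from typing import Any
from urllib.parse import urlparse


def normalize_cookie(cookie: dict[str, Any], capture_url: str) -> dict[str, Any]:
    """Fill in the fields Playwright needs (hypothetical sketch)."""
    if len(cookie) == 1:
        # Step 2: a single key/value is assumed to be a name/value pair
        (name, value), = cookie.items()
        cookie = {"name": name, "value": value}
    else:
        cookie = dict(cookie)  # work on a copy, leave the input untouched
    if "url" not in cookie:
        # Step 1: generic domain/path when they are not present (assumed defaults)
        cookie.setdefault("domain", urlparse(capture_url).hostname)
        cookie.setdefault("path", "/")
    return cookie


short_form = normalize_cookie(
    {"captcha_uid": "c6b21d73-a9fb-42c4-afa8-11f5774fe23f"},
    "http://domain.onion/login",
)
```

With the assumed defaults, the short form from the original request expands to a cookie with name, value, domain and path set.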

Rafiot commented 2 months ago

Just to confirm, if you pass a list of dictionaries with at least a name, value, domain and path, it works.
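
For reference, a payload in that shape could look like the sketch below. The cookie values are the ones from the original request; the domain and path values are placeholders, and the commented-out PyLacus lines (including the server URL) are assumptions about a typical setup rather than something confirmed in this thread.

```python
# One dictionary per cookie, each with at least name, value, domain and path
cookies = [
    {
        "name": "captcha_uid",
        "value": "c6b21d73-a9fb-42c4-afa8-11f5774fe23f",
        "domain": "domain.onion",  # placeholder: host the cookie should be sent to
        "path": "/",               # placeholder: send on every path of that host
    },
]

# Sanity check before enqueueing: list the cookies missing a required key
required = {"name", "value", "domain", "path"}
incomplete = [c for c in cookies if not required <= c.keys()]

# Enqueueing needs a running Lacus instance, so it is left commented out here.
# The URL below is a hypothetical local instance.
# from pylacus import PyLacus
# lacus = PyLacus("http://127.0.0.1:7100")
# uuid = lacus.enqueue(url="http://domain.onion", cookies=cookies)
```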

And one more thing: the domain value limits where the cookie will be sent (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#domaindomain-value). There is no way to tell the browser "just whatever, send it along with every GET request", so the best I can do, if the domain isn't set in the query, is to set it to the domain of the capture. But if the cookie is for a captcha, and the captcha is on another domain, we have no way to know and the cookie won't be there :(

Rafiot commented 2 months ago

Alright, I pushed a fix (our best option as a fix at least).

It requires updating LacusCore (give me a sec for that, I need to package it first). And be careful: if the domain and path aren't in the cookies parameter, we use the hostname of the URL we capture. If a cookie is aimed at another domain, it will be ignored.

Rafiot commented 2 months ago

More details: https://github.com/ail-project/LacusCore/releases/tag/v1.9.6