MontFerret / ferret

Declarative web scraping
https://www.montferret.dev/
Apache License 2.0
5.72k stars 299 forks

Is there a way to always get absolute URLs? #576

Open gonssal opened 3 years ago

gonssal commented 3 years ago

I wanted to know if there's a way to make Ferret always return absolute URLs when they are relative in the source code, like web browsers do.

I'm crawling a site by getting a bunch of href attribute values from different anchors into an array and then iterating that array to load and return the content I need from each of the URLs.

The problem is that some of the URLs are absolute (https://example.com/whatever) and others are relative (/whichever), so when I try to get a DOCUMENT from one of the relative URLs, I get the following error:

Failed to execute the query
failed to retrieve a document /whichever: Get /whichever: unsupported protocol scheme "": DOCUMENT(url) at 11:16: FORurlinurlsLETpropDoc=DOCUMENT(url)RETURN{...} at 10:1

I'd ideally want to run the entire process in a single FQL script, but I couldn't find a way to convert the relative URLs or make them work, so it seems my only option is to first return them to a Go program to be fixed and then run an additional data-gathering query on each of them.
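For reference, the Go-side workaround described above could look roughly like the sketch below. It is not part of Ferret: `absoluteURLs` is a hypothetical helper, and it assumes the FQL query returns the URL of the page the links were found on together with the raw href values. It relies only on the standard library's `net/url` package, whose `ResolveReference` follows the RFC 3986 resolution rules that browsers also implement.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteURLs is a hypothetical helper (not a Ferret API). It resolves each
// href against pageURL: absolute hrefs pass through unchanged, relative ones
// inherit the scheme and host of the page they were found on.
func absoluteURLs(pageURL string, hrefs []string) ([]string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return nil, fmt.Errorf("parse page URL %q: %w", pageURL, err)
	}

	out := make([]string, 0, len(hrefs))
	for _, href := range hrefs {
		ref, err := url.Parse(href)
		if err != nil {
			return nil, fmt.Errorf("parse href %q: %w", href, err)
		}
		out = append(out, base.ResolveReference(ref).String())
	}
	return out, nil
}

func main() {
	// The page URL and hrefs are made-up values matching the issue text.
	urls, err := absoluteURLs("https://example.com/listing", []string{
		"https://example.com/whatever",
		"/whichever",
	})
	if err != nil {
		panic(err)
	}
	for _, u := range urls {
		fmt.Println(u)
	}
}
```

The resulting list can then be passed to a second query that loads each document, which is the two-step process the issue is hoping to avoid.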

ziflex commented 3 years ago

If it's relative, why don't you just concat it with a base url?

doc.url + link.attributes.href

gonssal commented 3 years ago

If it's relative, why don't you just concat it with a base url?

doc.url + link.attributes.href

Because, as I explain in the issue (in the third paragraph specifically), there are both relative and absolute URLs.

ziflex commented 3 years ago

You can do something like this:

LET href = link.attributes.href
LET url = CONTAINS(href, "http") ? href : doc.url + link.attributes.href

I might add helper functions for URL manipulation in a future release.

gonssal commented 3 years ago

I ended up using FIND_FIRST instead, thank you.

I think it would be really nice to automatically convert all relative paths in href, src, etc. to absolute ones, the same way web browsers do: if you hover over a link, the browser always shows the absolute URL it points to. Considering this is a crawling tool, I don't think relative URLs make a lot of sense.

This is also especially true for URI fragments. For example, if I'm on https://example.com/some-url and there's an anchor with href="#marker", with your proposed solution I'd get https://example.com/#marker instead of the correct https://example.com/some-url#marker.
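To illustrate the browser-style behaviour being requested, here is a minimal Go sketch of RFC 3986 reference resolution using the standard `net/url` package. It is only an illustration of the expected results, not something Ferret provides: `resolveLikeBrowser` is a hypothetical name, and the URLs are the ones used in this thread.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLikeBrowser is a hypothetical helper (not a Ferret API). It resolves
// href against the URL of the page it appears on, following RFC 3986.
func resolveLikeBrowser(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "https://example.com/some-url"
	hrefs := []string{
		"#marker",                      // fragment only: keeps the page path
		"/whichever",                   // root-relative: keeps scheme and host
		"https://example.com/whatever", // already absolute: unchanged
	}
	for _, href := range hrefs {
		abs, err := resolveLikeBrowser(page, href)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-30s -> %s\n", href, abs)
	}
}
```

A fragment-only href keeps the full path of the current page, so `#marker` resolves to `https://example.com/some-url#marker`, while `/whichever` replaces the path and resolves to `https://example.com/whichever`. Plain string concatenation cannot get both cases right with a single base string.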