Open gonssal opened 3 years ago
If it's relative, why don't you just concat it with a base url?
doc.url + link.attributes.href
Because as I explain in the issue, there's both relative and absolute URLs. In the third paragraph specifically.
You can do something like this:
LET href = link.attributes.href
LET url = CONTAINS(href, "http") ? href : doc.url + link.attributes.href
I might add helper functions for URL manipulation in a future release.
I ended up using FIND_FIRST instead, thank you.
I think it would be really nice to automatically convert all relative paths in href, src, etc. the same way web browsers do: if you hover over a link, it always shows the absolute URL it points to. Considering this is a crawling tool, I don't think relative URLs make a lot of sense.
This is especially true for URI fragments. For example, if I'm on https://example.com/some-url and there's an anchor with href="#marker", with your proposed solution I'd get https://example.com/#marker instead of the correct https://example.com/some-url#marker.
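For reference, this is roughly what browser-style resolution looks like outside of FQL. It is a minimal sketch in Go using the standard net/url package; resolveHref is a hypothetical helper name and not part of Ferret.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveHref (hypothetical helper) resolves href against the URL of the
// page it was found on, using RFC 3986 reference resolution.
func resolveHref(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// ResolveReference leaves absolute URLs untouched and resolves
	// relative paths and bare fragments against the base.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "https://example.com/some-url"
	for _, href := range []string{"#marker", "/whichever", "https://example.com/whatever"} {
		abs, err := resolveHref(page, href)
		if err != nil {
			fmt.Println("skipping", href, "-", err)
			continue
		}
		fmt.Println(abs)
	}
	// Prints:
	// https://example.com/some-url#marker
	// https://example.com/whichever
	// https://example.com/whatever
}
```

The fragment case keeps the page path, which plain string concatenation with a base URL cannot guarantee.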
I wanted to know if there's a way to make Ferret always return absolute URLs when they are relative in the source code, like web browsers do.
I'm crawling a site by getting a bunch of href attribute values from different anchors into an array and then iterating that array to load and return the content I need from each of the URLs.
The problem is that some of the URLs are absolute (https://example.com/whatever) and others are relative (/whichever), so when I try to get a DOCUMENT from one of the relative URLs, I get the following error:
I'd ideally want to run the entire process in a single FQL script, but I couldn't find a way to convert the relative URLs or make them work, so it seems my only option is to first return them to a Go program to be fixed and then run an additional data-gathering query on each of them.
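The Go-side fix mentioned above could look something like the sketch below. It assumes the first query returns the hrefs as a list of strings together with the URL of the page they came from; absolutize is a hypothetical name, and the code that runs the Ferret queries is left out.

```go
package main

import (
	"fmt"
	"net/url"
)

// absolutize (hypothetical helper) turns a mixed list of relative and
// absolute hrefs into unique absolute URLs, resolved against pageURL.
// Hrefs that fail to parse are skipped.
func absolutize(pageURL string, hrefs []string) ([]string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return nil, fmt.Errorf("invalid page URL: %w", err)
	}
	seen := make(map[string]bool)
	out := make([]string, 0, len(hrefs))
	for _, href := range hrefs {
		ref, err := url.Parse(href)
		if err != nil {
			continue // unparsable href, nothing to load
		}
		abs := base.ResolveReference(ref).String()
		if !seen[abs] {
			seen[abs] = true
			out = append(out, abs)
		}
	}
	return out, nil
}

func main() {
	hrefs := []string{"/whichever", "https://example.com/whatever", "/whichever"}
	urls, err := absolutize("https://example.com/list", hrefs)
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, u := range urls {
		fmt.Println(u) // each of these can be passed to the second query
	}
}
```

Deduplicating here is a design choice so the second query does not load the same page twice; drop the seen map if order and repeats matter.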