coherentdigital / coherencebot

Apache Nutch is an extensible and scalable web crawler
https://nutch.apache.org/
Apache License 2.0

Store the HTML page URL pointing at the PDF in the artifact metadata #10

Closed PeterCiuffetti closed 3 years ago

PeterCiuffetti commented 3 years ago

Currently, when the export script (in Python, part of the coherent-orgs repo) runs, it saves the URL of the PDF in the artifact metadata 'url' value.

When there is an HTML anchor URL leading to the PDF report, which should be true in all cases (except when the PDF is a seed URL, which should be extremely rare), we want to select one of the anchor pages and use it as the artifact 'url'. This will lead users in POCO to the page that points at the PDF rather than to the PDF directly.

The advantage of this is that publishers will then get analytics about visits from POCO users. Analytics are not possible for PDF requests because they do not emit the GA tracking tag.

One thing to check: the outbound links data structure used in Nutch has the anchor text and the anchor URL, but I'm not sure if it has the URL of the page the anchor came from -- this needs to be determined.

In many cases there is more than one anchor leading to the URL, so we will need heuristics for selecting which one to use. The main variation might be the depth of the page that contains the anchor. We could check whether the lowest-depth anchor pages are better destinations than the deepest-depth anchor pages. Intuition suggests that lower depth is best, since deeper pages might be citations to the report and not the report listing or view page.
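A minimal sketch of the depth heuristic, assuming each candidate anchor page already carries a crawl depth. The `AnchorPage` type, its fields, and `pick_shallowest` are hypothetical names for illustration, not part of Nutch or the export script.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AnchorPage:
    """Hypothetical record for an HTML page that links to the PDF."""
    url: str    # URL of the page containing the anchor
    depth: int  # assumed crawl depth of that page (lower = closer to a seed)


def pick_shallowest(candidates: Sequence[AnchorPage]) -> Optional[AnchorPage]:
    """Return the candidate with the lowest depth, or None if there are none.

    Ties are broken by URL so the choice is deterministic between runs.
    """
    if not candidates:
        return None
    return min(candidates, key=lambda page: (page.depth, page.url))
```

Sorting on `(depth, url)` rather than depth alone is only there to make repeated exports pick the same page when several candidates share a depth.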

Another might be to select the anchor with the word 'PDF' in it (e.g. 'Download PDF' or 'View PDF') because if the anchor is not using the title as text, then we can assume the title is in the body of the HTML surrounding the anchor. (In other cases where the title is included in the anchor, this may be another clue that we are looking at a citation reference to the report, probably from within another report.)
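The anchor-text heuristic could be sketched as follows, assuming the inlinks are available as (page URL, anchor text) pairs. The function name `prefer_pdf_anchor` and the pair layout are assumptions for illustration only.

```python
import re
from typing import Optional, Sequence, Tuple

# Matches "PDF" as a standalone word, case-insensitively ("Download PDF", "View pdf").
PDF_WORD = re.compile(r"\bpdf\b", re.IGNORECASE)


def prefer_pdf_anchor(inlinks: Sequence[Tuple[str, str]]) -> Optional[str]:
    """Pick the page URL whose anchor text mentions 'PDF'.

    `inlinks` holds (page_url, anchor_text) pairs. If no anchor mentions
    'PDF', fall back to the first candidate; return None for an empty list.
    """
    for page_url, anchor_text in inlinks:
        if PDF_WORD.search(anchor_text or ""):
            return page_url
    return inlinks[0][0] if inlinks else None
```

For example, given an anchor reading 'Annual Report 2020' on one page and 'Download PDF' on another, this would select the second page.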

Open question to André: when using the HTML page for the artifact url, is there another schema element used for the PDF itself? Yes: file_url

PeterCiuffetti commented 3 years ago

I'm marking this as a small, even though it's uncertain whether I can get the URL I need.

PeterCiuffetti commented 3 years ago

This is done. I select the shortest inlink to the PDF as the artifact.uri.
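A rough sketch of what this selection might look like, assuming "shortest" means the shortest URL string and that the metadata uses the 'url' and 'file_url' keys mentioned above. The function `build_artifact_urls` is a hypothetical name and is not taken from the actual export script.

```python
from typing import Dict, Iterable


def build_artifact_urls(pdf_url: str, inlink_urls: Iterable[str]) -> Dict[str, str]:
    """Choose the landing page for an artifact and keep the PDF link separately.

    Assumption: the shortest inlink URL (by character count) is used as the
    landing page. If there are no inlinks (e.g. the PDF was a seed URL),
    the PDF URL itself is used for both fields.
    """
    candidates = [u for u in inlink_urls if u and u != pdf_url]
    # Break length ties alphabetically so the result is deterministic.
    page_url = min(candidates, key=lambda u: (len(u), u)) if candidates else pdf_url
    return {"url": page_url, "file_url": pdf_url}
```

Falling back to the PDF URL covers the rare seed-URL case described in the original request.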