Garmelon / PFERD

Programm zum Flotten, Einfachen Runterladen von Dateien ("program for the quick, easy downloading of files")

Contents of type "Inhaltsseite" won't get crawled #50

Open · Geronymos opened this issue 2 years ago

Geronymos commented 2 years ago

My analysis course uses the "Inhaltsseite" structure (the icon looks like a laptop showing a diagram) to provide the lecture script (which gets updated regularly) as well as the exercise sheets and their solutions.

Unfortunately I can't download them with PFERD. I tried the command line, a config file that downloads the whole course, and the explicit URL, but nothing works.

When executing `pferd kit-ilias-web [url] .`, it just says:

Loading auth:ilias
Loading crawl:ilias

Running crawl:ilias
Crawled     '.'

Report for crawl:ilias
  Nothing changed

And the folder stays empty.

Is this a misconfiguration on my end or is this type of structure not implemented yet?

I-Al-Istannen commented 2 years ago

> Is this a misconfiguration on my end or is this type of structure not implemented yet?

I'd guess the latter. Could you pass the `--explain` switch as the first parameter to pferd (before the `kit-ilias-web`)? Then PFERD should try to explain itself; maybe it will tell you that it has no idea what's happening.
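That is, roughly:

```
pferd --explain kit-ilias-web [url] .
```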

Geronymos commented 2 years ago

Here is the output with the `--explain` flag:

Loading config
  CLI command specified, loading config from its arguments
  Creating config for command 'kit-ilias-web'
Deciding which crawlers to run
  No crawlers specified on CLI
  Running crawlers specified in config
Loading auth:ilias
Loading crawl:ilias

Running crawl:ilias
Loading cookies
  Sharing cookies
  '/home/me/documents/uni/ilias/ana_blatt/.cookies' has newest mtime so far
  Loading cookies from '/home/me/documents/uni/ilias/ana_blatt/.cookies'
Creating base directory at '/home/me/documents/uni/ilias/ana_blatt'
Loading previous report from '/home/me/documents/uni/ilias/ana_blatt/.report'
  Loaded report successfully
Inferred crawl target: URL https://ilias.studium.kit.edu/goto.php?target=copa_1649818&client_id=produktiv
Decision: Crawl '.'
  Final result: '.'
  Answer: Yes
Parsing root HTML page
  URL: https://ilias.studium.kit.edu/goto.php?target=copa_1649818&client_id=produktiv
  Page is a normal folder, searching for elements
Crawled     '.'
Decision: Clean up files
  No warnings or errors occurred during this run
  Answer: Yes
Storing report to '/home/me/documents/uni/ilias/ana_blatt/.report'
  Stored report successfully
Total amount of HTTP requests: 1
Saving cookies
  Saving cookies to '/home/me/documents/uni/ilias/ana_blatt/.cookies'

Report for crawl:ilias
  Nothing changed

I-Al-Istannen commented 2 years ago

Yea, so it apparently did not recognize anything useful. I will have a look at it, but not before the ILIAS 7 migration in a few days if that's alright with you. That one will probably absolutely slaughter the HTML parser anyways :P

I-Al-Istannen commented 2 years ago

Could you have a look at what https://github.com/Garmelon/PFERD/releases/tag/v3.3.0 produces @Geronymos?

Geronymos commented 2 years ago

Even though PFERD 3.3 can download all regular content again (thank you for that!), it unfortunately still downloads nothing for those types of links. But it now recognizes that the page is a content page (see the explain log below).

As I see it, "Inhaltsseite" might be an option for the lecturer to write plain HTML. So maybe it could be handled like an "external link": download the page itself as plaintext and also download the links within it.

explain-log

```
Loading config
  CLI command specified, loading config from its arguments
  Creating config for command 'kit-ilias-web'
Deciding which crawlers to run
  No crawlers specified on CLI
  Running crawlers specified in config
Loading auth:ilias
Loading crawl:ilias

Running crawl:ilias
Loading cookies
  Sharing cookies
  '/home/me/documents/uni/ilias/ana_blatt/.cookies' has newest mtime so far
  Loading cookies from '/home/me/documents/uni/ilias/ana_blatt/.cookies'
Creating base directory at '/home/me/documents/uni/ilias/ana_blatt'
Loading previous report from '/home/me/documents/uni/ilias/ana_blatt/.report'
  Loaded report successfully
Inferred crawl target: URL https://ilias.studium.kit.edu/goto.php?target=copa_1649818&client_id=produktiv
Decision: Crawl '.'
  Final result: '.'
  Answer: Yes
Parsing root HTML page
  URL: https://ilias.studium.kit.edu/goto.php?target=copa_1649818&client_id=produktiv
  Page is a content page, searching for elements
Crawled     '.'
Decision: Clean up files
  No warnings or errors occurred during this run
  Answer: Yes
Storing report to '/home/me/documents/uni/ilias/ana_blatt/.report'
  Stored report successfully
Total amount of HTTP requests: 1
Saving cookies
  Saving cookies to '/home/me/documents/uni/ilias/ana_blatt/.cookies'

Report for crawl:ilias
  Nothing changed
```
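Purely to illustrate the idea above (this is not PFERD code; the function name is made up): collecting everything a content page links to, e.g. with BeautifulSoup, could look roughly like this, with the actual download / "link file" handling happening in a second step.

```python
# Sketch only: gather all hrefs from an "Inhaltsseite" so they can either be
# downloaded or stored as link files later. Not part of PFERD's real code base.
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_content_page_links(html: str, base_url: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
```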

I-Al-Istannen commented 2 years ago

The "content page" has a "file" feature which I added support for. I thought they were nice enough to use it but they are not...

I don't really want to crawl random pages linked from the content page - that could lead to weird network requests, errors when the remote file is behind authentication, and so on. I was about to suggest writing a dedicated crawler type for the math page, but they don't even link them there... So I guess I will have to find a compromise here.

  1. I could do a HEAD request to find out the content type from the remote server, store the item as an "external link" file if it is text/html, and otherwise download it. But that would cause an additional network request for each item - even if it is already present locally. (There is a rough sketch of this option after the list.)

  2. Slightly less fancy, I could just use the name of the link and perform the same check. That would allow me to do it in one request and to do nothing if the file is already present locally, but the file extension will be off.

  3. As a third option, I could just download them as-is, and you might end up with downloaded HTML files if a link points to something that cannot be downloaded directly.

All of these will lead to errors if there are links to files behind authentication.
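For illustration only, option 1 could look roughly like the following, using aiohttp (which PFERD already uses for HTTP). The helper name and its return values are made up and not part of PFERD's actual code:

```python
# Hypothetical helper: one extra HEAD request per link to decide whether to
# store it as an "external link" file or download it - exactly the per-item
# cost mentioned in option 1, even for files that already exist locally.
import aiohttp


async def classify_link(session: aiohttp.ClientSession, url: str) -> str:
    async with session.head(url, allow_redirects=True) as resp:
        content_type = resp.headers.get("Content-Type", "")
    return "link" if content_type.startswith("text/html") else "file"
```

Links behind authentication remain a problem here as well: the HEAD request may fail, or it may be redirected to a login page and then look like text/html.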