benibela / xidel

Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
http://www.videlibri.de/xidel.html
GNU General Public License v3.0

How to download a file with its original name or Content-Disposition? #85

Closed Baltazar500 closed 2 years ago

Baltazar500 commented 2 years ago

How can I download a file with its original name (or the name from Content-Disposition) using xidel, without curl/wget? The "--download=" switch only lets you save the file under a specified file name :(

Reino17 commented 2 years ago

Hello Baltazar500,

The string to enter for --download is actually an "extended string", but without the opening x" and closing ".
So with --download '{...}' you can insert any variable or function, just like you would for -e/--extract.

Related (duplicate?) issue: https://github.com/benibela/xidel/issues/38.

Baltazar500 commented 2 years ago

@Reino17, thanks. It works, but only for single expressions. When following page by page, I get an error after the file is downloaded:

xidel -f 'very-long-expression' --download '{replace($url, "^.*/", "")}' -f '//div[@file="text"]/a/@href'

Save as: List#1.txt
Error:
err:XPTY0004: Need context item that is a node to get root element
Possible backtrace:
  $08120890  TXQUERYENGINE__EVALUATESINGLESTEPQUERY,  line 9358 of /home/benito/hg/components/pascal/data/xquery.pas: perhaps TXQTermTryCatch + 136624 ? but unlikely
  $080F5547  TXQTERMPATH__EVALUATE,  line 3302 of /home/benito/hg/components/pascal/data/xquery_terms.inc: perhaps TXQTermBinaryOp + 3959 ? but unlikely
  $080E090A  TXQUERY__EVALUATE,  line 7524 of /home/benito/hg/components/pascal/data/xquery.pas: perhaps Q{http://www.w3.org/2005/xpath-functions}concat + 41114 ? but unlikely
  $080E0A3D  TXQUERY__EVALUATE,  line 7549 of /home/benito/hg/components/pascal/data/xquery.pas: perhaps Q{http://www.w3.org/2005/xpath-functions}concat + 41421 ? but unlikely
  $08080A64  TPROCESSINGCONTEXT__EVALUATEQUERY,  line 2218 of xidelbase.pas: perhaps ? ? but unlikely
  $0808002C  SUBPROCESS,  line 2062 of xidelbase.pas: perhaps ? ? but unlikely
  $0807F64C  TPROCESSINGCONTEXT__PROCESS,  line 2079 of xidelbase.pas: perhaps ? ? but unlikely
  $080801F0  PROCESSFOLLOWTO,  line 1998 of xidelbase.pas: perhaps ? ? but unlikely
  $08080061  SUBPROCESS,  line 2065 of xidelbase.pas: perhaps ? ? but unlikely
  $0807F8D1  TPROCESSINGCONTEXT__PROCESS,  line 2098 of xidelbase.pas: perhaps ? ? but unlikely
  $0808A3BB  PERFORM,  line 3891 of xidelbase.pas: perhaps ? ? but unlikely
  $080493D9  main,  line 84 of xidel.pas: perhaps ? ? but unlikely

Call xidel with --trace-stack to get an actual backtrace

When I put "--download" after the follow to the next page:

xidel -f 'very-long-expression' -f '//div[@file="text"]/a/@href' --download '{replace($url, "^.*/", "")}'

the file is not downloaded and I get an error.

Reino17 commented 2 years ago

That's because, as the error message mentions, there's no context item. You didn't provide input (a file or a URL).

benibela commented 2 years ago

If the download name is a directory, it uses the name from the URL. So you can do --download .

And the last option should preferably not be a -f.

Baltazar500 commented 2 years ago

@Reino17

That's because, as the error message mentions, there's no context item. You didn't provide input (a file or a URL).

When using a 'very-long-expression' as an extraction (-e), I get links from each following page. When using follow (and download) it ends on the first page :(

@benibela

If the download name is a directory, it uses the name from the URL. So you can do --download .

This works like the expression --download '{replace($url, "^.*/", "")}', but only the file from the base link is downloaded. The next (followed) page is not loaded and an error is thrown:

Error: err:XPTY0004: Need context item that is a node to get root element

benibela commented 2 years ago

When using a 'very-long-expression' as an extraction (-e), I get links from each following page. When using follow (and download) it ends on the first page :(

Try it with some input:

xidel '<start/>' -f 'very-long-expression' -f '//div[@file="text"]/a/@href' --download '{replace($url, "^.*/", "")}'

Baltazar500 commented 2 years ago

@benibela, sorry, the site I'm extracting data from has stopped working. Once it is back up, I will try this trick. Thanks :)

Baltazar500 commented 2 years ago

My problem was solved by using a "[ -f xxx ]" loop:

xidel [-f 'very-long-expression' --download '{replace($url, "^.*/", "")}' ] -f '//div[@file="text"]/a/@href'
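
For reference, the file-name expression used in the commands above, replace($url, "^.*/", ""), deletes everything up to and including the last "/" of the URL. Below is a minimal sketch of the same string operation in plain POSIX sh with sed. It only illustrates what the pattern does and is not how xidel implements it; the function name name_from_url and the example.org URL are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper, not part of xidel: mimics replace($url, "^.*/", "")
# by deleting the longest prefix that ends in "/".
name_from_url() {
  # ".*" is greedy, so the match runs up to the LAST slash in the URL.
  printf '%s\n' "$1" | sed 's|^.*/||'
}

# Example call with an assumed URL; prints the last path segment.
name_from_url 'https://example.org/files/report.txt'
```

With this pattern, a URL that ends in "/" yields an empty name, and a string without any "/" is returned unchanged, so the result may be worth checking before using it as a file name.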