danburzo / percollate

A command-line tool to turn web pages into readable PDF, EPUB, HTML, or Markdown docs.
https://danburzo.ro/projects/percollate/
MIT License
4.31k stars, 166 forks

Specifying output directory #153

Open rickcecil opened 1 year ago

rickcecil commented 1 year ago

Feature description

Describe the feature you're proposing. AFAICT, the -o option only works with a relative path AND with a filename specified. It would be great if

1) I could use absolute paths so that my scripts are less fragile.

2) If I only specified a directory, it would use the title of the page as the title of the doc in the same manner that it does now if you do not use the -o option.

Existing workarounds

Is there any way to obtain the desired effect with the current functionality? With a bash script, I could wget the web page and then use pup to put the title into a variable that could be used in percollate as the filename. (There's more to it than that as I would also want to clean the title to make sure there are no illegal characters and make sure it did not exceed the character count limit.)

danburzo commented 1 year ago

Hi, thanks for logging the issue.

  1. I could use absolute paths so that my scripts are less fragile.

The -o / --output path can be absolute, but admittedly the help text makes it sound that only relative paths are allowed. It was meant along the lines of when relative, it's relative to the current working directory, which incidentally is how paths normally work 😅 so maybe just drop the whole 'relative' part. Please note that the directory does need to exist, as percollate does not mkdir -p itself to the destination.
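Since the destination directory has to exist before percollate writes to it, a calling script can create it first. A minimal sketch, assuming a hypothetical helper named `ensure_output_dir`; the path and URL are placeholders, and the percollate call in the comment uses only the form shown in this thread:

```shell
# Hypothetical helper: create the destination directory (and any missing
# parents), then print its path so the caller can reuse it.
ensure_output_dir() {
  local dir="$1"
  mkdir -p -- "$dir" && printf '%s\n' "$dir"
}

# Usage (placeholders):
#   out_dir="$(ensure_output_dir /path/for/file)"
#   percollate pdf http://example.com/article.html -o "$out_dir/file.pdf"
```

Because `mkdir -p` succeeds when the directory already exists, the helper is safe to call on every run.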

  2. If I only specified a directory, it would use the title of the page as the title of the doc in the same manner that it does now if you do not use the -o option.

Using the --individual flag effectively turns the value of -o into a prefix, to which the web page titles are appended. So ending with a trailing slash (-o my/destination/) will create files inside destination. However, it can benefit from some cosmetic tweaks (it makes filenames start with a hyphen currently).

With a bash script, I could wget the web page and then use pup to put the title into a variable that could be used in percollate as the filename. (There's more to it than that as I would also want to clean the title to make sure there are no illegal characters and make sure it did not exceed the character count limit.)

The titles are currently transformed with slugify, but they might benefit from stricter rules, e.g. filenamify + truncation. Do you have a hard limit on the filename length, or just a preference?
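For scripts that want to enforce their own rules, stricter cleaning plus truncation could look roughly like the sketch below. This is not percollate's slugify logic nor the filenamify package; `safe_filename` is a hypothetical helper, and the default limit of 200 characters is an assumption (many filesystems cap a single name at 255 bytes).

```shell
# Hypothetical helper: reduce a title to a conservative ASCII filename.
# 1. Replace every run of bytes outside [A-Za-z0-9._-] with one hyphen.
# 2. Trim hyphens from both ends.
# 3. Truncate to at most max_len characters (default 200, an assumed limit).
safe_filename() {
  local title="$1"
  local max_len="${2:-200}"
  printf '%s' "$title" \
    | LC_ALL=C tr -cs 'A-Za-z0-9._-' '-' \
    | sed 's/^-*//; s/-*$//' \
    | cut -c "1-$max_len"
}

# Usage (placeholder title):
#   name="$(safe_filename 'Some: page / title?')"
```

Running `tr` under `LC_ALL=C` treats the input as bytes, so non-ASCII characters are also replaced by hyphens; that is lossy but keeps the result portable.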

rickcecil commented 1 year ago

Appreciate the fast response!

Just a heads up, I am on Ubuntu 20.04 and am using version 4.0 of Percollate.

The -o / --output path can be absolute, but admittedly the help text makes it sound that only relative paths are allowed. It was meant along the lines of when relative, it's relative to the current working directory,

Huh. I tried it a few times, but it kept throwing an error. In fact, just tried it again and it is still throwing errors. Here's what I'm doing:

percollate pdf http://example.com/article.html -o /path/for/file/

I get this error:

[Error: EISDIR: illegal operation on a directory, open '/path/for/file/'] { errno: -21, code: 'EISDIR', syscall: 'open', path: '/path/for/file/' }

At first, I thought it was a permissions error, because that's the first place you check, but the permissions are correct on my directory. And, when I do this, it works:

percollate pdf http://example.com/article.html -o /path/for/file/file.pdf

Then I saw the note about relative paths and figured that was the cause.

Using the --individual flag effectively turns the value of -o into a prefix,

percollate pdf http://example.com/article.html -o /path/for/file/ --individual

Now, this command works as expected, though, as you say, it adds a hyphen in front of the filename.

Something to note: this does not work:

percollate pdf http://example.com/article.html -o /path/for/file --individual

Notice the missing trailing slash at the end of "/path/for/file". It tries to create this and, at least in my attempts, fails:

/path/for/file-example.com/article.html

Given your description, I see why it works that way, but I did want to point out something that people might miss.
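One way to guard against the missing slash in a script is to normalize the directory before passing it to -o. A small sketch, assuming a hypothetical helper named `with_trailing_slash`; the percollate call in the comment repeats the command from this thread:

```shell
# Hypothetical helper: return the path unchanged if it already ends with a
# slash, otherwise append one, so that --individual places files inside the
# directory instead of using its name as a filename prefix.
with_trailing_slash() {
  case "$1" in
    */) printf '%s\n' "$1" ;;
    *)  printf '%s/\n' "$1" ;;
  esac
}

# Usage (placeholders):
#   percollate pdf http://example.com/article.html \
#     -o "$(with_trailing_slash /path/for/file)" --individual
```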

The titles are currently transformed with slugify

Sorry, my initial comment was just my thought experiment on how I would accomplish this without percollate. I've been bitten before by the creation of a filename that was too long, and it was a serious PITA to figure out how to delete that file. So now I am just extra cautious about length and illegal character output. Sounds like you've got that handled in percollate, though.

Lastly, I thought you might get a kick out of what I am trying to do... Basically, a few web pages are not allowing percollate to access the entire html page, but I've found that if I have singlefile grab the site first and send the result to stdout, then percollate can pull the html from stdout and create a new PDF or epub of the entire document. :D Something like this:

singlefile https://medium.com/article-file-name.html | percollate epub - --url=https://medium.com/ -o /path/to/save/ --individual

Anyway, thanks for the quick response. It seems like the best way to get what I want with what is already there would be to use the --individual option and then, maybe at the end, rename the files to remove the initial dash. Since this is happening in a bash script, that should be pretty easy to do.
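For reference, the rename step could be sketched like this. `strip_leading_hyphens` is a hypothetical helper name, and the sketch assumes the only change needed is dropping the hyphen(s) at the start of each filename in the output directory:

```shell
# Hypothetical helper: for every regular file in a directory whose name
# starts with a hyphen, rename it without the leading hyphen(s).
# Files are never overwritten: if the target name exists, the file is skipped.
strip_leading_hyphens() {
  local dir="${1%/}"   # tolerate a trailing slash on the argument
  local f name new
  for f in "$dir"/-*; do
    [ -f "$f" ] || continue        # also skips the literal pattern if nothing matched
    name="${f##*/}"                # filename without the directory part
    new="$name"
    while [ "${new#-}" != "$new" ]; do
      new="${new#-}"               # drop one leading hyphen per iteration
    done
    if [ -n "$new" ] && [ ! -e "$dir/$new" ]; then
      mv -- "$f" "$dir/$new"
    fi
  done
}

# Usage (placeholder path):
#   strip_leading_hyphens /path/to/save/
```

The `--` before the filenames in `mv` is there for safety, so that a name beginning with a hyphen can never be read as an option.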