Reading a remote file as source

sntran commented 3 years ago

Hi there,

First, thank you so much for making this library, it works really well.

Recently, I needed to upload a file from Google Drive. I use rclone to handle the remote storage.

I was thinking of using this for the input to nyuu:

procjson://"[name]",[size_in_bytes],"rclone cat 'drive:path/to/file'"

Here, rclone cat is a command that pipes the content of a remote file to stdout. rclone can also report the size of the file, so that value is easy to supply.
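For anyone building this source string from a script, here is a minimal sketch. It assumes that the part after `procjson://` is a comma-separated list of JSON values (name, size in bytes, then a command string or a file descriptor number), which is how the examples in this thread look. The helper name `buildProcJsonSource` and the size used below are made up for illustration.

```javascript
// Sketch: build a procjson:// source string for nyuu.
// Assumption: the text after "procjson://" is a comma-separated list of JSON
// values (name, size in bytes, command or fd), as in the examples in this thread.
// buildProcJsonSource is a hypothetical helper, not part of nyuu.
function buildProcJsonSource(name, sizeInBytes, command) {
  // JSON.stringify on an array yields '["name",123,"cmd"]'; strip the brackets.
  const body = JSON.stringify([name, sizeInBytes, command]).slice(1, -1);
  return 'procjson://' + body;
}

// Hypothetical size of 1000 bytes, purely for illustration.
console.log(buildProcJsonSource('Big_Buck_Bunny_4K.webm', 1000, "rclone cat 'drive:path/to/file'"));
```

If nyuu is spawned from Node.js with an argument array (rather than through a shell), the resulting string can be passed as a single argument without extra shell quoting.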

However, from the look of it, nyuu seems to read the whole file before starting the upload. This is extremely slow, since the file is effectively downloaded in full first.

I assume that's the case, based on this log line, which stays on screen for a while:

[INFO] Reading file Big_Buck_Bunny_4K.webm

Is that the default behaviour of nyuu? Can we make it so that it will start uploading immediately as the file is being downloaded from the remote?

animetosho commented 3 years ago

It should be uploading whilst it receives data from the process.
Note that read buffers are used, so if the download is particularly slow, it can take a while to fill. By default, it tries to read 1400KB (2x 700KB articles) before doing any uploads; you can change this behaviour with the --disk-req-size=700K option.
Also check that rclone cat isn't impeding the process with its own buffering.

The reading line in the output indicates that it's begun reading the file. The progress indicator shows how much has been uploaded, so if that's moving, it should be progressing. You can also use the -v switch which will display every article uploaded once it's been posted.

You can also try reducing both --disk-req-size and --article-size to something really small to check that it's not trying to download the whole file before uploading.

sntran commented 3 years ago

Hi,

I don't think the download speed is slow. I usually get around 20MB/s downloading. I think the upload is the bottleneck, as it seems to be around 9MB/s.

However, there was no progress indicator. The last line shown was that "Reading file" line, and it stayed stuck there.

I should also note that the file was 3GB, so even with --disk-req-size=1400K, it should continue reading.
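As a rough sanity check of that reasoning, the arithmetic can be sketched as below. It assumes binary units (1 KB = 1024 bytes) and uses the 1400KB read request and the roughly 20MB/s download rate mentioned in this thread; the function name is made up.

```javascript
// Sketch: how long a read buffer takes to fill at a given download rate.
// Assumption: "KB" and "MB" are binary units (1 KB = 1024 bytes).
function secondsToFillBuffer(bufferBytes, bytesPerSecond) {
  return bufferBytes / bytesPerSecond;
}

const bufferBytes = 1400 * 1024;        // 1400KB read request
const downloadRate = 20 * 1024 * 1024;  // about 20MB/s, as observed

// Prints "0.068 s"
console.log(secondsToFillBuffer(bufferBytes, downloadRate).toFixed(3) + ' s');
```

So at that rate the buffer should fill in well under a second, which is why a long pause on the "Reading file" line looked suspicious.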

I understand that it may be hard for you to debug when you don't use rclone. If you're willing, I can see if there is a way for me to set up a test project for you to try.

animetosho commented 3 years ago

Right, thanks for the info.

Are you able to check that the download is actually occurring (e.g. rclone isn't halted waiting on user input)?

You may be able to check with something like:

rclone cat ... | nyuu 'procjson://"filename",100,0' ...

(the 0 there indicates stdin)
...which doesn't suppress stderr/stdin of rclone

I just tested rclone cat with a large local file off a slower drive, and can see it progressing as it reads the file, so it seems to work there. I can't test with a Google Drive backend though (account got banned).

If you want to try taking rclone out of the equation, you can try a wget/curl command with a URL like http://cachefly.cachefly.net/100mb.test to see if it seems to progress with that.

sntran commented 3 years ago

Hi again,

Yes, it seems to be a false alarm. rclone did download, and nyuu did upload. I believe I only saw the line [INFO] Reading file Big_Buck_Bunny_4K.webm without the progress indicator because I used Node.js to spawn nyuu and pipe its stderr to my stream, which does not handle progress-style output.

Which kind of sucks, as nyuu's progress bar provides lots of information. I suppose it only works when connected to a real terminal. I have checked the --progress options, and none of the other options work for me. But that is beyond the scope of this issue.

Closing the issue. Thanks for explaining!

animetosho commented 3 years ago

Thanks for finding that and reporting back!

If you have suggestions on what could help regarding the progress indicator, feel free to mention them.

sntran commented 3 years ago

I'm not sure how I could be of any help, since I am not very familiar with the way the console outputs text.

FWIW, I am using the Node.js readline module to read the lines from stderr, where nyuu outputs progress. I think the way that you write the progress bar is either not read correctly or ignored by readline. That's probably why I stopped seeing any further lines after [INFO] Reading file Big_Buck_Bunny_4K.webm.

As far as I understand, readline looks for \n, \r, or \r\n to determine a line. Looking at these lines, I don't see any of those end-of-line characters.

This is where I stopped, as I have no clue how the cursor works. However, if there is a way to detect that no terminal is attached to stderr, and switch to using \r\n in that case, that may get the progress bar to work with readline. It will probably output repeated progress lines, but that's fine, as I can handle it.
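To make the idea concrete, here is a small sketch of what such a switch could look like in Node.js. It is not how nyuu is implemented; `progressTerminator` and `writeProgress` are hypothetical helpers, and the progress text is invented. The only real API relied on is the `isTTY` property, which Node.js sets to true on streams attached to a terminal and leaves undefined otherwise.

```javascript
// Hypothetical helpers, not part of nyuu: choose how to end a progress update
// depending on whether the output stream is attached to a terminal.
function progressTerminator(stream) {
  // In Node.js, isTTY is true for terminal streams and undefined otherwise.
  // "\r" lets a terminal overwrite the line; "\r\n" gives line readers a full line.
  return stream.isTTY ? '\r' : '\r\n';
}

function writeProgress(stream, text) {
  stream.write(text + progressTerminator(stream));
}

// Invented progress text, for illustration only.
writeProgress(process.stderr, 'posted 10 of 100 articles');
```

When stderr is piped into another process, each update would then arrive as its own line, which a line-based reader can pick up.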

animetosho commented 3 years ago

By default, Nyuu doesn't spit out any progress if it's not connected to a terminal. The idea being that terminal escape codes don't mean a whole lot outside the terminal (particularly if you're redirecting the output to a file).

Perhaps you might want to try --progress log:1s which outputs a line of progress every second. Alternatively, you could use the TCP server and query the status for progress.

Unfortunately, the output Nyuu generates is meant to be read by humans, so it isn't the most friendly to parse.

sntran commented 3 years ago

Right. Thanks for walking with me through the trouble.

I tried the TCP option first, but setting up a TCP socket that keeps polling the server was a little too much for my script. Not a big deal, but parsing the log is easier :) I could get the data I want from the log. Thanks again!
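For reference, the parsing amounts to something like the sketch below. It assumes the progress lines contain text of the form "Article posting progress: N read, N posted, N checked", matching the template quoted in this thread; the exact output (for example any prefix before that text) may differ, so the pattern is not anchored to the start of the line. `parseProgressLine` is a made-up name.

```javascript
// Sketch: extract counters from a progress log line.
// Assumption: lines contain "Article posting progress: N read, N posted, N checked".
const PROGRESS_RE = /Article posting progress: (\d+) read, (\d+) posted, (\d+) checked/;

function parseProgressLine(line) {
  const match = PROGRESS_RE.exec(line);
  if (!match) return null; // not a progress line
  return {
    articlesRead: Number(match[1]),
    articlesPosted: Number(match[2]),
    articlesChecked: Number(match[3]),
  };
}

console.log(parseProgressLine('Article posting progress: 12 read, 10 posted, 8 checked'));
console.log(parseProgressLine('[INFO] Reading file Big_Buck_Bunny_4K.webm')); // null
```

Each line that readline emits from nyuu's stderr can be passed through this function, and anything that returns null can be skipped.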

Not sure if I should open a new issue for this, but I think it would be better if you can also support another option for --progress that can take a template string for the log. For example, the default would be:

"Article posting progress: {articlesRead} read, {articlesPosted} posted, {articlesChecked} checked"

These variable replacements could be the properties of the uploader. It would also be helpful if we could get the average transfer speed from it.
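A minimal sketch of the kind of substitution I have in mind is below. This is only an illustration of the proposal: `formatTemplate` is a hypothetical function, and the property names are the ones from the example template above, not confirmed names from nyuu's uploader.

```javascript
// Sketch: replace {name} placeholders with values from an object.
// formatTemplate is hypothetical; the property names come from the example
// template in this thread and are not confirmed uploader properties.
function formatTemplate(template, values) {
  return template.replace(/\{(\w+)\}/g, (placeholder, key) =>
    // Leave unknown placeholders untouched rather than printing "undefined".
    Object.prototype.hasOwnProperty.call(values, key) ? String(values[key]) : placeholder
  );
}

const template =
  'Article posting progress: {articlesRead} read, {articlesPosted} posted, {articlesChecked} checked';

// Prints "Article posting progress: 3 read, 2 posted, 1 checked"
console.log(formatTemplate(template, { articlesRead: 3, articlesPosted: 2, articlesChecked: 1 }));
```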

I can look into a PR for it if your hands are full :)

animetosho commented 3 years ago

Oh yeah, log doesn't display totals/percent progress.

If the aim is to have it read by an application, it probably makes sense to adopt something more designed for parsing (like outputting everything in JSON). A template formatting string just seems like something that is only applicable to a specific scenario.
Sound like it could work for you?
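As a rough illustration of the consuming side, assuming one JSON object per line, a reader could turn each line into whatever display it wants. The field names `articlesPosted` and `articlesTotal` below are invented for the sketch; no such output format exists yet.

```javascript
// Sketch: turn one line of (hypothetical) JSON progress output into display text.
// Assumption: one JSON object per line; the field names here are invented.
function progressLineFromJson(jsonLine) {
  const status = JSON.parse(jsonLine);
  const percent = status.articlesTotal > 0
    ? Math.floor((status.articlesPosted / status.articlesTotal) * 100)
    : 0;
  return `${percent}% (${status.articlesPosted}/${status.articlesTotal} articles posted)`;
}

// Prints "25% (50/200 articles posted)"
console.log(progressLineFromJson('{"articlesPosted":50,"articlesTotal":200}'));
```

A GUI could read the same object and ignore the text formatting entirely, which is the main appeal over a fixed template.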

sntran commented 3 years ago

JSON is easier to parse, but it will need to include all the data.

Template formatting string allows the user to choose what they want to include.

animetosho commented 3 years ago

but it will need to include all the data.

Not sure if I missed something, but is that an issue?

sntran commented 3 years ago

Sorry, I didn't mean it as an issue. But personally, if I plan to read the log from an application, parsing a full JSON blob every second or so through stderr seems wasteful when I will mostly just convert it into a progress line. But that is just my use case. I have no objection to having JSON.

animetosho commented 3 years ago

The idea is not to be restricted to the use case of a single application. For example, if someone wanted to create a GUI, then text is likely not what they want.

JSON parsing runs at tens of megabytes per second at the very least, so the difference between parsing 100 and 1000 bytes is a small fraction of a millisecond. Yes, there's some waste, but it's largely irrelevant, particularly in comparison to the cost of IPC.
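If you want to check the parsing cost on your own machine, a quick micro-benchmark along these lines would do. It is only a sketch: the function name is made up, the payload is synthetic, and the result varies with hardware and Node.js version, so no expected figure is given.

```javascript
// Sketch: rough JSON.parse throughput measurement in Node.js.
// The payload is synthetic; results depend on the machine and Node.js version.
function measureJsonParseMBps(payloadBytes, iterations) {
  const payload = JSON.stringify({ pad: 'x'.repeat(payloadBytes) });
  let parsedChars = 0; // use the parsed result so the loop is not optimized away
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) {
    parsedChars += JSON.parse(payload).pad.length;
  }
  const seconds = Number(process.hrtime.bigint() - start) / 1e9;
  const totalBytes = Buffer.byteLength(payload) * iterations;
  return totalBytes / seconds / (1024 * 1024);
}

console.log(measureJsonParseMBps(1000, 20000).toFixed(1) + ' MB/s for ~1000-byte payloads');
console.log(measureJsonParseMBps(100, 20000).toFixed(1) + ' MB/s for ~100-byte payloads');
```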

Thanks for the suggestion nonetheless!

animetosho / Nyuu

Reading a remote file as source #88