fsantini / KoboCloud

A set of scripts to synchronize a Kobo reader with popular cloud services

Add error handling and enhance logging #20

Closed · desyncr closed 4 years ago

desyncr commented 4 years ago

Google Drive seems to be returning errors about automated requests, and KoboCloud is not handling these scenarios appropriately.

In this PR I'm adding error handling for these scenarios and enhancing the logging output.

Example output:

2020-05-10_15:31:48 waiting for internet connection
Reading https://drive.google.com/drive/folders/<ID>
Getting https://drive.google.com/drive/folders/<ID>
Getting https://drive.google.com/drive/folders/<ID>
<ID>
File info: <FILEID>,\x22<FILENAME>
File code: <FILEID>
File name: <FILENAME>
Using custom userAgent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
Getting remote file information:
  Command: '/usr/local/kobocloud/curl --cacert "/usr/local/kobocloud/ca-bundle.crt"  -A 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'  -k -L --silent --head https://drive.google.com/uc?id=<ID>&export=download'
  Status: 0
Remote file information:
  Remote size: 
  Status code: 405
Error: Forbidden
Having problems contacting Google Drive. Try again in a couple of minutes.
Reading #UNINSTALL
Comment found
2020-05-10_15:31:59 done
fsantini commented 4 years ago

Hi! Nice PR, thank you! Have you tried saving the output of $curlHead in a variable to avoid creating a temporary file?

# Quote the variable when echoing so the newlines in the HEAD response
# survive and the line-oriented filters below still work.
remoteInfo=`$curlHead`
remoteSize=`echo "$remoteInfo" | tr A-Z a-z | sed -n 's/^content-length: \([0-9]*\).*/\1/p'`
statusCode=`echo "$remoteInfo" | grep 'HTTP/' | tail -n 1 | cut -d' ' -f2`

I don't know if it would work but it would be neater.

desyncr commented 4 years ago

@fsantini Yeah, that was my first approach, but (for some reason I still haven't figured out) it wasn't saving any output to the variable. I'll try again later.
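
One possible culprit, just a guess (a generic shell demo, not this repo's code): if the captured output is later echoed unquoted, its newlines collapse into spaces and any line-oriented filter downstream sees a single line.

#!/bin/sh
# Newlines survive command substitution, but not unquoted expansion
remoteInfo="HTTP/1.1 200 OK
content-length: 12345"

echo $remoteInfo | wc -l      # prints 1: both lines flattened into one
echo "$remoteInfo" | wc -l    # prints 2: the newline is preserved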

desyncr commented 4 years ago

One thing I noticed is that Google blocks you once you go over an unknown number of requests from your IP. Once blocked, you won't be able to download from Drive unless you're logged into a Google account.

Taking that into account, these changes may reduce the detection rate, but you'll certainly be blocked if you sync too often. In that case, with these new changes you'll at least know for sure what's going on.

desyncr commented 4 years ago

Another thing is that they stop responding to HEAD requests once you're blocked, so the file size check fails. But I think we can get around that by just checking the status code.
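
A rough sketch of that check, reusing the variable names from the snippet above (illustrative only, not the code in this PR):

remoteInfo=`$curlHead`
statusCode=`echo "$remoteInfo" | grep 'HTTP/' | tail -n 1 | cut -d' ' -f2`
case "$statusCode" in
  2*) : ;;                                             # 2xx: fine, proceed with the download
  4*|5*) echo "Error: server returned $statusCode" ;;  # blocked or gone
  *) echo "No usable HEAD response (possibly blocked)" ;;
esac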

desyncr commented 4 years ago

Removed the user-agent and custom headers configuration, as it causes curl to output the binary contents into the log file. I'm not sure why that happens, but -A and -H cause this behavior.

desyncr commented 4 years ago

I'm wondering if we could avoid getting the size information with -I (which doesn't work, AFAIK) and perform the download straight away. This way we'd be making half as many requests.

fsantini commented 4 years ago

> I'm wondering if we could avoid getting the size information with -I (which doesn't work, AFAIK) and perform the download straight away. This way we'd be making half as many requests.

Yes, of course. You would lose the resume functionality, so I wouldn't disable it for all downloads; Dropbox and pCloud work, for example. Maybe it can be an extra parameter in getRemoteFile.sh.
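
Something along these lines, as a hypothetical sketch (the "noresume" parameter and the bare curl calls are made up for illustration, not the actual getRemoteFile.sh interface):

#!/bin/sh
# Hypothetical: an optional third parameter that disables resuming
url="$1"
dest="$2"
mode="$3"

if [ "$mode" = "noresume" ]; then
    # GDrive case: skip the HEAD request and download from scratch
    curl -L -o "$dest" "$url"
else
    # default: -C - resumes a partially downloaded file
    curl -L -C - -o "$dest" "$url"
fi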

fsantini commented 4 years ago

> Removed the user-agent and custom headers configuration, as it causes curl to output the binary contents into the log file. I'm not sure why that happens, but -A and -H cause this behavior.

I think that, the way you are passing arguments to getRemoteFile.sh, your "extra" parameter gets split by the shell in getGDriveFile, and what getRemoteFile sees as the fourth parameter is only the -H. Try echoing the $extra variable in getRemoteFile to see if this is the case.
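
To see the splitting in action (a generic shell demo with made-up values, not the repo's code):

#!/bin/sh
# Unquoted expansion splits $extra on whitespace before the call happens
extra='-A "SomeAgent/1.0" -H "X-Test: 1"'

set -- url dest size $extra      # unquoted: $4 is just -A
echo "fourth parameter: $4"

set -- url dest size "$extra"    # quoted: $4 is the whole string
echo "fourth parameter: $4"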

desyncr commented 4 years ago

> I'm wondering if we could avoid getting the size information with -I (which doesn't work, AFAIK) and perform the download straight away. This way we'd be making half as many requests.

> Yes, of course. You would lose the resume functionality, so I wouldn't disable it for all downloads; Dropbox and pCloud work, for example. Maybe it can be an extra parameter in getRemoteFile.sh.

Great! I'm going to disable this only for GDrive. It will disable the resume functionality, meaning a partially downloaded file will not be resumed; you'll have to delete the file manually and sync again.

> I think that, the way you are passing arguments to getRemoteFile.sh, your "extra" parameter gets split by the shell in getGDriveFile, and what getRemoteFile sees as the fourth parameter is only the -H. Try echoing the $extra variable in getRemoteFile to see if this is the case.

I'll check that out, thanks for the heads-up. Meanwhile, I'm going to reduce this PR's scope to handling 4xx and 5xx responses properly and adding logging information.

So for now this is my checklist:

desyncr commented 4 years ago

If anyone could try this out I'd appreciate it. I tested with GDrive and Dropbox.

A few notes:

fsantini commented 4 years ago

I tested the PR and it seems to work. I also edited the pCloud code because the API changed recently.

desyncr commented 4 years ago

Awesome!