airdcpp-web / airdcpp-release-validator

AirDC++ extension that performs various validations for release directories

Validator fails releases with files in skiplist #10

Closed gangefors closed 3 years ago

gangefors commented 4 years ago

I don't know whether this is intentional or not, but since the new release of this extension started validating bundles in the share, I've noticed that it blocks a lot of bundles from being shared because of files on disk that fail validation. These files are in the skiplist and never become part of the shared bundle, so they should not block a bundle from being shared. It's fine to warn about them (rows 1 and 2 in the example), but they should not block the bundle from being shared.

Example:

[1] 09:58:43 [airdcpp-release-validator] /Share/***/: Multiple NFO files
[2] 09:58:43 [airdcpp-release-validator] /Share/***/: Extra files in release directory (1 file(s): .date)
[3] 09:58:43 [airdcpp-release-validator] Following problems were found while scanning the share directory /Share/***/: multiple NFO files (count: 1), extra files in release directory (count: 1)

[1] Warning caused by an NFO being present, even though it matches a regex in the share skiplist. [2] Warning caused by a hidden file, while the "Share hidden files" setting is unchecked. [3] An error indicating that the bundle will not be added to the share.

maksis commented 4 years ago

This was initially considered, and I even added an API method for it, as there are various other conditions that would cause items to be filtered from the share. I don't remember whether there was a specific reason for not adding the check to this extension after all, but I can take a look at it (unless someone else wants to play with the code and try it out).

maksis commented 3 years ago

Please try this with version 1.1.0-beta. The option to ignore excluded files is disabled by default as checking each shared file/directory via the API increases the scan duration quite a lot (for me it went from 5 seconds to almost 30 seconds when scanning the whole share).

gangefors commented 3 years ago

Verified +1

Where can I look at the commit(s) adding this feature? I don't see any commits made to master, and no other branch is available in the GitHub repo.

Here are some added details regarding the testing I did.

Adding a new unscanned bundle to the share:

# Disabled
21:53:44
File list refresh initiated for the path /Share/
21:53:44
[airdcpp-release-validator] /Share/: Multiple NFO files
21:53:44
[airdcpp-release-validator] /Share/: Extra files in release directory (1 file(s): .date)
21:53:44
[airdcpp-release-validator] Following problems were found while scanning the share directory /Share/: multiple NFO files (count: 1), extra files in release directory (count: 1)
21:53:44
The directory /Share/ has been refreshed

# Enabled
21:54:34
The directory /Share/ has been queued for refreshing
21:54:34
File list refresh initiated for the path /Share/
21:54:34
Some of the files from directory /Share/ won't be shared: File matches the share skiplist (affected file(s): imdb.nfo)
21:54:34
The directory /Share/ has been refreshed (824.85 MiB of files have been added for hashing; they won't be visible in the file list until they have finished hashing)
21:54:41
The directory /Share/ has finished hashing: 20 files (824.85 MiB) have been hashed in 5 seconds (157.71 MiB/s)

Scanning an existing share with no issues: about 60 ms additional time per bundle.
Scanning an existing share with a lot of issues: about 115 ms added per bundle.

I think the added time is acceptable, considering that this extension now lets me share bundles that wouldn't otherwise be added. The extra time usually only applies when first adding a bundle, where it is mostly unnoticeable; it only really shows when doing a full share scan.

gangefors commented 3 years ago

And thanks for fixing the issue. 👍

maksis commented 3 years ago

Where can I look at the commit(s) adding this feature?

Ah, I forgot to push the changes.

You will also find scan durations and the number of ignored files/directories in the extension's log file (airdcpp-release-validator/logs/output.log).

gangefors commented 3 years ago

I didn't know such logs existed, but after reading the source I can see that there are quite detailed figures in there. Thanks.

doobnet commented 3 years ago

checking each shared file/directory via the API

So for each file it's making an API request to check whether it's excluded or not? Is there a reason not to make just one API request to pull down the skiplist and have the extension perform the check itself?

Alternatively, have an API endpoint that accepts a batch of files to check.
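On the extension side, the batching idea could look roughly like this (a minimal sketch; AirDC++ has no bulk validation endpoint in this thread, so the endpoint call at the end is hypothetical and shown only as a comment, while the path-splitting part is real code):

```javascript
// Split a list of paths into fixed-size batches so each API request
// carries many paths instead of one request per file.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const paths = ['/Share/a.nfo', '/Share/b.sfv', '/Share/c.mkv', '/Share/d.rar', '/Share/e.jpg'];
const batches = chunk(paths, 2);
console.log(batches.length); // 3: five paths become three requests instead of five

// Each batch would then go to a single (hypothetical) bulk endpoint, e.g.:
// await session.postApi('share/validate_paths_bulk', { paths: batch });
```

With a realistic batch size (hundreds of paths), the request count for a full share scan would drop by two or three orders of magnitude.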

gangefors commented 3 years ago

@doobnet I looked at the source for both the extension and the check done in airdcpp-webclient when calling the API endpoint. There is more to it than just the skiplist: there are other exclude options, such as not sharing hidden files, 0-byte files, excluded directories, etc. I personally don't see any reason to duplicate that complexity in this extension. Code duplication always creates issues.

I don't know whether this scan could run with multiple threads in parallel to reduce the total time and minimize TCP request wait times. It might already be running multiple threads, but I'm not familiar enough with Node.js to understand the extension in detail.

doobnet commented 3 years ago

I looked at the source for both the extension and the check done in airdcpp-webclient when calling the API endpoint. There is more to it than just the skiplist: there are other exclude options, such as not sharing hidden files, 0-byte files, excluded directories, etc. I personally don't see any reason to duplicate that complexity in this extension. Code duplication always creates issues.

Fair enough. I missed that there was more to it than just the skiplist.

I don't know whether this scan could run with multiple threads in parallel to reduce the total time and minimize TCP request wait times. It might already be running multiple threads, but I'm not familiar enough with Node.js to understand the extension in detail.

I guess the first thing would be to figure out where the extra time is spent. Node.js uses non-blocking I/O and an event loop, so it should be able to issue multiple requests without waiting for the first one to complete, without using threads. Alternatively, support passing a batch of paths in a single request to reduce the number of requests that need to be made.
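As a rough illustration of the event-loop point: `Promise.all` starts all checks before awaiting any of them, so the total wait is about one round-trip rather than one per file. `checkPath` below is a stand-in for the real per-file API call, with the exclusion logic reduced to a made-up ".nfo means excluded" rule:

```javascript
// Illustration only: checkPath simulates a network round-trip with a
// timer; the ".nfo means excluded" rule is invented for this example.
const checkPath = (path) =>
  new Promise((resolve) =>
    setTimeout(() => resolve({ path, excluded: path.endsWith('.nfo') }), 10)
  );

// All requests are in flight concurrently; nothing blocks the event loop.
const checkAll = (paths) => Promise.all(paths.map(checkPath));

checkAll(['/Share/imdb.nfo', '/Share/movie.mkv', '/Share/movie.nfo']).then((results) => {
  const excluded = results.filter((r) => r.excluded).map((r) => r.path);
  console.log(excluded.length); // 2
});
```

Whether this actually helps here depends on where the time goes; if the server processes requests serially, firing them concurrently from the extension won't change much.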

If the extra time is spent on the server side, it's a little more complicated. Is the time spent receiving the requests, looking up files on disk, or processing everything? If the time is spent waiting for I/O, i.e. receiving the requests or looking up the files, then non-blocking I/O and an event loop would help (multiple threads can also be used for this). If the time is spent on CPU cycles, then the only thing that would help is using multiple threads, up to the number of logical cores.

maksis commented 3 years ago

I've done some profiling with the application, and the latest beta dropped the scan time from 30 seconds to 20 seconds on my slow NAS server (results with and without the ignore checks):

scanned 1750 directories and 51850 files, took 19624 ms (11.21 ms per directory, 0.38 ms per file), ignored 298 directories and 15 files

scanned 2048 directories and 53345 files, took 5421 ms (2.65 ms per directory, 0.10 ms per file)

Results on a faster system with the Windows app (results with and without ignore checks):

scanned 7273 directories and 55551 files, took 3572 ms (0.49 ms per directory, 0.06 ms per file), ignored 82 directories and 955 files

scanned 8116 directories and 59313 files, took 1516 ms (0.19 ms per directory, 0.03 ms per file)

I'm not sure the results are that bad, given the number of requests it makes and all the communication overhead.

In the API, only about 25% of the total CPU time is spent in the application code, while the rest goes to websockets/networking/JSON parsing. A new validation endpoint accepting multiple files in a single request would be the next thing to try to make it faster.

gangefors commented 3 years ago

Thanks for the profiling. I find those numbers totally acceptable for my use cases. Scanning a single bundle is not noticeably slower, and scanning the whole share is usually done in the background, where I don't really care whether it takes 5 minutes or 15, or even more.

And again, thanks for fixing the issue. Much appreciated.

doobnet commented 3 years ago

I agree with gangefors.