WebCuratorTool / webcurator

The root of the webcurator tool project, containing all modules needed to run a fully functional webcurator tool.
Apache License 2.0

NetworkMap Tree View: it would be useful to prune or recrawl the selected URLs and their children #76

Closed leefrank9527 closed 1 year ago

leefrank9527 commented 1 year ago

Currently only the selected URLs can be pruned or recrawled. To prune or recrawl their outlink URLs, you have to expand the tree nodes and select all of them. A more complex scenario is when you would like to prune or recrawl a URL and its outlink URLs, but exclude some of those outlinks. It would be useful if users could prune or recrawl URLs and their outlink URLs in one session.

leefrank9527 commented 1 year ago

A solution is to extend the context menu on the tree view so that users can apply the "prune" or "recrawl" actions either to the selected URLs alone or to the URLs and their outlink URLs. The details:

  1. Extend the context menu to allow more actions. For example, the "Prune" submenu:
     - Prune the current URL
     - Prune the current URL and its outlinks
     - Prune the selected URLs
     - Prune the selected URLs and their outlinks
     - (separator)
     - Exclude the current URL
     - Exclude the current URL and its outlinks
     - Exclude the selected URLs
     - Exclude the selected URLs and their outlinks
  2. Each action applies only to the selected (or current) rows and ignores the unselected rows; otherwise it would be ambiguous. To exclude some of the outlinks, select the rows you would like to exclude and click the relevant "Exclude" menu item. The URLs to be pruned or recrawled are currently cached in the "Patch" list, so users can apply actions multiple times until the list holds the right URLs.
  3. The "Crawler path view", "Folders View" and "Inspect View" will be changed based on the same solution. Please note that the "Folders View" is organised by the names of the URLs rather than by their crawler path. The child nodes in the folder tree view are the subfolders or URLs inside the node or its subfolders. For the prune and recrawl actions, "outlinks" always refers to the crawler path of the URLs.
  4. If the outlinks are included in a prune or recrawl action, they will be retrieved from the backend server, cached in the "Patch Harvest" list, and highlighted in the "Crawler path view", "Folders View" and "Inspect View", so users can see explicitly which URLs will be patched. For a huge set of outlinks, however, this will lead to poor performance and high memory usage.
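
The prune/exclude behaviour described in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual WCT code: `UrlNode`, `with_outlinks` and `PatchList` are hypothetical names, the crawl-path tree is assumed to be already loaded in memory rather than fetched from the backend, and "exclude" is assumed to mean removing URLs from the cached patch list.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Set


@dataclass
class UrlNode:
    """One row of the crawl-path tree (hypothetical structure)."""
    url: str
    outlinks: List["UrlNode"] = field(default_factory=list)


def with_outlinks(node: UrlNode) -> Iterator[str]:
    """Yield the node's URL and every URL reachable along its crawl path."""
    stack = [node]
    seen: Set[str] = set()
    while stack:
        current = stack.pop()
        if current.url in seen:
            continue  # guard against a URL appearing twice in the tree
        seen.add(current.url)
        yield current.url
        stack.extend(current.outlinks)


class PatchList:
    """Hypothetical stand-in for the cached list of URLs to be pruned."""

    def __init__(self) -> None:
        self.to_prune: Set[str] = set()

    def prune(self, nodes: Iterable[UrlNode], include_outlinks: bool = False) -> None:
        # Only the rows passed in are affected; unselected rows are ignored.
        for node in nodes:
            urls = with_outlinks(node) if include_outlinks else [node.url]
            self.to_prune.update(urls)

    def exclude(self, nodes: Iterable[UrlNode], include_outlinks: bool = False) -> None:
        # Assumed meaning of "exclude": drop the URLs from the cached list.
        for node in nodes:
            urls = with_outlinks(node) if include_outlinks else [node.url]
            self.to_prune.difference_update(urls)


# Prune a URL with all of its outlinks, then exclude one sub-branch.
ads = UrlNode("http://example.org/ads", [UrlNode("http://example.org/ads/banner.png")])
home = UrlNode("http://example.org/", [UrlNode("http://example.org/about"), ads])

patch = PatchList()
patch.prune([home], include_outlinks=True)
patch.exclude([ads], include_outlinks=True)
print(sorted(patch.to_prune))
# prints ['http://example.org/', 'http://example.org/about']
```

Because the list is just a cached set, the two operations can be repeated in any number of steps until it holds the intended URLs. The traversal also shows where the performance concern in item 4 comes from: every outlink of a large subtree ends up in the set.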

obrienben commented 1 year ago

fixed in #75