RequestPolicyContinued / subscriptions

subscriptions for RequestPolicyContinued
http://requestpolicycontinued.github.io/
Other
19 stars 8 forks source link

review official-allow_sameorg.json #11

Closed nodiscc closed 9 years ago

nodiscc commented 9 years ago

Fixes https://github.com/RequestPolicyContinued/subscriptions/issues/10

Review process:

Differences with the previous subscription:

All origin domains from the previous list work as expected.

nodiscc commented 9 years ago

I think this is ready to merge. See commit details and feel free to ask if you need any precision. This comment https://github.com/RequestPolicyContinued/subscriptions/issues/9#issuecomment-65481767 is also related to future updates of this subscription.

myrdd commented 9 years ago

Great work @nodiscc! I'd suggest that you may work directly on the master branch – just be careful about removing subscriptions as the code depends on them (currently). In other words, keep on managing subscriptions; you're doing a good job.

nodiscc commented 9 years ago

Thanks for pointing this out @myrdd I've not searched for duplicates yet. This is an error during moving _itemsByRegion manually. I still have to find a tool to deduplicate lines.

I added the world's top25 sites in cc0e24d.

myrdd commented 9 years ago

take a look at sort and uniq, optionally paired with grep.

nodiscc commented 9 years ago

I've removed all duplicate rules. The file passes the JSON validator at http://jsonlint.com/ I'll merge this after a coffee break, and keep working on top26-100 websites in another Pull Request.

myrdd commented 9 years ago

I've removed all duplicate rules.

Actually I could find only one duplicate line. (see below) ; I was wrong, see my next comment.

I still have to find a tool to deduplicate lines.

I successfully tested the following:

$ egrep -o "\{.*\}" official-allow_sameorg.json | uniq -d
{"o":{"h":"*.uol.com.br"},"d":{"h":"*.imguol.com"}}
nodiscc commented 9 years ago

This is what I did:

 ↳  sort official-allow_sameorg.json |uniq -c |sort
[...]
      2       {"o":{"h":"*.baidu.com"},"d":{"h":"*.bdstatic.com"}},
      2       {"o":{"h":"*.google.com"},"d":{"h":"*.gstatic.com"}},
      2       {"o":{"h":"*.tagged.com"},"d":{"h":"*.tagstat.com"}},
      2       {"o":{"h":"*.taobao.com"},"d":{"h":"*.taobaocdn.com"}},
      2       {"o":{"h":"*.uol.com.br"},"d":{"h":"*.imguol.com"}},
myrdd commented 9 years ago

Oh you're right – I forgot to sort beforehand.

FYI, this is what can be used alternatively:

$ egrep -o "\{.*\}" official-allow_sameorg.json | sort | uniq -d 
{"o":{"h":"*.baidu.com"},"d":{"h":"*.bdstatic.com"}}
{"o":{"h":"*.google.com"},"d":{"h":"*.gstatic.com"}}
{"o":{"h":"*.tagged.com"},"d":{"h":"*.tagstat.com"}}
{"o":{"h":"*.taobao.com"},"d":{"h":"*.taobaocdn.com"}}
{"o":{"h":"*.uol.com.br"},"d":{"h":"*.imguol.com"}}

The difference to @nodiscc's version: Besides the -d parameter (it means that only duplicates are displayed), grep removes non-rule lines and ensures that this special case is also catched:

[
  {"d":{"h":"*.A.com"}},
  {"d":{"h":"*.A.com"}}
]

You can test the difference easily:

#!/bin/bash

json='[\n  {"d":{"h":"*.A.com"}},\n  {"d":{"h":"*.A.com"}}\n]'

echo "# with grep:"
echo -e $json | egrep -o "\{.*\}" | sort | uniq -c

echo
echo "# without grep:"
echo -e $json | sort | uniq -c

output:

# with grep:
      2 {"d":{"h":"*.A.com"}}

# without grep:
      1 [
      1 ]
      1  {"d":{"h":"*.A.com"}}
      1  {"d":{"h":"*.A.com"}},

@nodiscc what about a simple makefile which collects some simple scripts?