michielbdejong closed 4 years ago
git checkout 75ec747
LAPTOP=1 RULES_ONLY=1 node --unhandled-rejections=strict tosback-import.js | sort > processing.txt
diff processing.txt crawl-files-list-current.txt
Output:
28c28
< crawl/ampedwireless.com/Terms of Service.txt Found 0 docname objects with name "Terms of Service" in ../../tosdr/tosback2/rules/ampedwireless.com.xml
---
> crawl/ampedwireless.com/Terms of Service.txt
157,158c157,158
< crawl/google.com/GOOGLE PRIVACY POLICY.txt Found 0 docname objects with name "GOOGLE PRIVACY POLICY" in ../../tosdr/tosback2/rules/google.com.xml
< crawl/google.com/Privacy Policy.txt equivalent to crawl_reviewed/google.com/Privacy Policy.txt
---
> crawl/google.com/GOOGLE PRIVACY POLICY.txt
> crawl/google.com/Privacy Policy.txt
239,241d238
< crawl/nest.com/Privacy Notice.txt
< crawl/nest.com/Privacy Notice.txt Found 2 docname objects with name "Privacy Notice" in ../../tosdr/tosback2/rules/nest.com.xml
< crawl/nest.com/Terms of Service.txt
243d239
< crawl/nest.com/Terms of Service.txt Found 2 docname objects with name "Terms of Service" in ../../tosdr/tosback2/rules/nest.com.xml
411c407
< crawl/visible.com/Visible Service Terms & Conditions.txt Found 0 docname objects with name "Visible Service Terms & Conditions" in ../../tosdr/tosback2/rules/visible.com.xml
---
> crawl/visible.com/Visible Service Terms & Conditions.txt
419c415
< crawl/windstream.com/Term of Service.txt Found 0 docname objects with name "Term of Service" in ../../tosdr/tosback2/rules/windstream.com.xml
---
> crawl/windstream.com/Term of Service.txt
442,443d437
< crawl_reviewed/amazon.com/Amazon Device Terms of Use.txt
< crawl_reviewed/amazon.com/Amazon Device Terms of Use.txt Found 2 docname objects with name "Amazon Device Terms of Use" in ../../tosdr/tosback2/rules/amazon.com.xml
454,455d447
< crawl_reviewed/amazon.com/Conditions of Use.txt
< crawl_reviewed/amazon.com/Conditions of Use.txt Found 2 docname objects with name "Conditions of Use" in ../../tosdr/tosback2/rules/amazon.com.xml
504,505d495
< crawl_reviewed/facebook.com/Terms of Service.txt
< crawl_reviewed/facebook.com/Terms of Service.txt Found 2 docname objects with name "Terms of Service" in ../../tosdr/tosback2/rules/facebook.com.xml
So the hunks with only < lines (the "Found 2" ones) are cases where it processes a given filePath twice, because the same document is mentioned twice in the rule.
And the hunks that also have > lines (the "Found 0" ones) are cases where the file exists but the rule uses a different name in the docname object.
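To make the two cases easier to tell apart without reading the diff by eye, the "Found N docname objects" lines can be grouped by N. This is a minimal sketch, not code from tosback-import.js: the function name `classifyDocnameMatches` and the returned object shape are assumptions made for illustration. The sample lines are copied from the output above.

```javascript
// Hypothetical helper (not part of tosback-import.js): groups log lines by
// how many docname objects were found for the file.
function classifyDocnameMatches(lines) {
  const pattern = /^(.*\.txt) Found (\d+) docname objects with name "(.*)" in (.*)$/;
  const result = { missingName: [], duplicated: [] };
  for (const line of lines) {
    const match = pattern.exec(line);
    if (!match) continue; // plain file path or some other status message
    const filePath = match[1];
    const count = Number(match[2]);
    if (count === 0) result.missingName.push(filePath); // rule uses a different name
    else if (count > 1) result.duplicated.push(filePath); // document listed twice in the rule
  }
  return result;
}

const sampleLines = [
  'crawl/ampedwireless.com/Terms of Service.txt Found 0 docname objects with name "Terms of Service" in ../../tosdr/tosback2/rules/ampedwireless.com.xml',
  'crawl/nest.com/Privacy Notice.txt',
  'crawl/nest.com/Privacy Notice.txt Found 2 docname objects with name "Privacy Notice" in ../../tosdr/tosback2/rules/nest.com.xml',
];
const classified = classifyDocnameMatches(sampleLines);
console.log(classified);
```

Run over the whole of processing.txt, the two arrays should correspond to the "Found 0" and "Found 2" counts.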
Now seeing:
done 77
covered 346
Unsupported type 107
Cannot read 48
Same type 14
processed 592 (77+346+107+48+14)
Found 0: 4
Found 2: 5
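As a quick sanity check on the total above, the five categories can be summed in code. This is only a sketch: the object keys are labels chosen for illustration, and the numbers are the ones listed above. The "Found 0" and "Found 2" counts are left out because the total above does not include them.

```javascript
// Counts copied from the list above; the key names are illustrative.
const categoryCounts = {
  done: 77,
  covered: 346,
  unsupportedType: 107,
  cannotRead: 48,
  sameType: 14,
};

// Sum all categories to compare against the number of processed files.
const processedTotal = Object.values(categoryCounts).reduce((sum, n) => sum + n, 0);
console.log(processedTotal); // 592
```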
After the fourth run,
sh stats.sh
592 processing
64 done
359 covered
14 Same type
107 Unsupported type
48 Cannot read
4 Found zero
5 Found two
and
grep fetch ../CGUs/services/* | wc -l
354
So it makes sense that 359 (covered) = 354 (fetch rules) + 5 (found two) but let me double-check that.
loopy:TOSBack-CGUs-bridge michiel$ cat failures.txt | grep Could\ not\ fetch | wc -l
38
loopy:TOSBack-CGUs-bridge michiel$ cat failures.txt | grep Could\ not\ filter | wc -l
13
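The two grep counts above could also be produced in one pass. The sketch below assumes only that each failure is reported on its own line containing one of the two phrases. The helper name `countByPhrase` and the sample lines (including the example.com URLs) are invented for illustration and are not taken from failures.txt.

```javascript
// Hypothetical helper: counts how many lines contain each phrase,
// similar in spirit to `grep <phrase> | wc -l` run once per phrase.
function countByPhrase(text, phrases) {
  const counts = Object.fromEntries(phrases.map((phrase) => [phrase, 0]));
  for (const line of text.split("\n")) {
    for (const phrase of phrases) {
      if (line.includes(phrase)) counts[phrase] += 1;
    }
  }
  return counts;
}

// Invented sample input; the real failures.txt format may differ.
const sampleFailures = [
  "Could not fetch https://example.com/terms",
  "Could not filter https://example.com/privacy",
  "Could not fetch https://example.com/cookies",
].join("\n");

const failureCounts = countByPhrase(sampleFailures, ["Could not fetch", "Could not filter"]);
console.log(failureCounts);
```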
Focussing on one example:
loopy:TOSBack-CGUs-bridge michiel$ LAPTOP=1 ONLY=aeropostale.com RULES_ONLY=1 node --unhandled-rejections=strict tosback-import.js
(node:17781) ExperimentalWarning: The ESM module loader is experimental.
Filtering filenames for importCrawls, looking for aeropostale.com
crawl/aeropostale.com/Privacy Policy.txt Aeropostale Privacy Policy processing
crawl/aeropostale.com/Privacy Policy.txt Aeropostale Privacy Policy done
It turned out I was only detecting whether validation threw an error, not whether it returned failures. With that corrected, all the numbers now make sense:
sh stats.sh
592 processing
2 done
361 covered
14 Same type
107 Unsupported type
47 Cannot read
4 Found zero
5 Found two
41 inconsistent
6 too stort
7 not fetchable
0 selector not found
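The bug described above (treating a document as fine unless validation throws) can be illustrated with a small sketch. Everything here is hypothetical: `validate`, `isValidBuggy`, `isValidFixed`, and the 100-character threshold are stand-ins for illustration, not the actual validation code used by the import script.

```javascript
// Hypothetical validator: reports problems by returning a list of failure
// messages rather than by throwing.
function validate(doc) {
  const failures = [];
  if (doc.text.length < 100) failures.push("too short"); // threshold is an assumption
  return failures;
}

// Buggy check: only a thrown error counts, so returned failures are ignored.
function isValidBuggy(doc) {
  try {
    validate(doc);
    return true;
  } catch (err) {
    return false;
  }
}

// Corrected check: a document is valid only if nothing was thrown
// and the returned failure list is empty.
function isValidFixed(doc) {
  try {
    return validate(doc).length === 0;
  } catch (err) {
    return false;
  }
}

const shortDoc = { text: "tiny" };
console.log(isValidBuggy(shortDoc), isValidFixed(shortDoc)); // true false
```

With the buggy check, documents with returned failures end up counted as "done", which would fit the drop in that number after the fix.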
$ cat processing.txt | grep -oE '^(.*).txt' > processed.txt
loopy:TOSBack-CGUs-bridge michiel$ diff processed.txt crawl-files-list-current.txt
157a158
> crawl/google.com/Privacy Policy.txt
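The same comparison can be sketched in code: take the leading file path from each processing line and list the expected paths that never appear. The function names `extractPath` and `missingFrom` are assumptions made for illustration. Note that this sketch stops at the first ".txt" on a line (a lazy match), which is a deliberate difference from the greedy grep pattern above.

```javascript
// Hypothetical helpers, not part of the import script.

// Returns the leading file path of a log line, up to the first ".txt".
function extractPath(line) {
  const match = /^(.*?\.txt)/.exec(line);
  return match ? match[1] : null;
}

// Lists every expected path that does not occur in the processed lines.
function missingFrom(processedLines, expectedPaths) {
  const seen = new Set(processedLines.map(extractPath).filter(Boolean));
  return expectedPaths.filter((path) => !seen.has(path));
}

const processedSample = [
  'crawl/google.com/GOOGLE PRIVACY POLICY.txt Found 0 docname objects with name "GOOGLE PRIVACY POLICY" in ../../tosdr/tosback2/rules/google.com.xml',
];
const expectedSample = [
  "crawl/google.com/GOOGLE PRIVACY POLICY.txt",
  "crawl/google.com/Privacy Policy.txt",
];
const missingPaths = missingFrom(processedSample, expectedSample);
console.log(missingPaths);
```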
And:
$ sh stats.sh
0 done
363 covered
12 Same type
107 Unsupported type
48 Cannot read
4 Found zero
38 inconsistent
5 too stort
7 not fetchable
0 selector not found
7 no selector
363+12+107+48+4+38+5+7+7 = 591
git checkout 0d5997b
LAPTOP=1 RULES_ONLY=1 node --unhandled-rejections=strict tosback-import.js | sort > processing.txt
sh stats.sh
Outputs:
0 done
363 covered
12 Same type
107 Unsupported type
48 Cannot read
4 Found zero
39 inconsistent
6 too stort
5 not fetchable
7 selector not found
363+12+107+48+4+39+6+5+7 = 591
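To see which categories moved between the previous stats output and this one, the two outputs can be parsed and compared. This is a sketch under the assumption that each stats line has the form "count label". The helper names `parseStats` and `diffStats` are invented for illustration, and labels are copied verbatim from the output, spelling included.

```javascript
// Hypothetical helpers for comparing two runs of stats.sh.

// Parses lines of the form "<count> <label>" into a Map of label -> count.
function parseStats(output) {
  const stats = new Map();
  for (const line of output.trim().split("\n")) {
    const match = /^\s*(\d+)\s+(.+)$/.exec(line);
    if (match) stats.set(match[2].trim(), Number(match[1]));
  }
  return stats;
}

// Returns label -> (after - before) for every label whose count changed.
// A label missing from one run is treated as 0.
function diffStats(before, after) {
  const changes = {};
  const labels = new Set([...before.keys(), ...after.keys()]);
  for (const label of labels) {
    const delta = (after.get(label) || 0) - (before.get(label) || 0);
    if (delta !== 0) changes[label] = delta;
  }
  return changes;
}

const previousRun = parseStats(`
0 done
363 covered
12 Same type
107 Unsupported type
48 Cannot read
4 Found zero
38 inconsistent
5 too stort
7 not fetchable
0 selector not found
7 no selector
`);
const currentRun = parseStats(`
0 done
363 covered
12 Same type
107 Unsupported type
48 Cannot read
4 Found zero
39 inconsistent
6 too stort
5 not fetchable
7 selector not found
`);
const statChanges = diffStats(previousRun, currentRun);
console.log(statChanges);
```

The deltas cancel out, which matches both runs adding up to 591.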
It's reporting 425 successes, but I only see 332 rules, so 93 are missing (unless I'm not counting correctly).