ambanum / TOSBack-CGUs-bridge

Number of imported rules doesn't match number of successes reported #8

Closed michielbdejong closed 4 years ago

michielbdejong commented 4 years ago

It's reporting 425 successes, but I only see 332 rules, so 93 are missing (unless I'm not counting correctly).

michielbdejong commented 4 years ago

git checkout 75ec747
LAPTOP=1 RULES_ONLY=1 node --unhandled-rejections=strict tosback-import.js  | sort > processing.txt
diff processing.txt crawl-files-list-current.txt

Output:

28c28
< crawl/ampedwireless.com/Terms of Service.txt Found 0 docname objects with name "Terms of Service" in ../../tosdr/tosback2/rules/ampedwireless.com.xml
---
> crawl/ampedwireless.com/Terms of Service.txt
157,158c157,158
< crawl/google.com/GOOGLE PRIVACY POLICY.txt Found 0 docname objects with name "GOOGLE PRIVACY POLICY" in ../../tosdr/tosback2/rules/google.com.xml
< crawl/google.com/Privacy Policy.txt equivalent to crawl_reviewed/google.com/Privacy Policy.txt
---
> crawl/google.com/GOOGLE PRIVACY POLICY.txt
> crawl/google.com/Privacy Policy.txt
239,241d238
< crawl/nest.com/Privacy Notice.txt
< crawl/nest.com/Privacy Notice.txt Found 2 docname objects with name "Privacy Notice" in ../../tosdr/tosback2/rules/nest.com.xml
< crawl/nest.com/Terms of Service.txt
243d239
< crawl/nest.com/Terms of Service.txt Found 2 docname objects with name "Terms of Service" in ../../tosdr/tosback2/rules/nest.com.xml
411c407
< crawl/visible.com/Visible Service Terms & Conditions.txt Found 0 docname objects with name "Visible Service Terms & Conditions" in ../../tosdr/tosback2/rules/visible.com.xml
---
> crawl/visible.com/Visible Service Terms & Conditions.txt
419c415
< crawl/windstream.com/Term of Service.txt Found 0 docname objects with name "Term of Service" in ../../tosdr/tosback2/rules/windstream.com.xml
---
> crawl/windstream.com/Term of Service.txt
442,443d437
< crawl_reviewed/amazon.com/Amazon Device Terms of Use.txt
< crawl_reviewed/amazon.com/Amazon Device Terms of Use.txt Found 2 docname objects with name "Amazon Device Terms of Use" in ../../tosdr/tosback2/rules/amazon.com.xml
454,455d447
< crawl_reviewed/amazon.com/Conditions of Use.txt
< crawl_reviewed/amazon.com/Conditions of Use.txt Found 2 docname objects with name "Conditions of Use" in ../../tosdr/tosback2/rules/amazon.com.xml
504,505d495
< crawl_reviewed/facebook.com/Terms of Service.txt
< crawl_reviewed/facebook.com/Terms of Service.txt Found 2 docname objects with name "Terms of Service" in ../../tosdr/tosback2/rules/facebook.com.xml

So the hunks with only < lines are cases where it processes a given filePath twice, because the same document is mentioned twice in the rule ("Found 2"). And the hunks that also have > lines are mostly cases where the file exists but the rule uses a different name in the docname object ("Found 0").
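The double-processing cases can also be listed directly from the lines of processing.txt, without going through the diff. This is a minimal sketch, assuming each line starts with a file path ending in `.txt`, optionally followed by a status message, as in the output above. `findDuplicatePaths` is a hypothetical helper, not part of tosback-import.js.

```javascript
// Hypothetical helper (not part of tosback-import.js): report file paths
// that occur on more than one line of the sorted output.
// Assumption: each line starts with a path ending in ".txt", optionally
// followed by a space and a status message.
function findDuplicatePaths(lines) {
  const counts = new Map();
  for (const line of lines) {
    // Non-greedy match, so only the leading path is captured even if the
    // status message mentions a second ".txt" path.
    const match = line.match(/^(.*?\.txt)(?: |$)/);
    if (!match) continue; // skip lines that do not start with a path
    counts.set(match[1], (counts.get(match[1]) || 0) + 1);
  }
  return [...counts].filter(([, n]) => n > 1).map(([path]) => path);
}

// Usage with three lines taken from the diff output above.
const sample = [
  'crawl/nest.com/Privacy Notice.txt',
  'crawl/nest.com/Privacy Notice.txt Found 2 docname objects with name "Privacy Notice" in ../../tosdr/tosback2/rules/nest.com.xml',
  'crawl/google.com/Privacy Policy.txt equivalent to crawl_reviewed/google.com/Privacy Policy.txt',
];
console.log(findDuplicatePaths(sample)); // only the nest.com path is reported
```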

michielbdejong commented 4 years ago

Now seeing:

michielbdejong commented 4 years ago

After the fourth run:

sh stats.sh 
     592    processing
      64    done
     359    covered
      14    Same type
     107    Unsupported type
      48    Cannot read
       4    Found zero
       5    Found two

and

grep fetch ../CGUs/services/* | wc -l
     354

So it makes sense that 359 (covered) = 354 (fetch rules) + 5 (found two), but let me double-check that.

michielbdejong commented 4 years ago

loopy:TOSBack-CGUs-bridge michiel$ cat failures.txt | grep Could\ not\ fetch | wc -l
      38
loopy:TOSBack-CGUs-bridge michiel$ cat failures.txt | grep Could\ not\ filter | wc -l
      13

michielbdejong commented 4 years ago

Focussing on one example:

loopy:TOSBack-CGUs-bridge michiel$ LAPTOP=1 ONLY=aeropostale.com RULES_ONLY=1 node --unhandled-rejections=strict tosback-import.js 
(node:17781) ExperimentalWarning: The ESM module loader is experimental.
Filtering filenames for importCrawls, looking for aeropostale.com
crawl/aeropostale.com/Privacy Policy.txt Aeropostale Privacy Policy processing
crawl/aeropostale.com/Privacy Policy.txt Aeropostale Privacy Policy done

michielbdejong commented 4 years ago

It turned out I was only detecting whether validation errored, not whether it returned failures. With that corrected, all the numbers now make sense:


sh stats.sh 
     592    processing
       2    done
     361    covered
      14    Same type
     107    Unsupported type
      47    Cannot read
       4    Found zero
       5    Found two
      41    inconsistent
       6    too stort
       7    not fetchable
       0    selector not found
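
The difference between "validation errored" and "validation returned failures" can be sketched as follows. The `validate` function and the shape of its result (`{ failures: [...] }`) are assumptions made for illustration. They are not the actual validation API used by tosback-import.js.

```javascript
// Hypothetical validator: throws only when it cannot run at all, and
// otherwise returns a (possibly non-empty) list of failures.
function validate(rule) {
  if (rule === null) throw new Error('cannot validate');
  const failures = [];
  if (!rule.selector) failures.push('no selector');
  return { failures };
}

// Only detects a thrown error, so returned failures are reported as "done".
function statusIgnoringFailures(rule) {
  try {
    validate(rule);
    return 'done';
  } catch (e) {
    return 'errored';
  }
}

// Detects both a thrown error and returned failures.
function statusCheckingFailures(rule) {
  try {
    const { failures } = validate(rule);
    return failures.length === 0 ? 'done' : failures.join(', ');
  } catch (e) {
    return 'errored';
  }
}

console.log(statusIgnoringFailures({})); // "done", although a failure was returned
console.log(statusCheckingFailures({})); // "no selector"
```

Under this assumption, a rule that validates with failures is counted as `done` by the first check and moves into one of the failure categories with the second.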

michielbdejong commented 4 years ago

$ cat processing.txt | grep -oE '^(.*).txt' > processed.txt
loopy:TOSBack-CGUs-bridge michiel$ diff processed.txt crawl-files-list-current.txt 
157a158
> crawl/google.com/Privacy Policy.txt

And:

$ sh stats.sh 
       0    done
     363    covered
      12    Same type
     107    Unsupported type
      48    Cannot read
       4    Found zero
      38    inconsistent
       5    too stort
       7    not fetchable
       0    selector not found
       7    no selector

363+12+107+48+4+38+5+7+7 = 591
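
The sum can also be checked mechanically. A small sketch with the counts copied from the `stats.sh` output above; the object literal is only for illustration and is not produced by the tool.

```javascript
// Counts copied by hand from the stats.sh output above (labels as printed).
const counts = {
  'done': 0,
  'covered': 363,
  'Same type': 12,
  'Unsupported type': 107,
  'Cannot read': 48,
  'Found zero': 4,
  'inconsistent': 38,
  'too stort': 5,
  'not fetchable': 7,
  'selector not found': 0,
  'no selector': 7,
};

// Add up all categories.
const total = Object.values(counts).reduce((sum, n) => sum + n, 0);
console.log(total); // 591
```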

michielbdejong commented 4 years ago

git checkout 0d5997b
LAPTOP=1 RULES_ONLY=1 node --unhandled-rejections=strict tosback-import.js  | sort > processing.txt
sh stats.sh

Outputs:

       0    done
     363    covered
      12    Same type
     107    Unsupported type
      48    Cannot read
       4    Found zero
      39    inconsistent
       6    too stort
       5    not fetchable
       7    selector not found

363+12+107+48+4+39+6+5+7 = 591