TeamHG-Memex / Formasaurus

Formasaurus tells you the type of an HTML form and its fields using machine learning

formasaurus check-data fails locally #17

Open kmike opened 7 years ago

kmike commented 7 years ago

formasaurus check-data fails on my machine (but not on Travis):

Checking:  16%|####                      | 149/954 [00:00<00:05, 137.76 files/s]
Invalid form count for entry 'html/ddl-warez.in-0.html': expected 0, got 1
Invalid number of form field annotations for entry 'html/ddl-warez.in-0.html'
Checking:  59%|###############4          | 567/954 [00:03<00:01, 222.76 files/s]
Invalid form count for entry 'html/cafephim.vn-1.html': expected 0, got 2
Invalid number of form field annotations for entry 'html/cafephim.vn-1.html'
Checking:  77%|####################      | 736/954 [00:05<00:01, 209.64 files/s]
Invalid form count for entry 'html/postr.hu-2.html': expected 0, got 6
Invalid number of form field annotations for entry 'html/postr.hu-2.html'
Checking:  78%|####################1     | 740/954 [00:05<00:01, 207.79 files/s]
Invalid form count for entry 'html/www.elandroidelibre.com-0.html': expected 0, got 1
Invalid number of form field annotations for entry 'html/www.elandroidelibre.com-0.html'
Checking:  99%|#########################6| 942/954 [00:06<00:00, 252.47 files/s]
Invalid form count for entry 'html/postr.hu-1.html': expected 0, got 6
Invalid number of form field annotations for entry 'html/postr.hu-1.html'
Checking: 100%|##########################| 954/954 [00:06<00:00, 132.87 files/s]
Status: 10 error(s) found

kmike commented 7 years ago

This seems to be a problem with the libxml2 version: locally I have libxml2 2.9.4, which parses some of the files incorrectly (it keeps only the contents of the NOSCRIPT tags). Ubuntu 14.04 has libxml2 2.9.1.
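
For anyone who wants to compare their environment against the versions mentioned above, here is a minimal sketch that prints the libxml2 version lxml is using and counts the forms lxml finds in a small document. It assumes Formasaurus parses pages through lxml; the helper names (`format_version`, `libxml2_versions`, `count_forms`) and the sample markup are made up for illustration and are not part of Formasaurus or its dataset. It is not guaranteed to reproduce the failure.

```python
def format_version(version):
    """Turn a version tuple such as (2, 9, 4) into the string '2.9.4'."""
    return ".".join(str(part) for part in version)


def libxml2_versions():
    """Return (runtime, compiled) libxml2 versions seen by lxml.

    Returns None when lxml is not installed.
    """
    try:
        from lxml import etree
    except ImportError:
        return None
    return (
        format_version(etree.LIBXML_VERSION),           # library loaded at runtime
        format_version(etree.LIBXML_COMPILED_VERSION),  # library lxml was built against
    )


def count_forms(html):
    """Count the <form> elements lxml's HTML parser finds in a full document."""
    import lxml.html
    return len(lxml.html.document_fromstring(html).xpath("//form"))


# Hypothetical sample, NOT taken from the Formasaurus dataset: a page with a
# NOSCRIPT block and a single form.
SAMPLE_HTML = (
    "<html><head><noscript><img src='x.gif'></noscript></head>"
    "<body><form action='/search'><input name='q'></form></body></html>"
)

if __name__ == "__main__":
    versions = libxml2_versions()
    if versions is None:
        print("lxml is not installed")
    else:
        print("libxml2 runtime: %s, compiled: %s" % versions)
        print("forms found in sample:", count_forms(SAMPLE_HTML))
```

Running this on both the local machine and the CI machine and comparing the output should show whether the two environments use different libxml2 builds.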