Closed geoffmcl closed 6 years ago
Test t/ignoretext.t
presents a difficult test case to fix! Maybe because I do not understand the purpose of the test...
First HTML::Tidy
has its own API for ignoring
certain warning messages... no particular problem with that... like it uses -
$tidy->ignore( text => qr/bogotag/ );
$tidy->ignore( text => [ qr/UNESCAPED/, qr/doctype/i ] );
If we examine part of the html string presented to libtidy...
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html;CHARSET=iso-8859-1">
<META NAME="Author" Content="Andy Lester">
<TITLE>petdance.com: Andy Lester's Programming & Writing</TITLE>
...
</HEAD>
<BODY BGCOLOR="white">
<BOGOTAG>
<IMG SRC="/pix/petdance-logo-400x312.gif" HEIGHT=312 WIDTH=400 ALT="Andy & Amy's Pet Supplies & Dance Instruction" ALIGN=RIGHT>
...
Obviously Tidy 5.5.27 will really bark when it finds <BOGOTAG>
, and outputs -
line 23 column 1 - Error: <bogotag> is not recognized!
Since this is an error, then tidy will not generate output, without --force-output yes
... but as previously touched, this option can be added...
So, the problem becomes, what is being tested here?
As mentioned in commit c446e618, we can remove this force-output
as unsupported, and add this option to the config... like my $tidy = HTML::Tidy->new( { force_output => 1 } );
, like in another test case...
But still the question - "What is being tested here?"
Now current tidy has new options like new-blocklevel-tags
, new-empty-tags
, new-inline-tags
, new-pre-tags
, combined with or without the new custom-tags
option, so such a <bogotag>
can be added - Tidy will still warn it is not W3C approved, but would keep it...
So again, I just do not understand - "What is being tested here?" - so have no way to fix this test...
Feedback welcome... thanks...
And this problem is repeated in the next test t/ignore.t
... so will probably have the same or similar solution...
And maybe t/levels.t
... except now Tidy 5.5.27 does not warn about unambiguous &
chars... so like <h2>About Andy & Amy</h2>
will be accepted, as in the html5 spec...
except now Tidy 5.5.27 does not warn about unambiguous & chars.
Is that configurable? I'd hate to lose that as a check.
@petdance well that is true, as part of the HTML5 specs...
But if you want to still see that particular warning, then you need to add a DOCTYPE
to indicate that it is a legacy document, like -
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>562-amp</title>
</head>
<body>
<h1>A & B</h1>
</body>
</html>
Now Tidy will output -
line 9 column 7 - Warning: unescaped & which should be written as &
Tidy found 1 warning and 0 errors!
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Windows version 5.5.27">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>562-amp</title>
</head>
<body>
<h1>A & B</h1>
</body>
</html>
The point is, tidy starts in HTML5 mode, and will treat the html accordingly... So remove the DOCTYPE
from that sample, and you will get -
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
Tidy found 1 warning and 0 errors!
<!DOCTYPE html>
<html>
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Windows version 5.5.27">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>562-amp</title>
</head>
<body>
<h1>A & B</h1>
</body>
</html>
So it seems just a case of adjusting the sample to what you want to test...
And that's just what I did to fix roundtrip.t
, which is a very good test... It tests that Tidy's output should be free of warnings... clean...
Since it is only a fragment html, then it will be treated as HTML5, so adjusted the expected to html5 output...
And used the option --tidy-mark no
to remove the addition of that meta, and this saves trying to manipulate the cleaned
text for the compare...
Seems it now passes... see the above commit aec1d97...
@petdance and the addition of tidy_mark => 0
simplified t/unicode.t
, another very good test...
Seems the above commit fixes it...
@petdance just adjusted config a little, and the expects, and t/venus.t
passes...
@petdance had a look at t/version.t
but will leave its correction to you, since maybe you could change say HTML::Tidy->tidyp_version
to just HTML::Tidy->tidy_version
or something...
And is it necessary to have two services to do the same thing? We do also have a tidyReleaseDate()
, but it is marked as Deprecated:, although I do not think we are going to remove it any time soon...
Of course at present tidyLibraryVersion()
will return /^\d+\.\d+\.\d+/
, something like semantic version... major.minor.patch
... these values come from the version.txt file in the source...
But also if the library is built with a RC version. like -DTIDY_RC_NUMBER=I234
, then it can become say 5.5.27.I234
... ie the RC string is added, what ever it is defined as, and does not have to actually be a number
... just that any test
should not automatically assume the string ends after the 3rd patch
number...
I can fix t/version.t
if you like, but would appreciate some feedback on the above points... thanks...
simple.t
seems fixed by just changing warning count from 5 to 6...
Test t/too-many-title.t
, removed the non-html5 bidy attribute, BGCOLOR, not really part of the test, and adjust the warning and it passes...
This maybe exposes a bug in tidy 5.5.27, in that the warning message now looks bad -
Was, which looks good -
- (4:9) Warning: too many title elements in <head>
Changed to something that does not look right -
- (4:9) Warning: too many title elements in <title>
Must remember to file a HTML Tidy issue for this...
Also tidy leaves the two titles there... is this valid html? Need to check this with validator, and maybe raise another tidy issue...
But for now this test passes... but may need to be changed when tidy is fixed...
@petdance seems test t/unicode-nbsp.t
can be fixed by adding the option newline => 'Lf'
...
This is testing in Windows, so assume unix will have no problem with this...
See above commit 15469d4
This is the end of my fun day... be back soonest...
This commit 9095696 is hopefully just a temporary fix of my convenient build-me.bat
file!
Just until such time ExtUtils::MakeMaker
can be convinced to not add -nodefaultlib
to the linker options in Windows. This option causes the link of Tidy.dll
to fail...
Have written a small script modmakefile.pl
to manually delete this option... and write an ammended Makefile
... this is run through a batch file modmakefile.bat
, which is in my PATH
, and runs the script...
As indicated, I hope this is temporary... there must be a better way to modify the MSVC Windows linker
options added to the Makefile
generated...
Current Tidy has changed the summary
message... Presently this is ignored in Tidy.pm
...
The commit 658eaba seem to remove the Carp
message...
Maybe there should be a way to externalized
this message matching, since it can change with tidy versions...
Test wordwrap.t
has no doctype
, so adjust the expected to html5 output...
Commit 74af83d show this is really only adjusting the now default
html5 doctype emitted...
@petdance now most tests pass, in Windows, but the Summery
shows 4, maybe 5?, still fail!
I have already spoken about 4., t/version.txt
here... I can fix it, but seek some feedback...
The other three, 1..3
, seem to use the same input html, which contains an unknown tag, <BOGOTAG>
, which will be flagged as an error, and there will be no output, unless force_out => 1
is added to the config...
But as indicated, I am unsure of the purpose of these tests, at least the first 2, so hesitate to suggest a fix...
As discussed here 3. could test the conversion of an unambiguous &
to &
by adding a legacy doctype
using the small sample html given...
Further, although 5., test t/segfault-form.t
seems to pass, shows ok
, but I do not think it should! It contains html that tidy marks as an error, so there will be no cleaned
output, which leads to a carp -
Use of uninitialized value $newline in regexp compilation at F:\Projects\html-tidy-pet-fork\blib\lib/HTML/Tidy.pm line 242, <DATA> line 1.
Use of uninitialized value $errs in split at F:\Projects\html-tidy-pet-fork\blib\lib/HTML/Tidy.pm line 242, <DATA> line 1.
As discussed in a comment a few days ago, where I suggest a small problem in Tidy.xs, around line 163... effectively, the _tidy_clean
service must always return three values, as shown in Tidy,pm -
my ($cleaned, $errbuf, $newline) = _tidy_clean( $text, ... );
But there are situations, especially when there is a TIDY_ERROR, where $errbuf
and $newline
can be left uninitialized, and maybe $cleaned
would actually contain the $newline
value... not sure...
Adding the option force_output => 1
can hide the problem, but these variables should have values in all cases... but I am not too familiar with coding like XPUSHs( sv_2mortal(newSVpv(char *, len)) );
... so need help...
I get especially confused when I read the code XPUSHs( sv_2mortal(newSVpv(newline, 0)) );
, in that at that moment newline
will contain 1 or 2 characters... How can the length be ZERO? But maybe I do not understand something here...
Anyway, seek help, feedback to close these 5 cases... thanks...
Anyway, seek help, feedback to close these 5 cases... thanks...
I'm not at all clear on what it is you're trying to do. Are you trying to make html-tidy use the new library? What are you asking for help on?
@petdance sorry if something is not clear. Specifically -
I'm not at all clear on what it is you're trying to do.
As the title of this issue states Release an updated HTML::Tidy perl library.
What is unclear about that? This would be effectively closing your Issue 25
Are you trying to make html-tidy use the new library?
Exactly... it is using HTML Tidy next, currently at 5.5.27... as shown in all outputs...
We may want to back up to the last release 5.4, and have a master
branch html-tidy
matching that, and then say the next
branch matching 5.5.27-plus... or something...
How do you handle versioning of html-tidy
?
What are you asking for help on?
ignore-text.t
and ignore.t
?levels.t
is to test &
conversion, can we change it to the simple sample given, with a doctype?get version
functions, and should we remove the p
?_tidy_clean
to always return 3, or XSRETURN_UNDEF in extreem cases?That's it folks! ;=)) Thanks...
It sounds like your plan is to hand all this work off to me as an update to the Perl HTML::Tidy module. If that's the case, I appreciate the help and enthusiasm, and you may want to hold off.
I haven't looked at the new tidy at all. Zero. I don't know that I want/can/should simply switch to a new library. That is one potential solution, but I don't know that it's what I'm going to be doing in the end.
For starters, I need to know what the differences are between the two libraries, and what compatibility between them is going to be like and so on. I may look at it this weekend, but I may not.
I just don't want you to spend time on something that I might wind up not using.
@petdance zute, it sounds like my help was not desirable... no problem... in FOSS we each do what we want, when we want...
...hand all this work off to me...
Never! Absolute Rubbish! What work?
I was seeking your opinion as to how to deal with certain very specific problems/issues, if the perl HTML::Tidy
was updated to the current HTML Tidy library... but if you have no interest in that, then no problem...
Wow, whether you use my upgrade or not, is no problem... that is entirely your choice...
So now I am getting the idea that I should publish this as HTML/Tidy5
... obviously the 5
stands for support HTML5..
All I want if a CPAN Module that supports HTML5... use HTML/Tidy5
is all you need...
You can look at what you want
, over many weekends... no pressure... no problem...
Me, I will now proceed to solve the last 5 issues, without your feedback... again no problem...
Excuse me, I will use my time, as I see fit... thankful that the fork free software license allows this...
You might not understand the differences are between the two libraries, and not want to, but some of us do... sorry for your loss... I really tried to help...
Tomorrow. or soonest, with or without your help, will solve the last 5 Windows test problems...
And with those solved will move towards a similar unix solution...
Really sorry we could not work together on this... but no problem... my fork exists only for people who want to use it... just for fun... FOSS...
it sounds like my help was not desirable...
I'm sorry if it sounded like that. That's not the case at all.
I'm just saying that I haven't had time to look at anything you've done over the past two days. I was just trying to set expectations.
I very much want to get the new Tidy in Perl bindings, too. I just don't know how yet.
@petdance right now all tests PASS, in Windows...
Microsoft (R) Program Maintenance Utility Version 14.00.24210.0
Copyright (C) Microsoft Corporation. All rights reserved.
C:\Perl64\bin\perl.exe "-MExtUtils::Command::MM" "-MTest::Harness" "-e" "undef *Test::Harness::Switches; test_harness(0, 'blib\lib', 'blib\arch')" t/*.t
# Testing HTML::Tidy 1.58, tidyp 5.5.27, Perl 5.014002, C:\Perl64\bin\perl.exe
t/00-load.t .......... ok
t/cfg-for-parse.t .... ok
t/clean-crash.t ...... ok
t/clean.t ............ ok
t/extra-quote.t ...... ok
t/ignore-text.t ...... ok
t/ignore.t ........... ok
t/illegal-options.t .. ok
t/levels.t ........... ok
t/message.t .......... ok
t/opt-00.t ........... ok
t/parse-crash.t ...... ok
HTML::Tidy: Unknown error type: FAKE_ERROR_TYPE at t/parse-errors.t line 18
t/parse-errors.t ..... ok
t/parse.t ............ ok
t/perfect.t .......... ok
t/pod-coverage.t ..... skipped: Test::Pod::Coverage 1.04 required for testing POD coverage
t/pod.t .............. skipped: Test::Pod 1.14 required for testing POD
t/roundtrip.t ........ ok
t/segfault-form.t .... ok
t/simple.t ........... ok
t/too-many-titles.t .. ok
t/unicode-nbsp.t ..... ok
t/unicode.t .......... ok
t/venus.t ............ ok
t/version.t .......... ok
t/wordwrap.t ......... ok
All tests successful.
Files=26, Tests=78, 2 wallclock secs ( 0.03 usr + 0.11 sys = 0.14 CPU)
Result: PASS
But, as mentioned above, did not fully understand what some of the test were testing, so some fixes could be suspect...
Like adding the option force_output => 1
to remove the carp from test t/segfault-form.t. As discussed above, I feel there should be a deeper fix done in the code, but I do not presently have the skill to do that... And adding a DOCTYPE
to at least 3 test to force legacy document parsing, to get the expected warnings... etc...
As the test1
branch here presently stands, it is 45 commits ahead of the upstream petdance:dev... and thus could be easily merged... maybe into a new-html5
branch, as there are already several branches with this like name, but I have no idea of their state...
Anyway, I am very happy that I have an installed and working HTML::Tidy module, in Windows, using libTidy
5.5.27 ;=))
I guess the next steps, if there are any - no
expectations, rests with what you, or others want to do... thanks...
As stated before this is a great perl wrapper around current libTidy
!
After a lot of work, way back then, did get a Tidy.dll
built and installed in my perl
installation, and got a working HTML::Tidy
, using library tidy version 5.5.27
Directory of C:\Perl64\site\lib\auto\HTML\Tidy
2017-05-27 14:26 599,552 Tidy.dll
The following was my 'test', that worked with or without using the config file -
#!/usr/bin/perl -w
use strict;
use warnings;
use HTML::Tidy;
my $use_conf_file = 1;
my ($tidy);
if ($use_conf_file) {
#my $txt = "show-body-only: 1\ntidy-mark: 0\n";
my $txt = "indent: 1\nwrap: 0\nshow-info: 0\n";
my $file = 'temptidy.cfg';
open WOF, ">$file" or die "ERROR: Unable to open $file! $!\n";
print WOF $txt;
close WOF;
$tidy = HTML::Tidy->new( {'config_file' => $file} );
} else {
#my $args = { show_body_only => 1,
# show_info => 0 };
my $args = { indent => 1,
wrap => 0,
tidy_mark => 0,
show_info => 0 };
$tidy = HTML::Tidy->new( $args );
}
if (! $tidy) {
die "ERROR: Failed to load HTML::Tidy ... $! ...\n";
}
my $vers = $tidy->tidyp_version();
if ($vers) {
print "Using version $vers\n";
}
my $doc = <<"EOF;";
<p>Hello Tidy!</p>
EOF;
#my $rc = $tidy->parse( '-', $doc );
#print "$clean\n";
my $clean = $tidy->clean( $doc );
print "$clean\n";
my @msg = $tidy->messages();
if (@msg) {
my $cnt = scalar @msg;
print "Have $cnt messages...\n";
print join("\n",@msg)."\n";
}
# tidy04.pl - eof
So belatedly closing this...
As repeatedly stated thanks for this great perl wrapper around the current libtidy...
This mirrors Issue 562 in the @HTACG tidy-html5 repo... and Issue 25 in the @petdance html-tidy repo...
As present working on the Windows x64 build, in the
test1
branch here...