geoffmcl / html-tidy

Fork of HTML::Tidy, a Perl wrapper about libtidy
http://search.cpan.org/dist/HTML-Tidy
0 stars 0 forks source link

Release an updated HTML::Tidy perl library #1

Closed geoffmcl closed 6 years ago

geoffmcl commented 7 years ago

This mirrors Issue 562 in the @HTACG tidy-html5 repo... and Issue 25 in the @petdance html-tidy repo...

As present working on the Windows x64 build, in the test1 branch here...

geoffmcl commented 7 years ago

Test t/ignoretext.t presents a difficult test case to fix! Maybe because I do not understand the purpose of the test...

First HTML::Tidy has its own API for ignoring certain warning messages... no particular problem with that... like it uses -

    $tidy->ignore( text => qr/bogotag/ );
    $tidy->ignore( text => [ qr/UNESCAPED/, qr/doctype/i ] );

If we examine part of the html string presented to libtidy...

<HTML>
<HEAD>
    <META HTTP-EQUIV="Content-Type" CONTENT="text/html;CHARSET=iso-8859-1">
    <META NAME="Author" Content="Andy Lester">
        <TITLE>petdance.com: Andy Lester's Programming &amp; Writing</TITLE>
...
</HEAD>
<BODY BGCOLOR="white">
<BOGOTAG>
    <IMG SRC="/pix/petdance-logo-400x312.gif" HEIGHT=312 WIDTH=400 ALT="Andy & Amy's Pet Supplies & Dance Instruction" ALIGN=RIGHT>
...

Obviously Tidy 5.5.27 will really bark when it finds <BOGOTAG>, and outputs -

line 23 column 1 - Error: <bogotag> is not recognized!

Since this is an error, then tidy will not generate output, without --force-output yes... but as previously touched, this option can be added...

So, the problem becomes, what is being tested here?

As mentioned in commit c446e618, we can remove this force-output as unsupported, and add this option to the config... like my $tidy = HTML::Tidy->new( { force_output => 1 } );, like in another test case...

But still the question - "What is being tested here?"

Now current tidy has new options like new-blocklevel-tags, new-empty-tags, new-inline-tags, new-pre-tags, combined with or without the new custom-tags option, so such a <bogotag> can be added - Tidy will still warn it is not W3C approved, but would keep it...

So again, I just do not understand - "What is being tested here?" - so have no way to fix this test...

Feedback welcome... thanks...

geoffmcl commented 7 years ago

And this problem is repeated in the next test t/ignore.t... so will probably have the same or similar solution...

geoffmcl commented 7 years ago

And maybe t/levels.t ... except now Tidy 5.5.27 does not warn about unambiguous & chars... so like <h2>About Andy & Amy</h2> will be accepted, as in the html5 spec...

petdance commented 7 years ago

except now Tidy 5.5.27 does not warn about unambiguous & chars.

Is that configurable? I'd hate to lose that as a check.

geoffmcl commented 7 years ago

@petdance well that is true, as part of the HTML5 specs...

But if you want to still see that particular warning, then you need to add a DOCTYPE to indicate that it is a legacy document, like -

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
            "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>562-amp</title>
</head>
<body>
<h1>A & B</h1>
</body>
</html>

Now Tidy will output -

line 9 column 7 - Warning: unescaped & which should be written as &amp;
Tidy found 1 warning and 0 errors!

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Windows version 5.5.27">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>562-amp</title>
</head>
<body>
<h1>A &amp; B</h1>
</body>
</html>

The point is, tidy starts in HTML5 mode, and will treat the html accordingly... So remove the DOCTYPE from that sample, and you will get -

line 1 column 1 - Warning: missing <!DOCTYPE> declaration
Tidy found 1 warning and 0 errors!

<!DOCTYPE html>
<html>
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Windows version 5.5.27">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>562-amp</title>
</head>
<body>
<h1>A & B</h1>
</body>
</html>

So it seems just a case of adjusting the sample to what you want to test...

geoffmcl commented 7 years ago

And that's just what I did to fix roundtrip.t, which is a very good test... It tests that Tidy's output should be free of warnings... clean...

Since it is only a fragment html, then it will be treated as HTML5, so adjusted the expected to html5 output...

And used the option --tidy-mark no to remove the addition of that meta, and this saves trying to manipulate the cleaned text for the compare...

Seems it now passes... see the above commit aec1d97...

geoffmcl commented 7 years ago

@petdance and the addition of tidy_mark => 0 simplified t/unicode.t, another very good test...

Seems the above commit fixes it...

geoffmcl commented 7 years ago

@petdance just adjusted config a little, and the expects, and t/venus.t passes...

geoffmcl commented 7 years ago

@petdance had a look at t/version.t but will leave its correction to you, since maybe you could change say HTML::Tidy->tidyp_version to just HTML::Tidy->tidy_version or something...

And is it necessary to have two services to do the same thing? We do also have a tidyReleaseDate(), but it is marked as Deprecated:, although I do not think we are going to remove it any time soon...

Of course at present tidyLibraryVersion() will return /^\d+\.\d+\.\d+/, something like semantic version... major.minor.patch... these values come from the version.txt file in the source...

But also if the library is built with a RC version. like -DTIDY_RC_NUMBER=I234, then it can become say 5.5.27.I234... ie the RC string is added, what ever it is defined as, and does not have to actually be a number... just that any test should not automatically assume the string ends after the 3rd patch number...

I can fix t/version.t if you like, but would appreciate some feedback on the above points... thanks...

geoffmcl commented 7 years ago

simple.t seems fixed by just changing warning count from 5 to 6...

geoffmcl commented 7 years ago

Test t/too-many-title.t, removed the non-html5 bidy attribute, BGCOLOR, not really part of the test, and adjust the warning and it passes...

This maybe exposes a bug in tidy 5.5.27, in that the warning message now looks bad -

Was, which looks good -
- (4:9) Warning: too many title elements in <head>
Changed to something that does not look right -
- (4:9) Warning: too many title elements in <title>

Must remember to file a HTML Tidy issue for this...

Also tidy leaves the two titles there... is this valid html? Need to check this with validator, and maybe raise another tidy issue...

But for now this test passes... but may need to be changed when tidy is fixed...

geoffmcl commented 7 years ago

@petdance seems test t/unicode-nbsp.t can be fixed by adding the option newline => 'Lf'...

This is testing in Windows, so assume unix will have no problem with this...

See above commit 15469d4

This is the end of my fun day... be back soonest...

geoffmcl commented 7 years ago

This commit 9095696 is hopefully just a temporary fix of my convenient build-me.bat file!

Just until such time ExtUtils::MakeMaker can be convinced to not add -nodefaultlib to the linker options in Windows. This option causes the link of Tidy.dll to fail...

Have written a small script modmakefile.pl to manually delete this option... and write an ammended Makefile... this is run through a batch file modmakefile.bat, which is in my PATH, and runs the script...

As indicated, I hope this is temporary... there must be a better way to modify the MSVC Windows linker options added to the Makefile generated...

geoffmcl commented 7 years ago

Current Tidy has changed the summary message... Presently this is ignored in Tidy.pm...

The commit 658eaba seem to remove the Carp message...

Maybe there should be a way to externalized this message matching, since it can change with tidy versions...

geoffmcl commented 7 years ago

Test wordwrap.t has no doctype, so adjust the expected to html5 output...

Commit 74af83d show this is really only adjusting the now default html5 doctype emitted...

geoffmcl commented 7 years ago

@petdance now most tests pass, in Windows, but the Summery shows 4, maybe 5?, still fail!

  1. t/ignore-text.t
  2. t/ignore.t
  3. t/levels.t
  4. t/version.t
  5. t/segfault-form.t

I have already spoken about 4., t/version.txt here... I can fix it, but seek some feedback...

The other three, 1..3, seem to use the same input html, which contains an unknown tag, <BOGOTAG>, which will be flagged as an error, and there will be no output, unless force_out => 1 is added to the config...

But as indicated, I am unsure of the purpose of these tests, at least the first 2, so hesitate to suggest a fix...

As discussed here 3. could test the conversion of an unambiguous & to &amp; by adding a legacy doctype using the small sample html given...

Further, although 5., test t/segfault-form.t seems to pass, shows ok, but I do not think it should! It contains html that tidy marks as an error, so there will be no cleaned output, which leads to a carp -

Use of uninitialized value $newline in regexp compilation at F:\Projects\html-tidy-pet-fork\blib\lib/HTML/Tidy.pm line 242, <DATA> line 1.
Use of uninitialized value $errs in split at F:\Projects\html-tidy-pet-fork\blib\lib/HTML/Tidy.pm line 242, <DATA> line 1.

As discussed in a comment a few days ago, where I suggest a small problem in Tidy.xs, around line 163... effectively, the _tidy_clean service must always return three values, as shown in Tidy,pm -

 my ($cleaned, $errbuf, $newline) = _tidy_clean( $text, ... );

But there are situations, especially when there is a TIDY_ERROR, where $errbuf and $newline can be left uninitialized, and maybe $cleaned would actually contain the $newline value... not sure...

Adding the option force_output => 1 can hide the problem, but these variables should have values in all cases... but I am not too familiar with coding like XPUSHs( sv_2mortal(newSVpv(char *, len)) );... so need help...

I get especially confused when I read the code XPUSHs( sv_2mortal(newSVpv(newline, 0)) );, in that at that moment newline will contain 1 or 2 characters... How can the length be ZERO? But maybe I do not understand something here...

Anyway, seek help, feedback to close these 5 cases... thanks...

petdance commented 7 years ago

Anyway, seek help, feedback to close these 5 cases... thanks...

I'm not at all clear on what it is you're trying to do. Are you trying to make html-tidy use the new library? What are you asking for help on?

geoffmcl commented 7 years ago

@petdance sorry if something is not clear. Specifically -

I'm not at all clear on what it is you're trying to do.

As the title of this issue states Release an updated HTML::Tidy perl library.

What is unclear about that? This would be effectively closing your Issue 25

Are you trying to make html-tidy use the new library?

Exactly... it is using HTML Tidy next, currently at 5.5.27... as shown in all outputs...

We may want to back up to the last release 5.4, and have a master branch html-tidy matching that, and then say the next branch matching 5.5.27-plus... or something...

How do you handle versioning of html-tidy?

What are you asking for help on?

  1. What is the purpose tests ignore-text.t and ignore.t?
  2. If levels.t is to test & conversion, can we change it to the simple sample given, with a doctype?
  3. Do we need two get version functions, and should we remove the p?
  4. How to write Tidy.xs _tidy_clean to always return 3, or XSRETURN_UNDEF in extreem cases?

That's it folks! ;=)) Thanks...

petdance commented 7 years ago

It sounds like your plan is to hand all this work off to me as an update to the Perl HTML::Tidy module. If that's the case, I appreciate the help and enthusiasm, and you may want to hold off.

I haven't looked at the new tidy at all. Zero. I don't know that I want/can/should simply switch to a new library. That is one potential solution, but I don't know that it's what I'm going to be doing in the end.

For starters, I need to know what the differences are between the two libraries, and what compatibility between them is going to be like and so on. I may look at it this weekend, but I may not.

I just don't want you to spend time on something that I might wind up not using.

geoffmcl commented 7 years ago

@petdance zute, it sounds like my help was not desirable... no problem... in FOSS we each do what we want, when we want...

...hand all this work off to me...

Never! Absolute Rubbish! What work?

I was seeking your opinion as to how to deal with certain very specific problems/issues, if the perl HTML::Tidy was updated to the current HTML Tidy library... but if you have no interest in that, then no problem...

Wow, whether you use my upgrade or not, is no problem... that is entirely your choice...

So now I am getting the idea that I should publish this as HTML/Tidy5... obviously the 5 stands for support HTML5..

All I want if a CPAN Module that supports HTML5... use HTML/Tidy5 is all you need...

You can look at what you want, over many weekends... no pressure... no problem...

Me, I will now proceed to solve the last 5 issues, without your feedback... again no problem...

Excuse me, I will use my time, as I see fit... thankful that the fork free software license allows this...

You might not understand the differences are between the two libraries, and not want to, but some of us do... sorry for your loss... I really tried to help...

Tomorrow. or soonest, with or without your help, will solve the last 5 Windows test problems...

And with those solved will move towards a similar unix solution...

Really sorry we could not work together on this... but no problem... my fork exists only for people who want to use it... just for fun... FOSS...

petdance commented 7 years ago

it sounds like my help was not desirable...

I'm sorry if it sounded like that. That's not the case at all.

I'm just saying that I haven't had time to look at anything you've done over the past two days. I was just trying to set expectations.

I very much want to get the new Tidy in Perl bindings, too. I just don't know how yet.

geoffmcl commented 7 years ago

@petdance right now all tests PASS, in Windows...

Microsoft (R) Program Maintenance Utility Version 14.00.24210.0
Copyright (C) Microsoft Corporation.  All rights reserved.

    C:\Perl64\bin\perl.exe "-MExtUtils::Command::MM" "-MTest::Harness" "-e" "undef *Test::Harness::Switches; test_harness(0, 'blib\lib', 'blib\arch')" t/*.t
# Testing HTML::Tidy 1.58, tidyp 5.5.27, Perl 5.014002, C:\Perl64\bin\perl.exe
t/00-load.t .......... ok
t/cfg-for-parse.t .... ok
t/clean-crash.t ...... ok
t/clean.t ............ ok
t/extra-quote.t ...... ok
t/ignore-text.t ...... ok
t/ignore.t ........... ok
t/illegal-options.t .. ok
t/levels.t ........... ok
t/message.t .......... ok
t/opt-00.t ........... ok
t/parse-crash.t ...... ok
HTML::Tidy: Unknown error type: FAKE_ERROR_TYPE at t/parse-errors.t line 18
t/parse-errors.t ..... ok
t/parse.t ............ ok
t/perfect.t .......... ok
t/pod-coverage.t ..... skipped: Test::Pod::Coverage 1.04 required for testing POD coverage
t/pod.t .............. skipped: Test::Pod 1.14 required for testing POD
t/roundtrip.t ........ ok
t/segfault-form.t .... ok
t/simple.t ........... ok
t/too-many-titles.t .. ok
t/unicode-nbsp.t ..... ok
t/unicode.t .......... ok
t/venus.t ............ ok
t/version.t .......... ok
t/wordwrap.t ......... ok
All tests successful.
Files=26, Tests=78,  2 wallclock secs ( 0.03 usr +  0.11 sys =  0.14 CPU)
Result: PASS

But, as mentioned above, did not fully understand what some of the test were testing, so some fixes could be suspect...

Like adding the option force_output => 1 to remove the carp from test t/segfault-form.t. As discussed above, I feel there should be a deeper fix done in the code, but I do not presently have the skill to do that... And adding a DOCTYPE to at least 3 test to force legacy document parsing, to get the expected warnings... etc...

As the test1 branch here presently stands, it is 45 commits ahead of the upstream petdance:dev... and thus could be easily merged... maybe into a new-html5 branch, as there are already several branches with this like name, but I have no idea of their state...

Anyway, I am very happy that I have an installed and working HTML::Tidy module, in Windows, using libTidy 5.5.27 ;=))

I guess the next steps, if there are any - no expectations, rests with what you, or others want to do... thanks...

As stated before this is a great perl wrapper around current libTidy!

geoffmcl commented 6 years ago

After a lot of work, way back then, did get a Tidy.dll built and installed in my perl installation, and got a working HTML::Tidy, using library tidy version 5.5.27

 Directory of C:\Perl64\site\lib\auto\HTML\Tidy
2017-05-27  14:26           599,552 Tidy.dll

The following was my 'test', that worked with or without using the config file -

#!/usr/bin/perl -w
use strict;
use warnings;
use HTML::Tidy;

my $use_conf_file = 1;
my ($tidy);
if ($use_conf_file) {
    #my $txt = "show-body-only: 1\ntidy-mark: 0\n";
    my $txt = "indent: 1\nwrap: 0\nshow-info: 0\n";
    my $file = 'temptidy.cfg';
    open WOF, ">$file" or die "ERROR: Unable to open $file! $!\n";
    print WOF $txt;
    close WOF;
    $tidy = HTML::Tidy->new( {'config_file' => $file} );
} else {
    #my $args = { show_body_only => 1,
    #    show_info => 0 };
    my $args = { indent => 1,
        wrap => 0,
        tidy_mark => 0,
        show_info => 0 };
    $tidy = HTML::Tidy->new( $args );
}
if (! $tidy) {
    die "ERROR: Failed to load HTML::Tidy ... $! ...\n";
}
my $vers = $tidy->tidyp_version();
if ($vers) {
    print "Using version $vers\n";
}
my $doc = <<"EOF;";
 <p>Hello Tidy!</p>
EOF;
#my $rc = $tidy->parse( '-', $doc );
#print "$clean\n";
my $clean = $tidy->clean( $doc );
print "$clean\n";
my @msg = $tidy->messages();
if (@msg) {
    my $cnt = scalar @msg;
    print "Have $cnt messages...\n";
    print join("\n",@msg)."\n";
}
# tidy04.pl - eof

So belatedly closing this...

As repeatedly stated thanks for this great perl wrapper around the current libtidy...