PerlAlien / File-Listing

Module to parse directory listings

File::Listing::apache parsing of HTMLTable indexes [rt.cpan.org #50724] #7

Closed by plicease 4 years ago

plicease commented 4 years ago

https://rt.cpan.org/Ticket/Display.html?id=50724

File::Listing version 5.814 (and others) doesn't properly parse
directory listings from Apache 2.2 generated using mod_autoindex with
IndexOptions HTMLTable turned on.

I tried turning HTMLTable off in IndexOptions, but failed.  I might
figure it out someday, but in the meantime I tried to figure out how to
get File::Listing::apache to properly parse the listing.

The non-HTMLTable output from mod_autoindex would look like this:

<a href="file.ext">file.ext</a>  22-May-2009 22:35   1.0M

With HTMLTable turned on, it looks like this:

<tr><td valign="top"><img src="/icons/compressed.gif" alt="[   ]"></td><td><a href="file.ext">file.ext</a></td><td align="right">22-May-2009 22:35  </td><td align="right">1.0M</td></tr>

There are a bunch of td and tr tags included.  The problem with this is
that there is a set of tags between the HH:MM and the file size.  The
regex in File::Listing::apache expects only whitespace between HH:MM and
the file size.

I tried working up a regex to deal with all the extra HTML tags, but in
the end I figured it would be easier just to strip the tags out:

s/\<\/?(tr|th|td|img|font)[^\<]*\>//ig

Could you please add this stripping regex prior to the match regex?  I
think that it should work transparently.  I don't know all the flavors
of index listings that mod_autoindex can create, but hopefully this will
help more of them be dealt with.
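
As a rough illustration of the proposal (a sketch, not the module's actual code): the script below applies the stripping substitution to the HTMLTable row from this report and then runs a simplified matcher over the result. The names $row_re, strip_table_tags and parse_row are hypothetical, and $row_re is only a stand-in that assumes nothing but whitespace between HH:MM and the size; it is not the regex used in File::Listing::apache.

```perl
use strict;
use warnings;

# Sample rows taken from the report (the table row is joined onto one line).
my $plain = '<a href="file.ext">file.ext</a>  22-May-2009 22:35   1.0M';
my $table = '<tr><td valign="top"><img src="/icons/compressed.gif" alt="[   ]"></td>'
          . '<td><a href="file.ext">file.ext</a></td>'
          . '<td align="right">22-May-2009 22:35  </td>'
          . '<td align="right">1.0M</td></tr>';

# Hypothetical, simplified matcher: link, date, HH:MM, then the size,
# with only whitespace allowed between the fields.
my $row_re = qr{<a\s+href="([^"]+)">[^<]*</a>\s*
                (\d{1,2}-[A-Za-z]{3}-\d{4})\s+
                (\d{1,2}:\d{2})\s+
                (\S+)}xi;

# Remove tr/th/td/img/font tags with the substitution proposed above,
# working on a copy so the caller's string is left untouched.
sub strip_table_tags {
    my ($line) = @_;
    (my $copy = $line) =~ s/\<\/?(tr|th|td|img|font)[^\<]*\>//ig;
    return $copy;
}

# Return [href, date, time, size] on a match, undef otherwise.
sub parse_row {
    my ($line) = @_;
    return $line =~ $row_re ? [ $1, $2, $3, $4 ] : undef;
}

for my $line ($plain, $table, strip_table_tags($table)) {
    my $row = parse_row($line);
    print $row ? "matched: @$row\n" : "no match\n";
}
```

With this stand-in matcher, the plain row and the stripped table row both match, while the unstripped table row does not, because the closing and opening td tags sit between the fields. Note that the a tags are deliberately left in place, since the matcher needs the href.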

Thanks,

Fred

plicease commented 4 years ago

There is a regex here:

https://github.com/PerlAlien/File-Listing/blob/6a21afb91bb102973c74026171102e6dbbcb8e75/lib/File/Listing.pm#L308

We can update the regexp with the appropriate example listings and tests.

plicease commented 4 years ago

I've added tests for this configuration and it seems to work for me. If you are still having trouble with the latest version, please provide raw HTML, or preferably the Apache options to reproduce the behavior.