HtmlUnit / htmlunit-neko

HtmlUnit adaptation of NekoHtml
Apache License 2.0
17 stars 13 forks source link

Support for porting from Cyberneko-NekoHTML to HtmlUnit-Neko #79

Closed BloodDrag0n closed 6 months ago

BloodDrag0n commented 8 months ago

Hi, our project is using Antisamy-1.6.4 and nekohtml-1.9.22 for a very long time. We have decided to move to a newer version of the anitsamy, but antisamy has ported from nekohtml to htmlunit-neko, in our project we have separately used nekohtml for additional xss filtering by using "org.cyberneko.html.filters.ElementRemover" and "org.cyberneko.html.filters.Writer", but HtmlUnit-Neko seems to have removed these classes and also found that HTMLEntities.java has been removed and entities has been handled by HTMLNamesEntitiesParser.java.

May we know the reason for the removal of those classes and how to overcome this issue while porting?

rbri commented 8 months ago

Hi @BloodDrag0n,

nekohtml-1.9.22 is a really old one. In general htmlunit-neko is a maintained fork of nekoktml mainly used for HtmlUnit. Therefore i have removed some stuff not required by HtmlUnit to reduce the code to be maintained.

org.cyberneko.html.filters.ElementRemover

If you like to have ElementRemover class back i see three options

org.cyberneko.html.filters.Writer

The work on HTMLEntities was done during the last year. The refactorings here are mainly done for performance reasons. I guess there is a good chance to support you with some code enhancements.

Finally - is your repo public?

BloodDrag0n commented 8 months ago

Hi @rbri !

Thanks for getting back to me!

In our project, we've got some classes written by overriding the ElementRemover and Writer classes. So, our plan is to extend the DefaultFilter and bring in the functionality we need. I'm reaching out because I'm not quite sure why these classes were removed, and, well, our product is closed source.

Any insights or tips you could share would be awesome!

rbri commented 8 months ago

Your plan sounds reasonable to me, i think you can start simply by using the ElementRemover code and adapt that for the current version. As mentioned above i would do it like this

Let us use this ticket to discuss further if needed

BloodDrag0n commented 8 months ago

Hi @rbri, I tried to achieve the following HTMLEntities functionalities using HTMLNamedEntitiesParser

I can achieve the "get(String name)" by using

HTMLNamedEntitiesParser.get().lookup("name").resolvedValue;

But I can't find a way to achieve this method's 'public static String get(int c)' usage. Usage - by providing character '<' and get its html entity value "lt" of &lt;

 // Returns the character associated to the given entity name, or
 //-1 if the name is not known.

public static int get(String name) {
    String value = (String)ENTITIES.get(name);
    return value != null ? value.charAt(0) : -1;
} // get(String):char

 //Returns the name associated to the given character or null if
 //the character is not known.

public static String get(int c) {
    return SEITITNE.get(c);
} // get(int):String

Is there any way to achieve this function?

rbri commented 8 months ago

Maybe you can read the same definition file (and store the reversed map in your program)

    final HashMap<String, String> SEITITNE = new HashMap<>();

    try (InputStream stream = HTMLNamedEntitiesParser.class.getResourceAsStream("html_entities.properties")) {
        final Properties props = new Properties();
        props.load(stream);

        props.forEach((k, v) -> {
            final String key = (String) k;
            final String value = (String) v;

            // we might have an empty line in it
            if (key.trim().isEmpty()) {
                return;
            }

            SEITITNE.put(value, key);
        });
    }

    System.out.println(SEITITNE.get(">"));
rbri commented 8 months ago

Or i can provide a helper function like this for HTMLNamedEntitiesParser ...

public static void forEachNamedHtmlEntity(BiConsumer<? super Object, ? super Object> action) {
    ....
}
sumollodha commented 8 months ago

@BloodDrag0n How did you finally implement public static String get(int c) ? I am facing the same issue while porting?

BloodDrag0n commented 7 months ago

@sumollodha I have adapted the IntProperties class from nekohtml-HTMLEntities and created an instance of the IntProperties class - ENTITIES. In this constructor HTMLNamedEntitiesParser(), the values from the property file "html_entities.properties" will be loaded into "RootState rootlevel" in a key-value (eg: gneq;=\u2A88), along with that process load the properties in a reverse manner, i.e. value-key (eg: \u2A88=gneq;) into "ENTITIES" and create a method like get(Int c) to fetch the value from "ENTITIES"

private final IntProperties ENTITIES = new IntProperties(); IntPrperties class instance to store the entities from property file

 if (value.length() == 1) {
         final int ivalue = value.charAt(0);
         ENTITIES.put(ivalue, key);
 }

add this code before "this.rootLevel.add(key, value);" in the constructor "HTMLNamedEntitiesParser()"

public String getEntity(int c) {
        return ENTITIES.get(c);
}

Method to get entities similar like get(Int c)

static class IntProperties {
        private Entry[] entries;
        private IntProperties() {
            entries = new Entry[101];           
        }

        public void put(int key, String value) {
            int hash = key % entries.length;
            Entry entry = new Entry(key, value, entries[hash]);
            entries[hash] = entry;
        }
        public String get(int key) {
            int hash = key % entries.length;
            Entry entry = entries[hash];
            while (entry != null) {
                if (entry.key == key) {
                    return entry.value;
                }
                entry = entry.next;
            }
            return null;
        }
        static class Entry {
            public int key;
            public String value;
            public Entry next;
            public Entry(int key, String value, Entry next) {
                this.key = key;
                this.value = value;
                this.next = next;
            }
        }
    }

IntProperties class from Old HTMLEntities.java from nekohtml

garg23neha commented 7 months ago

My project is also using org.cyberneko.html.filters.Writer but it is removed from this HtmlUnit-Neko. Are there any plans to bring it back?

garg23neha commented 7 months ago

@BloodDrag0n Have you implemented org.cyberneko.html.filters.Writer?

BloodDrag0n commented 7 months ago

We have adapted that class into our project by extending the DefaultFilter and made changes respective to the HtmlUnit-Neko.

rbri commented 7 months ago

@garg23neha, @BloodDrag0n should i bring the org.cyberneko.html.filters.Writer back?

sumollodha commented 7 months ago

+1 for bringing the Writer filter back.

On Sat, Feb 3, 2024, 1:15 AM RBRi @.***> wrote:

@garg23neha https://github.com/garg23neha, @BloodDrag0n https://github.com/BloodDrag0n should i bring the filter back?

— Reply to this email directly, view it on GitHub https://github.com/HtmlUnit/htmlunit-neko/issues/79#issuecomment-1925237876, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA6LNOUGD5FPDETCDD45WT3YRX5Z7AVCNFSM6AAAAABBXSYRPSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMRVGIZTOOBXGY . You are receiving this because you were mentioned.Message ID: @.***>

rbri commented 7 months ago

OK - made a new SNAPSHOT build 3.12.0-SNAPSHOT

@BloodDrag0n @sumollodha @garg23neha please test

garg23neha commented 7 months ago

@rbri can you release the final build.

rbri commented 7 months ago

@garg23neha released :-)

rbri commented 7 months ago

@BloodDrag0n @sumollodha @garg23neha Writer and the option to get the entity for a char are back in the latest release - can i close this or do you need something more?

garg23neha commented 7 months ago

@rbri why have you removed setDocumentSource() method in this PR . Is there any replacement for this method? I am using this method in my project.

BloodDrag0n commented 7 months ago

Hi @rbri, we are going to test the new writer class in out project and will it be possible for you to bring back the filter --> ElementRemover.java class, we are using it to blacklist some html components.

Thank You.

rbri commented 7 months ago

@BloodDrag0n

we are going to test the new writer class in out project

great, open new issues if you face and problem, and do the same for the .HTMLNamedEntitiesParser.lookupEntityRefFor(String) method

will it be possible for you to bring back the filter --> ElementRemover.java class

As always i have a lot of stuff on my desk and please remind i do all the open source stuff in my spare time...

This means, i see two options to have this back during the next weeks

sumollodha commented 7 months ago

@rbri I am exactly not sure how below 2 functions are mapped public static String get(int c) { return SEITITNE.get(c); } // get(int):String

` /**

The first one takes int as argument and second one as String. Could you clarify the how port

public static String get(int c) { return SEITITNE.get(c); } // get(int):String

Thanks!

sumollodha commented 7 months ago

Thanks for bringing the Writer Class back! :)

rbri commented 7 months ago

I am exactly not sure how below 2 functions are mapped

@sumollodha I really sure you can figure out this by yourself ;-) (maybe a look at the unit test might help)

sumollodha commented 6 months ago

@rbri Can we make the fields of HTMLWriterFilter protected instead of private. Earlier Writer class had those fields protected and we were able to override those fields in child classes.

protected String fEncoding;

/** 
 * The print writer used for serializing the document with the
 * appropriate character encoding. 
 */
protected PrintWriter fPrinter;

// state

/** Seen root element. */
protected boolean fSeenRootElement;

/** Seen http-equiv directive. */
protected boolean fSeenHttpEquiv;

/** Element depth. */
protected int fElementDepth;

/** Normalize character content. */
protected boolean fNormalize;

/** Print characters. */
protected boolean fPrintChars;
sumollodha commented 6 months ago

Also what is the equivalent of protected boolean fSeenHttpEquiv; in the new HTMLWriterFilter ?

sumollodha commented 6 months ago

@rbri I am trying to extend HTMLTagBalancer but can't do so since I am receiving There is no default constructor available in 'org.htmlunit.cyberneko.HTMLTagBalancer'

public class SelfClosingTagPreservingHTMLTagBalancer extends HTMLTagBalancer {

    @Override
    public void emptyElement(QName element, XMLAttributes attrs, Augmentations augs) throws XNIException
    {
        fDocumentHandler.emptyElement(element, attrs, augs);
    }
}

Any idea on how I can get around the same?

rbri commented 6 months ago

@sumollodha just made a new 4.0.0-SNAPSHOT build with protected getters for all the internal fields to allow subclassing

fSeenHttpEquiv was never used in the old code - therefore i removed it

rbri commented 6 months ago

@sumollodha why you like to enhance the writer - do you need any more features?

rbri commented 6 months ago

have updated the snapshot again, printChars is revoved because this was also never used; have added a print method instead of giving direct access to the print writer

garg23neha commented 6 months ago

@rbri Thanks for bringing back the Writer. Why have you removed setDocumentSource() method in this PR . Is there any replacement for this method? I am using this method in my project.

rbri commented 6 months ago

@garg23neha setDocumentSource is also back - have fun.

rbri commented 6 months ago

@BloodDrag0n, @sumollodha , @garg23neha because this is far too long for me to track all the stuff i will close this now. Please open new issues if you need more.

The current implementation is available as 4.0.0-SNAPSHOT.