Closed BloodDrag0n closed 6 months ago
Hi @BloodDrag0n,
nekohtml-1.9.22 is a really old one. In general htmlunit-neko is a maintained fork of nekoktml mainly used for HtmlUnit. Therefore i have removed some stuff not required by HtmlUnit to reduce the code to be maintained.
org.cyberneko.html.filters.ElementRemover
If you like to have ElementRemover class back i see three options
org.cyberneko.html.filters.Writer
The work on HTMLEntities was done during the last year. The refactorings here are mainly done for performance reasons. I guess there is a good chance to support you with some code enhancements.
Finally - is your repo public?
Hi @rbri !
Thanks for getting back to me!
In our project, we've got some classes written by overriding the ElementRemover and Writer classes. So, our plan is to extend the DefaultFilter and bring in the functionality we need. I'm reaching out because I'm not quite sure why these classes were removed, and, well, our product is closed source.
Any insights or tips you could share would be awesome!
Your plan sounds reasonable to me, i think you can start simply by using the ElementRemover code and adapt that for the current version. As mentioned above i would do it like this
Let us use this ticket to discuss further if needed
Hi @rbri, I tried to achieve the following HTMLEntities functionalities using HTMLNamedEntitiesParser
I can achieve the "get(String name)" by using
HTMLNamedEntitiesParser.get().lookup("name").resolvedValue;
But I can't find a way to achieve this method's 'public static String get(int c)' usage.
Usage - by providing character '<' and get its html entity value "lt" of <
// Returns the character associated to the given entity name, or //-1 if the name is not known. public static int get(String name) { String value = (String)ENTITIES.get(name); return value != null ? value.charAt(0) : -1; } // get(String):char //Returns the name associated to the given character or null if //the character is not known. public static String get(int c) { return SEITITNE.get(c); } // get(int):String
Is there any way to achieve this function?
Maybe you can read the same definition file (and store the reversed map in your program)
final HashMap<String, String> SEITITNE = new HashMap<>();
try (InputStream stream = HTMLNamedEntitiesParser.class.getResourceAsStream("html_entities.properties")) {
final Properties props = new Properties();
props.load(stream);
props.forEach((k, v) -> {
final String key = (String) k;
final String value = (String) v;
// we might have an empty line in it
if (key.trim().isEmpty()) {
return;
}
SEITITNE.put(value, key);
});
}
System.out.println(SEITITNE.get(">"));
Or i can provide a helper function like this for HTMLNamedEntitiesParser ...
public static void forEachNamedHtmlEntity(BiConsumer<? super Object, ? super Object> action) {
....
}
@BloodDrag0n How did you finally implement public static String get(int c)
? I am facing the same issue while porting?
@sumollodha I have adapted the IntProperties class from nekohtml-HTMLEntities and created an instance of the IntProperties class - ENTITIES. In this constructor HTMLNamedEntitiesParser(), the values from the property file "html_entities.properties" will be loaded into "RootState rootlevel" in a key-value (eg: gneq;=\u2A88), along with that process load the properties in a reverse manner, i.e. value-key (eg: \u2A88=gneq;) into "ENTITIES" and create a method like get(Int c) to fetch the value from "ENTITIES"
private final IntProperties ENTITIES = new IntProperties();
IntPrperties class instance to store the entities from property file
if (value.length() == 1) {
final int ivalue = value.charAt(0);
ENTITIES.put(ivalue, key);
}
add this code before "this.rootLevel.add(key, value);" in the constructor "HTMLNamedEntitiesParser()"
public String getEntity(int c) {
return ENTITIES.get(c);
}
Method to get entities similar like get(Int c)
static class IntProperties {
private Entry[] entries;
private IntProperties() {
entries = new Entry[101];
}
public void put(int key, String value) {
int hash = key % entries.length;
Entry entry = new Entry(key, value, entries[hash]);
entries[hash] = entry;
}
public String get(int key) {
int hash = key % entries.length;
Entry entry = entries[hash];
while (entry != null) {
if (entry.key == key) {
return entry.value;
}
entry = entry.next;
}
return null;
}
static class Entry {
public int key;
public String value;
public Entry next;
public Entry(int key, String value, Entry next) {
this.key = key;
this.value = value;
this.next = next;
}
}
}
IntProperties class from Old HTMLEntities.java from nekohtml
My project is also using org.cyberneko.html.filters.Writer but it is removed from this HtmlUnit-Neko. Are there any plans to bring it back?
@BloodDrag0n Have you implemented org.cyberneko.html.filters.Writer?
We have adapted that class into our project by extending the DefaultFilter and made changes respective to the HtmlUnit-Neko.
@garg23neha, @BloodDrag0n should i bring the org.cyberneko.html.filters.Writer back?
+1 for bringing the Writer filter back.
On Sat, Feb 3, 2024, 1:15 AM RBRi @.***> wrote:
@garg23neha https://github.com/garg23neha, @BloodDrag0n https://github.com/BloodDrag0n should i bring the filter back?
— Reply to this email directly, view it on GitHub https://github.com/HtmlUnit/htmlunit-neko/issues/79#issuecomment-1925237876, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA6LNOUGD5FPDETCDD45WT3YRX5Z7AVCNFSM6AAAAABBXSYRPSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMRVGIZTOOBXGY . You are receiving this because you were mentioned.Message ID: @.***>
OK - made a new SNAPSHOT build 3.12.0-SNAPSHOT
@BloodDrag0n @sumollodha @garg23neha please test
@rbri can you release the final build.
@garg23neha released :-)
@BloodDrag0n @sumollodha @garg23neha Writer and the option to get the entity for a char are back in the latest release - can i close this or do you need something more?
@rbri why have you removed setDocumentSource() method in this PR . Is there any replacement for this method? I am using this method in my project.
Hi @rbri, we are going to test the new writer class in out project and will it be possible for you to bring back the filter --> ElementRemover.java class, we are using it to blacklist some html components.
Thank You.
@BloodDrag0n
we are going to test the new writer class in out project
great, open new issues if you face and problem, and do the same for the .HTMLNamedEntitiesParser.lookupEntityRefFor(String) method
will it be possible for you to bring back the filter --> ElementRemover.java class
As always i have a lot of stuff on my desk and please remind i do all the open source stuff in my spare time...
This means, i see two options to have this back during the next weeks
@rbri I am exactly not sure how below 2 functions are mapped
public static String get(int c) { return SEITITNE.get(c); } // get(int):String
` /**
The first one takes int as argument and second one as String. Could you clarify the how port
public static String get(int c) { return SEITITNE.get(c); } // get(int):String
Thanks!
Thanks for bringing the Writer Class back! :)
I am exactly not sure how below 2 functions are mapped
@sumollodha I really sure you can figure out this by yourself ;-) (maybe a look at the unit test might help)
@rbri Can we make the fields of HTMLWriterFilter protected instead of private. Earlier Writer class had those fields protected and we were able to override those fields in child classes.
protected String fEncoding;
/**
* The print writer used for serializing the document with the
* appropriate character encoding.
*/
protected PrintWriter fPrinter;
// state
/** Seen root element. */
protected boolean fSeenRootElement;
/** Seen http-equiv directive. */
protected boolean fSeenHttpEquiv;
/** Element depth. */
protected int fElementDepth;
/** Normalize character content. */
protected boolean fNormalize;
/** Print characters. */
protected boolean fPrintChars;
Also what is the equivalent of protected boolean fSeenHttpEquiv;
in the new HTMLWriterFilter ?
@rbri I am trying to extend HTMLTagBalancer but can't do so since I am receiving There is no default constructor available in 'org.htmlunit.cyberneko.HTMLTagBalancer'
public class SelfClosingTagPreservingHTMLTagBalancer extends HTMLTagBalancer {
@Override
public void emptyElement(QName element, XMLAttributes attrs, Augmentations augs) throws XNIException
{
fDocumentHandler.emptyElement(element, attrs, augs);
}
}
Any idea on how I can get around the same?
@sumollodha just made a new 4.0.0-SNAPSHOT build with protected getters for all the internal fields to allow subclassing
fSeenHttpEquiv was never used in the old code - therefore i removed it
@sumollodha why you like to enhance the writer - do you need any more features?
have updated the snapshot again, printChars is revoved because this was also never used; have added a print method instead of giving direct access to the print writer
@rbri Thanks for bringing back the Writer. Why have you removed setDocumentSource() method in this PR . Is there any replacement for this method? I am using this method in my project.
@garg23neha setDocumentSource is also back - have fun.
@BloodDrag0n, @sumollodha , @garg23neha because this is far too long for me to track all the stuff i will close this now. Please open new issues if you need more.
The current implementation is available as 4.0.0-SNAPSHOT.
Hi, our project is using Antisamy-1.6.4 and nekohtml-1.9.22 for a very long time. We have decided to move to a newer version of the anitsamy, but antisamy has ported from nekohtml to htmlunit-neko, in our project we have separately used nekohtml for additional xss filtering by using "org.cyberneko.html.filters.ElementRemover" and "org.cyberneko.html.filters.Writer", but HtmlUnit-Neko seems to have removed these classes and also found that HTMLEntities.java has been removed and entities has been handled by HTMLNamesEntitiesParser.java.
May we know the reason for the removal of those classes and how to overcome this issue while porting?