raisindetre closed this issue 9 years ago
Question: are you saying that a structure like this always gives you "a string of spaces"?
<text>
<expr value="script"/>
</text>
Basically this should collect the contents of all your scripts. Note that the text function collapses every run of contiguous whitespace (including newlines) into a single space.
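To make the whitespace behaviour concrete, here is a minimal sketch in plain Java of what "collapsing contiguous whitespace into one space" does to a multi-line script. The class and method names (`WhitespaceCollapse`, `collapse`) are made up for illustration and are not the extractor's actual implementation; the only assumption is that the collapse is equivalent to replacing each run of `\s+` with a single space.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the extractor plugin.
class WhitespaceCollapse {
    // One or more whitespace characters: spaces, tabs, newlines, etc.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Replace every run of whitespace with a single space.
    static String collapse(String raw) {
        return WHITESPACE_RUN.matcher(raw).replaceAll(" ");
    }

    public static void main(String[] args) {
        String script = "var a = 1;\n\n   var b   = 2;";
        // The newlines and indentation end up as single spaces,
        // so the script is flattened onto one line.
        System.out.println(collapse(script));
    }
}
```

If the text function behaves like this, a multi-line script ends up on a single line, so line-anchored regex patterns would no longer see separate lines.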
The problem I have is that I can't coerce the contents of fields extracted by targeting a script tag without generating this error:
java.lang.string cannot be cast to org.jsoup.nodes.element extractor
This happens even if I use code like the following:
<text>
<last>
<expr value="script"/>
</last>
</text>
Without the "text" tags, the Nutch parseChecker outputs a single JS script block (including the script tags and newlines).
I think I've found the problem: the script tag is interpreted as a CDATA section, so calling text on it returns an empty string. I'll try to resolve this in the next release, which is due next week.
I'm attempting to extract an encoded string out of some JavaScript nested in script tags. The decode function works well with the text when I test it as a constant but I'm having trouble coercing the script contents into text so I can use regex replacement on it to isolate my string.
Basically, any attempt to coerce the returned script into a string by wrapping the expression in text tags results in a string of spaces. If I don't use the text wrapper tags, tags such as "concat" and "replace" produce parsing errors.
Is there a way to coerce something like the following into text so string functions can work on the script code (using CSS engine)?
<for-each root="script">
<expr value="script"/>
</for-each>
Also, is a multi-line script coerced into a single line when evaluated (i.e. are newlines stripped)? Does this happen when a value is coerced into text using "text" tags? If not, can the regular expressions used in the "replace" function be set to multiline?
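For reference on the multiline question, here is a small sketch of how multiline matching works in plain `java.util.regex`. It assumes, without confirmation from the plugin, that the "replace" function hands its pattern string to Java's regex engine; if so, the inline flag `(?m)` (or `(?s)` to let `.` match newlines) can be embedded in the pattern itself. The class name, the variable name `encoded`, and the sample script are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the extractor plugin.
class ScriptRegexDemo {
    // Matches a whole line of the form:  var encoded = "....";
    private static final String LINE = "^\\s*var\\s+encoded\\s*=\\s*\"([^\"]*)\";\\s*$";

    // MULTILINE makes ^ and $ match at line boundaries, not only at
    // the start and end of the whole input.
    private static final Pattern WITH_FLAG = Pattern.compile(LINE, Pattern.MULTILINE);

    // Same behaviour using the inline flag, for APIs that accept
    // only a pattern string.
    private static final Pattern WITH_INLINE_FLAG = Pattern.compile("(?m)" + LINE);

    static String extractEncoded(String script) {
        Matcher m = WITH_FLAG.matcher(script);
        return m.find() ? m.group(1) : null;
    }

    static String extractEncodedInline(String script) {
        Matcher m = WITH_INLINE_FLAG.matcher(script);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String script = "function f() {}\nvar encoded = \"SGVsbG8=\";\nf();\n";
        System.out.println(extractEncoded(script));
        System.out.println(extractEncodedInline(script));
    }
}
```

Without the multiline flag, `^` and `$` only anchor to the start and end of the entire script, so a line-anchored pattern like this one would not match a line in the middle of the block.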
Thanks,