BayanGroup / nutch-custom-search


Parsing Javascript script tags #19

Closed raisindetre closed 9 years ago

raisindetre commented 9 years ago

I'm attempting to extract an encoded string out of some JavaScript nested in script tags. The decode function works well with the text when I test it as a constant but I'm having trouble coercing the script contents into text so I can use regex replacement on it to isolate my string.

Basically, any attempt to coerce the returned script into a string by wrapping the expression in "text" tags results in a string of spaces. If I don't use the "text" wrapper tags, functions such as "concat" and "replace" produce parsing errors.

Is there a way to coerce something like the following into text so string functions can work on the script code (using CSS engine)?

<for-each root="script">
  <expr value="script"/>
</for-each>

Also, is a multi-line script coerced into a single line when evaluated (i.e., are newlines stripped)? Does this happen when a value is coerced into text using "text" tags? If not, can the regular expressions used in the "replace" function be set to multiline mode?

Thanks,

tahagh commented 9 years ago

Question: are you saying that when you use a structure like this, you always get "a string of spaces"?

<text>
  <expr value="script"/>
</text>

Basically, this should collect the contents of all your scripts. The text function collapses each run of contiguous whitespace (including newlines) into a single space.
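To make the whitespace point concrete, here is a minimal Java sketch of what that kind of collapsing implies for regex matching. It is not the extractor's actual code: the class, the helper names, the `decode(...)` call, and the sample payload are all hypothetical, and only the standard `java.util.regex` API is used. It also shows that, on raw multi-line text, a `.`-based pattern only spans newlines when `Pattern.DOTALL` is set.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhitespaceCollapseSketch {

    // Hypothetical pattern: captures the first quoted string after "decode(".
    private static final String ARG = "decode\\(.*?\"([^\"]*)\"";

    // Replaces every run of whitespace (spaces, tabs, newlines) with one space.
    public static String collapse(String raw) {
        return raw.replaceAll("\\s+", " ");
    }

    // Returns the captured string, or null when the pattern does not match.
    public static String firstQuotedArg(String input, int flags) {
        Matcher m = Pattern.compile(ARG, flags).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Assumed sample script body; the real page content will differ.
        String script = "var payload = decode(\n    \"SGVsbG8=\"\n);\n";

        // Raw text, no flags: '.' stops at the newline, so nothing matches.
        System.out.println(firstQuotedArg(script, 0));

        // Raw text with DOTALL: '.' also matches line terminators.
        System.out.println(firstQuotedArg(script, Pattern.DOTALL));

        // Collapsed text: the newlines are gone, so no flag is needed.
        System.out.println(firstQuotedArg(collapse(script), 0));
    }
}
```

If the text function behaves as described, patterns applied after it should not need any multiline handling, since the input no longer contains newlines.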

raisindetre commented 9 years ago

The problem is that I can't coerce the contents of fields extracted by targeting a script tag into text without generating this error:

java.lang.string cannot be cast to org.jsoup.nodes.element extractor

This happens even if I use code like the following:

<text>
  <last>
    <expr value="script"/>
  </last>
</text>

Without the "text" tags, the Nutch parseChecker outputs a single JS script block (including the script tags and newlines).

tahagh commented 9 years ago

I think I've figured out where the problem is: the content of the script tag is interpreted as a CDATA section, so calling text on it returns an empty string. I'll try to resolve this issue in the next release, which is due next week.