Closed abitdodgy closed 6 years ago
OK, so this problem seems to happen when the meta tag is duplicated.
Thanks, will check this out.
It's a fairly easy fix, actually. I'm not sure if I missed something, but this seems to have fixed the problem for me.
I replaced the `|` with a `,` on line 36 of the parser.
Map.put(acc, key, [to_add, value])
I didn't write any tests for it, though.
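For anyone following along, here is a minimal sketch of why that one-character change matters. The variable values are placeholders, not data taken from the parser. In Elixir, `[a | b]` conses `a` onto the tail `b`, so when `b` is a string rather than a list the result is an improper list, while `[a, b]` is an ordinary two-element list.

```elixir
# Placeholder values standing in for two duplicate meta tag contents.
to_add = "Loja Integrada"
value = "Loja Integrada"

# `|` conses onto a tail; a binary tail makes the list improper.
improper = [to_add | value]

# `,` builds an ordinary (proper) two-element list.
proper = [to_add, value]

IO.inspect(List.improper?(improper), label: "cons form improper?")
IO.inspect(List.improper?(proper), label: "comma form improper?")
IO.inspect(length(proper), label: "comma form length")
```

Note that `List.improper?/1` requires Elixir 1.8 or later.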
Hey @abitdodgy — I dug into this further. I want to make sure the list supports additional elements as the data is accumulated, so I went with a prepend function. See here: https://github.com/claytongentry/furlex/blob/91184fd383f8362e2033f0dcf60d0c6a6f655157/lib/furlex/parser/html.ex#L38
I used the source from the page you referenced as a test fixture and asserted the output is now json-encodable. Also added de-duping and extracting elements if they were the only item in a list, e.g. ["Loja Integrada", "Loja Integrada"] -> ["Loja Integrada"] -> "Loja Integrada".
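A rough sketch of the approach described above, for readers who do not want to open the link. This is not the code at that commit: the module name `MetaAccumulator`, its functions, and the sample tags are all hypothetical. It assumes tags arrive as `{key, value}` pairs and shows prepending on duplicates, then de-duping and unwrapping single-item lists.

```elixir
defmodule MetaAccumulator do
  # Hypothetical helper for illustration only.

  # Store the first value as-is; on a duplicate key, grow a proper list.
  def put(acc, key, value) do
    Map.update(acc, key, value, fn existing -> prepend(existing, value) end)
  end

  # Already a list: cons the new value onto it (the tail is a list, so it stays proper).
  defp prepend(existing, value) when is_list(existing), do: [value | existing]
  # First duplicate: wrap both values in a new list.
  defp prepend(existing, value), do: [value, existing]

  # De-dupe every list and unwrap lists left with a single element.
  def finalize(acc) do
    Map.new(acc, fn {key, value} -> {key, collapse(value)} end)
  end

  defp collapse(values) when is_list(values) do
    case Enum.uniq(values) do
      [single] -> single
      many -> many
    end
  end

  defp collapse(value), do: value
end

# Made-up sample input with two kinds of duplicates.
tags = [
  {"og:site_name", "Loja Integrada"},
  {"og:site_name", "Loja Integrada"},
  {"og:image", "a.png"},
  {"og:image", "b.png"},
  {"og:image", "c.png"}
]

result =
  tags
  |> Enum.reduce(%{}, fn {key, value}, acc -> MetaAccumulator.put(acc, key, value) end)
  |> MetaAccumulator.finalize()

IO.inspect(result)
```

Because values are prepended, a key with several distinct values ends up in reverse document order. Reverse the list in `collapse/1` if order matters.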
When scraping a page with duplicate meta tags, duplicate content is parsed as a list `[head | tail]`. For example:
Is parsed as
This prevents the response from being JSON encoded.
I found this while scraping this url. Notice the data under `other` using a `|` separator in the list for duplicate tags.