claytongentry / furlex

A structured data extraction tool written in Elixir
https://hex.pm/packages/furlex

Issue with duplicate meta tags #8

Closed abitdodgy closed 6 years ago

abitdodgy commented 7 years ago

When scraping a page with duplicate meta tags, the duplicated content is parsed into an improper list of the form [head | tail] instead of a proper list.

For example:

<meta property="foo" content="bar" />
<meta property="foo" content="bar" />

Is parsed as

other: %{"foo" => ["bar" | "bar"]}

This prevents the response from being JSON encoded.
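To illustrate why the shape matters, here is a minimal sketch using only the standard library. ImproperListDemo and check/1 are hypothetical names for this example, not part of Furlex. It shows that ["bar" | "bar"] still passes is_list/1, but functions that walk the whole list, such as length/1, raise on it. A JSON encoder that iterates over the list would presumably hit the same problem.

```elixir
defmodule ImproperListDemo do
  # Hypothetical helper for illustration only, not part of Furlex.
  # Returns :proper when the list ends in [], :improper otherwise.
  def check(list) when is_list(list) do
    # length/1 must walk to the end of the list and raises
    # ArgumentError when the final tail is not [].
    _ = length(list)
    :proper
  rescue
    ArgumentError -> :improper
  end
end

# The tail here is the binary "bar", so the list is improper.
IO.inspect(ImproperListDemo.check(["bar" | "bar"]))

# ["bar", "bar"] is shorthand for ["bar" | ["bar" | []]], a proper list.
IO.inspect(ImproperListDemo.check(["bar", "bar"]))
```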

I found this while scraping this URL. Notice that the data under other uses a | separator in the lists for duplicate tags.

%{canonical_url: "https://www.kigakids.com.br/calca-skinny-milk",
  facebook: %{"fb:app_id" => ["815228055326045", " 815228055326045"],
    "og:description" => ["A Calça Skinny Milk é prática para toda hora, com o conforto que a estampa digital proporciona, pois não agride a pele do bebê com o toque áspero da estampa tradicional. ",
     "A KIGA KIDS nasceu em 2017 para buscar e oferecer uma seleta linha de roupas e acessórios para bebês e crianças até 5 anos."],
    "og:image" => ["https://cdn2.awsli.com.br/800x800/489/489262/produto/17975164/8b5396a122.jpg",
     "https://cdn.awsli.com.br/489/489262/logo/ff3efba575.png"],
    "og:locale" => ["pt_BR", "pt_BR"],
    "og:site_name" => ["KIGA KIDS", "KIGA KIDS"],
    "og:title" => ["CALÇA SKINNY MILK", "KIGA KIDS"],
    "og:type" => ["website", "website", "website"],
    "og:url" => ["https://www.kigakids.com.br/calca-skinny-milk",
     "https://www.kigakids.com.br/"]}, json_ld: [], oembed: nil,
  other: %{"description" => ["A KIGA KIDS nasceu em 2017 para buscar e oferecer uma seleta linha de roupas e acessórios para bebês e crianças até 5 anos." |
     "A Calça Skinny Milk é prática para toda hora, com o conforto que a estampa digital proporciona, pois não agride a pele do bebê com o toque áspero da estampa tradicional. "],
    "generator" => ["Loja Integrada" | "Loja Integrada"],
    "google-site-verification" => ["GbnYBmQLHGrgQRVEi4b2fzcrAA81TMh86T3Z1kDDW-c",
     "og5Ef6ntOLY0CrU0H8mURx_WwrlZc9Hz2HDXQGWOdAg" |
     "66Kpz8sWyMtS35U7Eodir6sXoV5gJe7a9kNN9xQQnYE"],
    "robots" => ["index, follow" | "index, follow"],
    "theme-color" => ["#289db9" | "#289db9"], "twitter:data1" => "DXL2DZLXB",
    "twitter:data2" => "None dia útil", "twitter:label1" => "Código",
    "twitter:label2" => "Disponibilidade",
    "viewport" => ["width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=0" |
     "width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=0"]},
  status_code: 200,
  twitter: %{"twitter:card" => "product", "twitter:creator" => "@",
    "twitter:description" => "A Calça Skinny Milk é prática para toda hora, com o conforto que a estampa digital proporciona, pois não agride a pele do bebê com o toque áspero da estampa tradicional. ",
    "twitter:domain" => "www.kigakids.com.br",
    "twitter:image" => "https://cdn2.awsli.com.br/300x300/489/489262/produto/17975164/8b5396a122.jpg",
    "twitter:site" => "@", "twitter:title" => "CALÇA SKINNY MILK",
    "twitter:url" => "https://www.kigakids.com.br/calca-skinny-milk?utm_source=twitter&utm_medium=twitter&utm_campaign=twitter"}}

abitdodgy commented 7 years ago

OK, so this problem seems to happen when the meta tag is duplicated.

claytongentry commented 6 years ago

Thanks, will check this out.

abitdodgy commented 6 years ago

It's a fairly easy fix, actually. I'm not sure if I missed something, but the change below seems to have fixed the problem for me.

I replaced the | with a , on line 36 of the parser.

Map.put(acc, key, [to_add, value])

I didn't write any tests for it, though.

claytongentry commented 6 years ago

Hey @abitdodgy, I dug into this further. I wanted to make sure the list supports additional elements as the data is accumulated, so I went with a prepend function. See here: https://github.com/claytongentry/furlex/blob/91184fd383f8362e2033f0dcf60d0c6a6f655157/lib/furlex/parser/html.ex#L38
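As a rough sketch of the idea (not the code at the link above), accumulation with a prepend might look like the following. MetaAccumulator, put/3, and prepend/2 are hypothetical names chosen for this example. The point is that a third or fourth duplicate extends the existing list rather than nesting it.

```elixir
defmodule MetaAccumulator do
  # Hypothetical sketch, not Furlex's actual implementation.
  # Stores the first value for a key as-is and collects any
  # further values for the same key into a proper list.
  def put(acc, key, value) do
    Map.update(acc, key, value, fn existing -> prepend(existing, value) end)
  end

  # Already a list: prepend onto it, which keeps the list proper.
  defp prepend(existing, value) when is_list(existing), do: [value | existing]

  # A single value so far: start a two-element proper list.
  defp prepend(existing, value), do: [value, existing]
end

acc =
  %{}
  |> MetaAccumulator.put("og:type", "website")
  |> MetaAccumulator.put("og:type", "article")
  |> MetaAccumulator.put("og:type", "product")

IO.inspect(acc)
```

Note that prepending leaves the most recently seen value first. A caller that needs document order would have to reverse the list afterwards.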

I used the source from the page you referenced as a test fixture and asserted that the output is now JSON-encodable. I also added de-duping, plus unwrapping a value when it is the only item left in a list, e.g. ["Loja Integrada", "Loja Integrada"] -> ["Loja Integrada"] -> "Loja Integrada".
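The de-dupe and unwrap step described above could be sketched roughly like this. MetaNormalizer and normalize/1 are hypothetical names for this example, and the actual implementation in the library may differ.

```elixir
defmodule MetaNormalizer do
  # Hypothetical sketch, not Furlex's actual implementation.
  # Removes duplicate entries, then unwraps the value when
  # only one entry remains.
  def normalize(values) when is_list(values) do
    case Enum.uniq(values) do
      [single] -> single
      many -> many
    end
  end

  # Non-list values pass through untouched.
  def normalize(value), do: value
end

IO.inspect(MetaNormalizer.normalize(["Loja Integrada", "Loja Integrada"]))
IO.inspect(MetaNormalizer.normalize(["CALÇA SKINNY MILK", "KIGA KIDS"]))
```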