HW-SWeL / BMUSE

Bioschemas Mark Up Scraper and Extractor
https://app.swaggerhub.com/apis-docs/swel/BMUSE/
Apache License 2.0

RDFa doesn't scrape well, part 2 #35

Open kcmcleod opened 5 years ago

kcmcleod commented 5 years ago

Also see: https://github.com/HW-SWeL/Scraper/issues/11

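This issue is based on https://search.google.com/structured-data/testing-tool#url=https%3A%2F%2Fhamap.expasy.org (the Google Structured Data Testing Tool, GSDT, run against https://hamap.expasy.org). In particular, consider this form: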

<form property="schema:potentialAction" typeof="schema:SearchAction" action="/cgi-bin/unirule/unirule_search.cgi" method="get" name="searchForm" id="searchForm">
    <meta property="schema:target" content="/cgi-bin/unirule/unirule_search.cgi?search={search}&context=HAMAP"/>
    <input property="schema:query-input" type="text" name="search" placeholder="Search HAMAP"/>
    <input property="schema:name" type="submit" value="Search"/>
    <input type="hidden" name="context" value="HAMAP" />
</form>

The Google Structured Data Testing Tool (GSDT) produces:

[Screenshot 2019-08-15 at 11 21 09]

In Any23 you get:

<https://bioschemas.org/crawl/v1/100000/hamap.expasy.org/68969441> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://schema.org/SearchAction> <https://bioschemas.org/crawl/v1/100000> .
<https://bioschemas.org/crawl/v1/100000/hamap.expasy.org/68969441> <https://schema.org/target> "/cgi-bin/unirule/unirule_search.cgi?search={search}&context=HAMAP" <https://bioschemas.org/crawl/v1/100000> .
<https://bioschemas.org/crawl/v1/100000/hamap.expasy.org/68969441> <https://schema.org/query-input> "" <https://bioschemas.org/crawl/v1/100000> .
<https://bioschemas.org/crawl/v1/100000/hamap.expasy.org/68969441> <https://schema.org/name> "" <https://bioschemas.org/crawl/v1/100000> .
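Empty literals in output like this can be spotted mechanically. The following is a minimal sketch, not part of BMUSE: the function name `empty_literal_predicates` and the regular expression are illustrative assumptions, and the pattern only handles plain string literals in the simple four-term layout shown above (no language tags, datatypes, or blank nodes).

```python
import re

# Matches one statement of the form: <subject> <predicate> "literal" <graph> .
# Assumption: plain string literals only, as in the Any23 output above.
QUAD_WITH_LITERAL = re.compile(
    r'^<(?P<s>[^>]*)>\s+<(?P<p>[^>]*)>\s+"(?P<o>(?:[^"\\]|\\.)*)"\s+<(?P<g>[^>]*)>\s*\.\s*$'
)


def empty_literal_predicates(nquads: str) -> list[str]:
    """Return the predicates whose object is the empty string literal."""
    found = []
    for line in nquads.splitlines():
        match = QUAD_WITH_LITERAL.match(line.strip())
        if match and match.group("o") == "":
            found.append(match.group("p"))
    return found


# The four statements quoted in this issue.
SAMPLE = """\
<https://bioschemas.org/crawl/v1/100000/hamap.expasy.org/68969441> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://schema.org/SearchAction> <https://bioschemas.org/crawl/v1/100000> .
<https://bioschemas.org/crawl/v1/100000/hamap.expasy.org/68969441> <https://schema.org/target> "/cgi-bin/unirule/unirule_search.cgi?search={search}&context=HAMAP" <https://bioschemas.org/crawl/v1/100000> .
<https://bioschemas.org/crawl/v1/100000/hamap.expasy.org/68969441> <https://schema.org/query-input> "" <https://bioschemas.org/crawl/v1/100000> .
<https://bioschemas.org/crawl/v1/100000/hamap.expasy.org/68969441> <https://schema.org/name> "" <https://bioschemas.org/crawl/v1/100000> .
"""

if __name__ == "__main__":
    for predicate in empty_literal_predicates(SAMPLE):
        print(predicate)
```

A check like this could serve as a regression test once the expected behaviour is agreed on.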

Notice:

  1. The name property is an empty string.
  2. The query-input property is also an empty string (Google omits it entirely).
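
One possible explanation, offered as an assumption rather than a confirmed diagnosis: in RDFa a literal value is taken from the content attribute when present and otherwise from the element's text content, and an input element has no text content. The sketch below uses only Python's standard html.parser to list which elements in the form carry a property attribute, whether they also have a content attribute, and what their value attribute holds. The names `RdfaPropertyLister` and `list_rdfa_properties` are hypothetical helpers, not part of BMUSE or Any23.

```python
from html.parser import HTMLParser


class RdfaPropertyLister(HTMLParser):
    """Record (tag, property, has_content, value) for each element with a property attribute."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta .../> and <input .../>.
        attributes = dict(attrs)
        if "property" in attributes:
            self.rows.append(
                (
                    tag,
                    attributes["property"],
                    "content" in attributes,
                    attributes.get("value"),
                )
            )


def list_rdfa_properties(html: str) -> list[tuple]:
    """Parse an HTML fragment and return one row per element with a property attribute."""
    parser = RdfaPropertyLister()
    parser.feed(html)
    parser.close()
    return parser.rows


# The form quoted earlier in this issue.
FORM = """\
<form property="schema:potentialAction" typeof="schema:SearchAction" action="/cgi-bin/unirule/unirule_search.cgi" method="get" name="searchForm" id="searchForm">
    <meta property="schema:target" content="/cgi-bin/unirule/unirule_search.cgi?search={search}&context=HAMAP"/>
    <input property="schema:query-input" type="text" name="search" placeholder="Search HAMAP"/>
    <input property="schema:name" type="submit" value="Search"/>
    <input type="hidden" name="context" value="HAMAP" />
</form>
"""

if __name__ == "__main__":
    for tag, prop, has_content, value in list_rdfa_properties(FORM):
        print(f"{tag:6} {prop:24} content attribute: {has_content!s:5} value attribute: {value!r}")
```

On this form, only schema:target carries a content attribute. The two input elements rely on other attributes (placeholder, value), which would be consistent with the empty strings in the Any23 output if those attributes are not read as the literal.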

AlasdairGray commented 2 years ago

Needs investigation to understand the issue again