Scraper fails to parse JSON-LD from NHM Paris page

frmichel commented 4 years ago

Hey guys, me again.

Still on the same page https://inpn.mnhn.fr/espece/cd_nom/60878, which contains several annotations in RDFa and a Taxon annotation in JSON-LD. The scraping result only shows what was extracted from RDFa, not from JSON-LD. On closer inspection, it appears that Any23 fails while parsing the JSON-LD, and I can't figure out why:

13:39:04.265 [DEBUG] org.apache.any23.extractor.SingleDocumentExtraction - Context: ExtractionContext(urn:x-any23:html-embedded-jsonld:root-extraction-result-id:https://inpn.mnhn.fr/espece/cd_nom/60878) [errors: 3] {
FATAL:  'org.eclipse.rdf4j.rio.RDFParseException: Could not parse JSONLD
    at org.eclipse.rdf4j.rio.jsonld.JSONLDParser.parse(JSONLDParser.java:74)
    at org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:226)
    at org.apache.any23.extractor.html.EmbeddedJSONLDExtractor.extractJSONLDScript(EmbeddedJSONLDExtractor.java:149)
    at org.apache.any23.extractor.html.EmbeddedJSONLDExtractor.run(EmbeddedJSONLDExtractor.java:83)
    at org.apache.any23.extractor.html.EmbeddedJSONLDExtractor.run(EmbeddedJSONLDExtractor.java:54)
    at org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:527)
    at org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:263)
    at org.apache.any23.Any23.extract(Any23.java:299)
    at org.apache.any23.Any23.extract(Any23.java:434)
    at hwu.elixir.scrape.scraper.ScraperCore.getTriplesInNTriples(ScraperCore.java:245)
    at hwu.elixir.scrape.scraper.ScraperFilteredCore.scrape(ScraperFilteredCore.java:107)
    at hwu...'  (-1,-1)

There are 3 such exceptions, and there are exactly 3 separate JSON-LD scripts in the page (for types WebSite, Organization and Taxon).

If you want to reproduce, here is the logback.xml file that I'm using (java -Dlogback.configurationFile=./logback.xml -jar ...):

<configuration>

    <statusListener
        class="ch.qos.logback.core.status.NopStatusListener" />

    <timestamp key="timestamp" datePattern="yyyy-MM-dd" />

    <appender name="FILE" class="ch.qos.logback.core.FileAppender">
        <file>${timestamp}_bscFileScraper.log</file>
        <encoder>
            <pattern>
                %d{HH:mm:ss.SSS} [%level] %logger - %message%n%xException
            </pattern>
        </encoder>
    </appender>

    <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
        <layout class="ch.qos.logback.classic.PatternLayout">
            <Pattern>
                %d{HH:mm:ss.SSS} [%level] %logger - %message%n%xException
            </Pattern>
        </layout>
    </appender>

    <logger name="org" level="error" />
    <logger name="hwu.elixir.scrape" level="debug" />
    <logger name="org.apache.any23" level="debug" />

    <root level="info">
        <appender-ref ref="STDOUT" />
        <appender-ref ref="FILE" />
    </root>

</configuration>

petrospaps commented 4 years ago

Hi Franck,

First of all, I would like to thank you for all your suggestions and the bugs you have found so far.

The Any23 library is a bit tricky to work with, and unfortunately I could not find a way to load the jsonldcontext from a local copy. We have temporarily solved the problem by replacing the schema.org context reference with the direct URL of the jsonldcontext file. Please note that this change is only on the dev branch, but it would be very straightforward to apply to the master branch as well. Please see the attached file for the URL you were trying to scrape, and let me know if it looks OK to you.

Best wishes Petros

frmichel commented 4 years ago

Hi @petrospaps, I can't see the attachment, maybe you forgot it? ;)

While I was doing tests I noticed in the logs that the https://schema.org page is loaded at some point, before the page we actually want to parse. Is this what you are talking about regarding the context?

petrospaps commented 4 years ago

Hi Franck,

You should be able to get the file now.

11551.zip

At some point Any23 makes a call to schema.org to fetch the jsonldcontext.jsonld file, which it needs for the transformation to N-Quads. A few months ago schema.org had an issue with DoS attacks and stopped redirecting to that file (a temporary measure, from what I understand). So the quick solution was to change the @context value of the JSON-LD from the schema.org entry to the direct link where the file is stored.
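
As a rough illustration of that workaround (not the actual BMUSE code), the rewrite could look like the sketch below. It assumes the parsed JSON-LD is held in a plain Map and that @context is a single string; the class and method names are hypothetical, and only the context URL comes from this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: a plain Map stands in for the parsed JSON-LD object. */
final class ContextRewriteSketch {

    // Direct location of the schema.org context file, as quoted in this thread.
    static final String DIRECT_CONTEXT = "https://schema.org/docs/jsonldcontext.jsonld";

    /**
     * Returns a copy whose "@context" points at the direct context file
     * when the original value is a bare schema.org URL. Other contexts
     * (objects, arrays, other URLs) are left untouched.
     */
    static Map<String, Object> rewriteContext(Map<String, Object> jsonLd) {
        Map<String, Object> copy = new LinkedHashMap<>(jsonLd);
        Object context = copy.get("@context");
        if (context instanceof String && isSchemaOrgRoot((String) context)) {
            copy.put("@context", DIRECT_CONTEXT);
        }
        return copy;
    }

    /** Matches http(s)://schema.org with or without a trailing slash. */
    static boolean isSchemaOrgRoot(String value) {
        String v = value.trim();
        if (v.endsWith("/")) {
            v = v.substring(0, v.length() - 1);
        }
        return v.equals("https://schema.org") || v.equals("http://schema.org");
    }

    public static void main(String[] args) {
        Map<String, Object> taxon = new LinkedHashMap<>();
        taxon.put("@context", "https://schema.org/");
        taxon.put("@type", "Taxon");
        System.out.println(rewriteContext(taxon));
    }
}
```

Because the parser then fetches the context file directly, it no longer depends on the redirect from the schema.org root.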

Please let me know if you need any more information.

Best wishes Petros

frmichel commented 4 years ago

Hi Petros,

Thx for your feedback. That definitely looks good! I've just noticed a few cases where URIs are turned into strings:

<https://inpn.mnhn.fr/espece/cd_nom/60878-2> <https://schema.org/additionalType> "dwc:Taxon" .
<https://inpn.mnhn.fr/espece/cd_nom/60878-2> <https://schema.org/additionalType> "http://rs.tdwg.org/ontology/voc/TaxonConcept#TaxonConcept" .
<https://inpn.mnhn.fr/espece/cd_nom/60878-2> <https://schema.org/taxonRank> "http://taxref.mnhn.fr/lod/taxrank/Species" .

Regarding the object of additionalType, I think it should be interpreted as a URL since this is how it is defined in schema.org. The "dwc:" namespace has not been replaced although it is in the context, but this is likely related to the same problem.
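
To make the "dwc:" point concrete, here is a minimal sketch of the prefix expansion I would expect. The class name is hypothetical, and the namespace IRI used in the usage note is an assumption, not copied from the page's context.

```java
import java.util.Map;

/** Sketch only: expands compact IRIs such as "dwc:Taxon" using a prefix map. */
final class CompactIriSketch {

    /**
     * Expands "prefix:suffix" when the prefix is known. The input is returned
     * unchanged when it has no colon, when the prefix is unknown, or when the
     * colon is followed by "//" (it is then treated as an absolute IRI).
     */
    static String expand(String value, Map<String, String> prefixes) {
        int colon = value.indexOf(':');
        if (colon <= 0 || value.startsWith("//", colon + 1)) {
            return value;
        }
        String namespace = prefixes.get(value.substring(0, colon));
        return namespace == null ? value : namespace + value.substring(colon + 1);
    }
}
```

With a prefix map such as Map.of("dwc", "http://rs.tdwg.org/dwc/terms/"), calling expand("dwc:Taxon", prefixes) yields http://rs.tdwg.org/dwc/terms/Taxon, while plain strings and absolute URLs pass through untouched.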

About taxonRank, well, it is not in schema.org yet. Still, its object can be of several possible types: literal or URL. So I'm wondering whether the scraper should simply look for a usual URL scheme and infer the type from it. Does that make sense?
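
A minimal sketch of that heuristic, assuming "usual URL scheme" means http or https and using only java.net.URI; the class and method names are hypothetical and nothing here is taken from BMUSE or Any23.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Sketch only: guesses whether a JSON-LD string value should become an IRI. */
final class ObjectTermSketch {

    /** True when the value parses as an absolute http(s) URI with a host. */
    static boolean looksLikeUrl(String value) {
        try {
            URI uri = new URI(value.trim());
            String scheme = uri.getScheme();
            return scheme != null
                    && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))
                    && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false; // not parseable as a URI, so keep it as a literal
        }
    }

    /** Renders the value as an N-Triples object: <iri> or "literal". */
    static String toNTriplesObject(String value) {
        if (looksLikeUrl(value)) {
            return "<" + value.trim() + ">";
        }
        String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"")
                .replace("\n", "\\n").replace("\r", "\\r");
        return "\"" + escaped + "\"";
    }
}
```

The obvious drawback is that a value meant to be a plain string that happens to be a URL would also be turned into an IRI, so this is a guess rather than a faithful reading of the markup.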

Also, can you point me to the place where the trick happens in the dev branch?

Franck.

petrospaps commented 4 years ago

Hi Franck,

Thank you for your feedback, your suggestions make sense, but I think I will need to have further discussions and get back to you.

Please have a look at the "ScraperFilteredCore" class: it has a method called "fixASingleJSONLdObject" where the change is clearly marked (the original code is commented out and comments have been added).

Best wishes Petros

frmichel commented 4 years ago

Thx @petrospaps, I've tested the dev branch which works fine.

To go ahead with the configuration, I've started to make a few changes. As always, soon enough I ended up changing much more than I expected at the beginning. I hope you won't mind. So, before I do a pull request, I'd prefer that you review the changes to make sure these are ok with you.

Below I describe the changes. You can see that on my fork in that branch: https://github.com/frmichel/BMUSE/tree/dev_properties

I've started from your dev branch and merged in my earlier minor changes, so it should be up to date with your dev version.
I've merged configuration.properties and application.properties into a single configuration.properties.
A new singleton class, hwu.elixir.utils.ScraperProperties, first loads configuration.properties from the JAR, then overrides the config with the local file localconfig.properties (if it exists). The nice thing is that the config is now loaded once at startup and used wherever it is needed, including ChromeDriverCreator.
ScraperCore now has a member properties, so the config is loaded whichever scraper is used. For SingleURLScraper, the only property needed is chromiumDriverLocation; the other properties are unused.
FileScraper saves the config back to localconfig.properties to keep track of the counter.
To avoid the hack in ScraperFilteredCore.fixASingleJSONLdObject, I have added a property schemaContext, set to https://schema.org/docs/jsonldcontext.jsonld in configuration.properties. So now in ScraperFilteredCore you only have this: jsonObj.put("@context", properties.getSchemaContext()); If schema.org later goes back to the earlier behaviour, we can just override it to https://schema.org in localconfig.properties, which avoids recompiling.
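
The layering described above can be sketched with java.util.Properties alone. This is not the ScraperProperties class from the branch, just an illustration of the idea: the class name is hypothetical, and in-memory readers stand in for the bundled and local files.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Sketch only: bundled defaults overridden by an optional local source. */
final class LayeredConfigSketch {

    /** Loads the defaults first, then lets entries from the local source win. */
    static Properties load(Reader defaults, Reader localOverrides) {
        try {
            Properties base = new Properties();
            base.load(defaults);
            // 'base' becomes the fallback layer consulted by getProperty().
            Properties merged = new Properties(base);
            if (localOverrides != null) {
                merged.load(localOverrides);
            }
            return merged;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Properties config = load(
                new StringReader("schemaContext=https://schema.org/docs/jsonldcontext.jsonld\n"),
                new StringReader("schemaContext=https://schema.org\n"));
        System.out.println(config.getProperty("schemaContext")); // prints https://schema.org
    }
}
```

One thing to keep in mind with this layout: Properties.store() writes only the top layer, not the fallback defaults, so saving back to the local file would persist just the overridden keys.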

I've run the unit tests, which all pass. Some tests that scrape specific pages fail, but I guess this is because the generated triples are no longer exactly as expected.

Let me know what you think about all this.

Franck.

petrospaps commented 4 years ago

Hi Franck,

That is great, thank you for all your help. I have also been doing some work on bmuse, I will check everything and let you know if I find any issues with your changes.

Best wishes Petros

AlasdairGray commented 4 years ago

Hi @frmichel

> Thx for your feedback. That definitely looks good! I've just noticed a few cases where URIs are turned into strings:
>
> <https://inpn.mnhn.fr/espece/cd_nom/60878-2> <https://schema.org/additionalType> "dwc:Taxon" .
> <https://inpn.mnhn.fr/espece/cd_nom/60878-2> <https://schema.org/additionalType> "http://rs.tdwg.org/ontology/voc/TaxonConcept#TaxonConcept" .
> <https://inpn.mnhn.fr/espece/cd_nom/60878-2> <https://schema.org/taxonRank> "http://taxref.mnhn.fr/lod/taxrank/Species" .

I think the issue here is that in the original markup there is a string and a URI together

"taxonRank": [
    "http://taxref.mnhn.fr/lod/taxrank/Species", "Species"
],

@petrospaps is going to investigate this theory with a few tests

frmichel commented 4 years ago

I'm closing this one as it's mostly fixed, and I'll submit a separate one regarding the URIs being turned into strings.

HW-SWeL / BMUSE

Scraper fails to parse JSON-LD from NHM Paris page #54