Langenscheiss / bibitnow

Site adjustors for browser plugin "bibitnow"

Feature request #2

Closed paxperscientiam closed 4 years ago

paxperscientiam commented 6 years ago

Please consider augmenting bibitnow to support parsing schema.org objects.

For example, from washingtonpost.com:

{
      "@context":"http://schema.org",
      "@type":"ReportageNewsArticle",
      "mainEntityOfPage":{
        "@type":"WebPage",
        "@id":"https://www.washingtonpost.com/news/energy-environment/wp/2018/01/29/its-been-a-rough-year-for-interior-secretary-ryan-zinke-and-its-still-january/"
      },
      "headline":"It’s been a rough year for Interior Secretary Ryan Zinke — and it’s still January",
      "description":"Zinke faces anger from governors, including many Republicans, over proposals to allow more drilling on land and at sea.",
      "image":["https://www.washingtonpost.com/rf/image_1484w/2010-2019/WashingtonPost/2018/01/24/Production/Daily/A-Section/Images/Botsford171016Trump21065.JPG?t=20170517"],
      "datePublished":"2018-01-29T16:09:55.000Z",
      "dateModified":"2018-01-29T19:15:13.000Z",
      "isAccessibleForFree":"False",
      "hasPart":{
        "@type":"WebPageElement",
        "isAccessibleForFree":"False",
        "cssSelector":".paywall"
      },
      "publisher":{
        "@type":"NewsMediaOrganization",
        "name":"The Washington Post",
        "ethicsPolicy":"https://www.washingtonpost.com/policies-and-standards/",
        "masthead":"https://www.washingtonpost.com/policies-and-standards/masthead/",
        "missionCoveragePrioritiesPolicy":"https://www.washingtonpost.com/policies-and-standards/",
        "diversityPolicy":"https://www.washingtonpost.com/policies-and-standards/",
        "correctionsPolicy":"https://www.washingtonpost.com/policies-and-standards/",
        "verificationFactCheckingPolicy":"https://www.washingtonpost.com/policies-and-standards/",
        "unnamedSourcesPolicy":"https://www.washingtonpost.com/policies-and-standards/",
        "actionableFeedbackPolicy":"https://www.washingtonpost.com/policies-and-standards/",
        "foundingDate":"1877-12-06",
        "ownershipFundingGrants":"https://www.washingtonpost.com/policies-and-standards/",
        "diversityStaffingReport":"http://asne.org/newsroom_diversitysurvey",
        "refLocalNationalRequirements":null,
        "logo":{
          "@type":"ImageObject",
          "url":"https://www.washingtonpost.com/pb/resources/img/thewashingtonpost-black-400x60.png"
        }
      }
    }

It won't always have everything needed for a complete citation, but it is a standard that could be leveraged.

Thanks!

Langenscheiss commented 6 years ago

The data is provided in JSON format, so you can wrap JSON.parse in a try-catch in the preformatter (do not apply eval afterwards, of course). To read this info from script tags, you may want to use "textContent" instead of "innerText" as the query property (I will document this; I am currently writing a how-to guide). Once the data is available as a JSON object, copying the info is straightforward.
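
A minimal sketch of the try-catch idea, assuming the raw string has already been read from the script tag's textContent. The helper name `parseJsonLd` is hypothetical and not part of the plugin:

```javascript
// Hypothetical helper (not part of bibitnow): safely parse a JSON-LD string.
// Returns the parsed object, or null if the string is not usable.
function parseJsonLd(rawText) {
  try {
    const parsed = JSON.parse(rawText);
    // Only objects (including arrays) are useful as structured data here.
    return parsed !== null && typeof parsed === "object" ? parsed : null;
  } catch (err) {
    // Malformed JSON: report failure instead of falling back to eval.
    return null;
  }
}

const article = parseJsonLd('{"@type":"NewsArticle","headline":"Example"}');
console.log(article ? article.headline : "no data");
console.log(parseJsonLd("{not valid json"));
```

Returning null on failure lets the caller fall back to the other data sources without having to handle exceptions itself.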

Of course, it might be useful to automate this procedure on several levels. For common "schema.org" schemes, one could introduce a fixed kernel that links the properties to bibfields. In those cases, it would be enough to add something like "citation_json" to the preferred selectors and let the plugin do the rest. The system could then in any case parse the data into a JSON object and, if the scheme is recognized, apply it automatically. If necessary, the (raw data) preformatter could still alter the JSON object to make it compatible with a recognized scheme. Moreover, in the same fashion as with static vs. dynamic data, one would want to decide whether to prefer the static data from the website source or the data from the parsed JSON.

This will, however, not happen for 0.80. After improving my GitHub, I will improve the URL matching and switch from the old XML system (the plugin started as something else) to a simple JSON object and regular expressions. This should improve support for site adjustors even behind proxies that rename the URL (not an uncommon situation for scientists working from home through a university library proxy).

paxperscientiam commented 6 years ago

With regard to switching to JSON, YAML might be better given its greater flexibility. That said, I find typesetting it a nuisance.

Regardless, this switch, I believe, will help improve manageability of the codebase.

Langenscheiss commented 6 years ago

Apart from manageability, better use of regex should minimize the chance of false positives, which is so far only a theoretical problem, yet still a serious issue with the current version.

For the moment, I will leave it as JSON, but I guess I can reconsider, as long as the URL list is parsed to a JSON object. In principle, the system using the URL list on GitHub is already working, but real-life work has kept me a bit busy during the week, so I didn't have time to test all site adjustors. Once this is finished, I will implement a small option to tell the plugin that you are behind a URL-modifying proxy (basically just removing the user-defined proxy part from the matched URL), and then push out 0.80 (hopefully at the weekend).
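
As a rough illustration of removing a user-defined proxy part before matching, here is a sketch that assumes the proxy appends a suffix to the hostname. The function name `stripProxy` and the example proxy suffix are hypothetical; proxies that rewrite the hostname in other ways (for example by replacing dots with hyphens) would need different handling:

```javascript
// Hypothetical helper (not part of bibitnow): remove a user-defined proxy
// suffix from the hostname so the URL can be matched against the site list.
function stripProxy(urlString, proxySuffix) {
  if (!proxySuffix) return urlString; // no proxy configured
  const url = new URL(urlString);
  if (url.hostname.endsWith(proxySuffix)) {
    // Keep everything in front of the proxy suffix as the real hostname.
    url.hostname = url.hostname.slice(0, -proxySuffix.length);
  }
  return url.toString();
}

// Example with a made-up proxy suffix.
console.log(
  stripProxy("https://www.example.com.proxy.uni.edu/article?id=1", ".proxy.uni.edu")
);
```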

Langenscheiss commented 6 years ago

Ok, I have the first successful test with a system that reads JSON-LD. The idea is pretty similar to adjusters: there is a JSON file (at the moment just one, but as more schemata are added, it will probably be split up) that defines how bibfields are filled with the data if a certain schema is found. At the moment, I have only implemented the "newsarticle" schema used on the Washington Post.

So, for contributors, there are two cases:

1.) Schema is known: you only need to provide the JSON data by specifying a prefselector for "citation_json". The plugin then does the rest.

2.) Schema is not known: first define how the properties of the specific schema should be linked to bibfields, then provide the source of the JSON data. I will try to shift my documentation efforts to explaining how this is actually done. Most of the time, not much more than what is done for the "newsarticle" schema will be necessary, but the system does come with certain options and restrictions (similar to prefselectors). Also, the system is supposed to have a default behavior when dealing with base schemata such as Person, Organization, value, etc.
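
To make the idea of linking schema properties to bibfields more concrete, here is a minimal sketch. The mapping table, the helper names, and all bibfield names except citation_authors are assumptions for illustration; the plugin's actual schemata file may look quite different:

```javascript
// Hypothetical mapping table; the plugin's real schemata file may differ.
// Each bibfield lists the schema properties to try, in order.
const SCHEMA_MAP = {
  NewsArticle: {
    citation_title: ["headline"],
    citation_authors: ["author", "authors"],
    citation_date: ["datePublished", "dateCreated"],
    citation_publisher: ["publisher"],
  },
};

// Default handling for base schemata: reduce Person/Organization objects
// to their "name", join arrays, and turn everything else into a string.
function toText(value) {
  if (Array.isArray(value)) return value.map(toText).filter(Boolean).join("; ");
  if (value !== null && typeof value === "object") return toText(value.name);
  return value === undefined || value === null ? "" : String(value);
}

// Fill bibfields from a parsed JSON-LD object if its @type is known.
function mapToBibfields(jsonLd) {
  const type = jsonLd["@type"];
  if (!Object.prototype.hasOwnProperty.call(SCHEMA_MAP, type)) {
    return null; // unknown schema: a mapping has to be defined first
  }
  const bibfields = {};
  for (const [bibfield, properties] of Object.entries(SCHEMA_MAP[type])) {
    for (const property of properties) {
      const text = toText(jsonLd[property]);
      if (text) {
        bibfields[bibfield] = text;
        break; // the first non-empty property wins
      }
    }
  }
  return bibfields;
}

console.log(
  mapToBibfields({
    "@type": "NewsArticle",
    headline: "Example headline",
    publisher: { "@type": "Organization", name: "Example Post" },
  })
);
```

With a table like this, case 1 needs no further work from a contributor, while case 2 amounts to adding one more entry to the table.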

The order of preference when merging all the data prior to parsing is the following: static data < static JSON-LD data < dynamically downloaded data
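
This precedence can be sketched as a simple ordered merge. The function name `mergeBibfields` and the bibfield names in the example are hypothetical, and the rule that empty values never override existing ones is an assumption of this sketch, not something confirmed for the plugin:

```javascript
// Hypothetical helper (not part of bibitnow): merge bibfield sources in
// ascending priority. Later sources override earlier ones, but (by
// assumption) only where they provide a non-empty value.
function mergeBibfields(staticData, staticJsonLd, dynamicData) {
  const merged = {};
  for (const source of [staticData, staticJsonLd, dynamicData]) {
    for (const [field, value] of Object.entries(source || {})) {
      if (value !== undefined && value !== null && value !== "") {
        merged[field] = value;
      }
    }
  }
  return merged;
}

console.log(
  mergeBibfields(
    { citation_title: "Title from meta tags", citation_date: "2018-01-29" },
    { citation_title: "Title from JSON-LD", citation_authors: "A. Author" },
    { citation_date: "" } // empty dynamic value does not override
  )
);
```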

In the raw preformatting stage, the JSON source string, prior to being parsed, is available in citation_json and may be modified. In the preformatting stage, citation_json contains all the data extracted from the JSON-LD, already assigned to the bibfields.

The new files on GitHub show, in the case of washingtonpost, what has changed in the adjuster files. It's actually not much. With 0.81, I will ship the first iteration of the full system.

EDIT: Wow, adding indiatimes was impossible before, but now it only requires one line in the prefselector (see new files). Really helpful feature! I am considering adding it to the fixed extraction kernel, so that the prefselector and preformatter are no longer explicitly necessary.

Langenscheiss commented 6 years ago

As of version 0.82, schema.org-compatible data provided in JSON-LD script tags is processed automatically, and some news websites indeed benefit from this. One may use the "citation_json" bibfield prefselector to alter the source, but so far the data has to come in JSON format. Schemata can be linked to the bibfields via the schemata.json file.

I now have time to turn back to the documentation, after weeks of being busy and sick (the flu was terrible this year).

paxperscientiam commented 6 years ago

Hey there @Langenscheiss !

So, I'm just diving in again. It seems that auto-importing the schema is not working, at least not in the zipped-up dev version.

The specific example that's failing is with The Intercept. They are using the type 'NewsArticle'; however, bibitnow gets the wrong publication date.

What do you think is wrong?

EDIT: Looks like it is also not pulling the authors.

Langenscheiss commented 6 years ago

Hej.

Date: Not sure what you mean. It is March 23rd according to the article, the JSON metadata, and bibitnow (at least here). Author: Yes, you are right here. One needs to add { "baseProperty": ["authors"] } to the schemata.js, see new commit. Since I haven't had time so far, I could not document the feature yet.

paxperscientiam commented 6 years ago

Hm, I must have been using an old version or something; that or I just misread D:

As for the authors and such, I was under the impression that schema.org data would be imported by default for the NewsArticle type.

As you can see, the Intercept's schema includes authors:

<script type="application/ld+json">
          {"@context":"http://schema.org","@type":"NewsArticle","articleSection":"Articles","authors":["Kate Aronoff"],"dateCreated":"2018-03-23T13:00:35+00:00","headline":"Climate Change Policy Is Proving Difficult To Enact Even in Liberal States with Democratic Control","keywords":["Language: English","Day: Friday","Time: 9.00","Very Long","Partner: Medium","Partner: Smart News","Partner: Spoken Layer","Partner: Uproxx"],"thumbnailUrl":"https://theintercept.imgix.net/wp-uploads/sites/1/2017/05/lamar-smith-climate-change-denier-voters-1495136715.jpg?auto=compress%2Cformat&q=90&fit=crop&w=1200&h=800","url":"https://theintercept.com/2018/03/23/climate-change-washington-state/"}
</script>

Langenscheiss commented 6 years ago

It is imported by default. It's just that I only told the system to link "author" instead of "authors" to the corresponding citation_authors bibfield. Will be fixed in the next version.

paxperscientiam commented 6 years ago

Awesome!

Hey, is there anything in particular you'd like me to work on or take a look at re: development?

EDIT: Just tested out your fix -- works nicely!

Langenscheiss commented 6 years ago

Thanks, appreciated!

I think you are already quite helpful in doing what you are doing. Maybe you can find more popular/good/informative websites which need different schemata, or other adjusters that you/I can implement. As I said, the system is there, but it is a bit bare-bones at the moment.

My next task is to continue the documentation and to work on the "custom redirection link" feature, which I can now implement more easily with the format string system in place. The idea is that the user can define a pattern for how the link at the bottom of the extension popup is generated from the citation data. Maybe you have a good idea for such a redirection scheme, such as "search first author on Google" or something similar?

Langenscheiss commented 6 years ago

The feature should work on more websites now with 0.840. It scans for multiple ld+json script tags and picks the first one with a valid schema. This avoids querying data from dummy script tags that appear before the relevant one. A quick check of US news websites (the ones I can reach from Europe now with GDPR in place) via Google News shows that roughly 40-50% of them work without an adjuster.
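
A minimal sketch of this scanning behavior, written against plain strings so it runs outside the browser. The function name `pickFirstValidSchema` and the list of known types are assumptions for illustration, not the plugin's actual code:

```javascript
// Schema types treated as "valid" in this sketch (an assumption).
const KNOWN_TYPES = new Set(["NewsArticle", "ReportageNewsArticle"]);

// Hypothetical helper: given the text of every ld+json script tag in
// document order, return the first parsed object with a known @type.
function pickFirstValidSchema(scriptTexts) {
  for (const text of scriptTexts) {
    let candidate;
    try {
      candidate = JSON.parse(text);
    } catch (err) {
      continue; // dummy or malformed script tag: try the next one
    }
    if (
      candidate !== null &&
      typeof candidate === "object" &&
      KNOWN_TYPES.has(candidate["@type"])
    ) {
      return candidate;
    }
  }
  return null; // nothing usable found
}

// In a browser, the input could be collected like this:
// const scriptTexts = Array.from(
//   document.querySelectorAll('script[type="application/ld+json"]'),
//   (el) => el.textContent
// );
const found = pickFirstValidSchema([
  "",
  '{"@type":"WebSite","name":"Example"}',
  '{"@type":"NewsArticle","headline":"Relevant article"}',
]);
console.log(found ? found.headline : "no valid schema");
```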