HTTPArchive / almanac.httparchive.org

HTTP Archive's annual "State of the Web" report made by the web community
https://almanac.httparchive.org
Apache License 2.0

Support standard citation options #1325

Open rviscomi opened 4 years ago

rviscomi commented 4 years ago

See https://scholar.google.com/intl/en/scholar/inclusion.html#indexing I'd love to see Google Scholar automatically crawling and indexing our ebook content. Edit - this was completed in #2191

We should also add citation options at the bottom of each chapter as discussed in https://github.com/HTTPArchive/almanac.httparchive.org/issues/1325#issuecomment-836071476 and https://github.com/HTTPArchive/almanac.httparchive.org/issues/1325#issuecomment-838268787

nrllh commented 4 years ago

I think we should also configure each chapter that way, to make a better impression.

tunetheweb commented 4 years ago

@rviscomi looks like the limit for Google Scholar is 5MB:

Each file must not exceed 5MB in size. To index larger files, or to index scanned images of pages that require OCR, please upload them to Google Book Search.

Our ebook is 17MB, so it's not eligible.

I think we should also configure each chapter that way, to make a better impression.

We actually include most of this data as Structured Data in the chapters already (JavaScript example).

We could include the extra metadata too in an effort to be indexed there, but again most of the chapters are larger than the 5MB maximum. For example, the CSS chapter comes in at 52MB on a high-resolution screen once the interactive graphs have loaded! The non-interactive version with fallback images is 1.2MB, but I'm not sure whether the Scholar bot will crawl that.

rviscomi commented 4 years ago

My understanding is the 5MB limit applies to the PDF version but not the HTML version, which can be indexed and searchable in Scholar. There may be metadata that we can add to the HTML version of the ebook to make it more search friendly.

tunetheweb commented 4 years ago

Not convinced about that:

1. File formats

Your files need to be either in the HTML or in the PDF format. PDF files must have searchable text, i.e., you must be able to search for and find words in the document using Adobe Acrobat Reader.

Each file must not exceed 5MB in size. To index larger files, or to index scanned images of pages that require OCR, please upload them to Google Book Search.

Also, at the moment we explicitly stop Google from indexing the HTML ebook page, to stop it competing with the PDF version:

https://github.com/HTTPArchive/almanac.httparchive.org/blob/201b7eaf3b27e686347ffb09cb926618704ca855/src/templates/base/2019/base_ebook.html#L11

Similarly only the PDF version is in our sitemap.

Would presumably need to change both of those as part of this if we did want to proceed.

rviscomi commented 3 years ago

There is some Scholar-friendly metadata we could add: https://scholar.google.com/intl/en/scholar/inclusion.html#indexing

For example:

<meta name="citation_title" content="The testis isoform of the phosphorylase kinase catalytic subunit (PhK-T) plays a critical role in regulation of glycogen mobilization in developing lung">
<meta name="citation_author" content="Liu, Li">
<meta name="citation_author" content="Rannels, Stephen R.">
<meta name="citation_author" content="Falconieri, Mary">
<meta name="citation_author" content="Phillips, Karen S.">
<meta name="citation_author" content="Wolpert, Ellen B.">
<meta name="citation_author" content="Weaver, Timothy E.">
<meta name="citation_publication_date" content="1996/05/17">
<meta name="citation_journal_title" content="Journal of Biological Chemistry">
<meta name="citation_volume" content="271">
<meta name="citation_issue" content="20">
<meta name="citation_firstpage" content="11761">
<meta name="citation_lastpage" content="11766">
<meta name="citation_pdf_url" content="http://www.example.com/content/271/20/11761.full.pdf">

Would be great to have our own content appearing in Scholar, in addition to the citations from other research papers!

nrllh commented 3 years ago

@rviscomi, I think it would be useful to provide a recommended citation for each chapter (as text and BibTeX). See an example here. Then people know how they should cite, and all references will be uniform. Otherwise, it will be hard for Scholar to assign references to a chapter if authors reference the chapters differently. Maybe we can use this as the title in our recommendation: The {year} Web Almanac: {chapter}.

rviscomi commented 3 years ago

Really like that idea @nrllh! Adding the design label to loop in @HTTPArchive/designers to think about how to expose the citation UX.

shantsis commented 3 years ago

Is the idea of this to provide a standard MLA/latex type citation block that can be copied elsewhere?

rviscomi commented 3 years ago

@nrllh has more publishing experience and can elaborate more on his idea, but yes I think that's exactly it. For example, if I search for almanac.httparchive.org on Google Scholar I get results like this, with a button to copy a citation:

image image

nrllh commented 3 years ago

@shantsis yes, that's our goal.

shantsis commented 3 years ago

Perhaps something like this, placed above or below the author section (with whichever formats we choose):

[image: citation]

rviscomi commented 3 years ago

@shantsis nice work, I like it!

Does anyone else have any feedback or suggestions? If not we can pass this to the dev team for implementation.

nrllh commented 3 years ago

Here is my suggestion:

BibTeX - based on this template:

@techreport{ {author1_lastname}.Almanac.{year},
  author      = "{author1_lastname, author1_firstname} { and author2_lastname, author2_firstname} { and author3_lastname, author3_firstname}",
  title       = "The {year} Web Almanac: {chapter}",
  institution = "HTTPArchive",
  year        = "{year}",
  note        = "Available as \url{{url}}"
}

The output for security chapter 2020 will be then:

@techreport{VanGoethem.Almanac.2020,
  author      = "Van Goethem, Tom and Demir, Nurullah and Pollard, Barry",
  title       = "The 2020 Web Almanac: Security",
  institution = "HTTPArchive",
  year        = "2020",  
  note        = "Available as \url{https://almanac.httparchive.org/en/2020/security}"
}
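As a minimal sketch of how the BibTeX template above could be filled in automatically, the Python function below builds the entry from a chapter's authors, title, year and URL. The function name `bibtex_entry` and the `(last_name, first_name)` tuple layout are assumptions made for illustration; they are not taken from the Almanac codebase.

```python
def bibtex_entry(authors, chapter, year, url):
    """Render a @techreport entry following the template in this thread.

    authors: list of (last_name, first_name) tuples, in citation order
             (a hypothetical input shape chosen for this sketch).
    """
    # Citation key: first author's last name with spaces removed,
    # e.g. "Van Goethem" -> "VanGoethem".
    key = f"{authors[0][0].replace(' ', '')}.Almanac.{year}"
    author_field = " and ".join(f"{last}, {first}" for last, first in authors)
    fields = [
        ("author", author_field),
        ("title", f"The {year} Web Almanac: {chapter}"),
        ("institution", "HTTPArchive"),
        ("year", str(year)),
        ("note", f"Available as \\url{{{url}}}"),
    ]
    # Pad field names so the "=" signs line up, as in the example output.
    width = max(len(name) for name, _ in fields)
    body = ",\n".join(
        f'  {name.ljust(width)} = "{value}"' for name, value in fields
    )
    return f"@techreport{{{key},\n{body}\n}}"


if __name__ == "__main__":
    print(bibtex_entry(
        [("Van Goethem", "Tom"), ("Demir", "Nurullah"), ("Pollard", "Barry")],
        "Security",
        2020,
        "https://almanac.httparchive.org/en/2020/security",
    ))
```

Joining the fields with `",\n"` puts a comma after every field except the last, which avoids the kind of missing-separator slip that is easy to make when writing entries by hand.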

MLA - based on this template:

{author1_lastname}, {author1_firstname}. The {year} Web Almanac: {chapter}, HTTPArchive, {year}, {url}.

The output for security chapter 2020 will be then:

Van Goethem, Tom. The 2020 Web Almanac: Security, HTTPArchive, 2020, https://almanac.httparchive.org/en/2020/security.

IEEE - based on this template:

{author1_firstname}[0]. {author1_lastname}, {author2_firstname}[0]. {author2_lastname}, The {year} Web Almanac: {chapter}, HTTPArchive, {year}. [Online]. {url}, Accessed on: {date_today}.

The output for security chapter 2020 will be then:

T. Van Goethem, N. Demir, B. Pollard, The 2020 Web Almanac: Security, HTTPArchive, 2020. [Online]. https://almanac.httparchive.org/en/2020/security, Accessed on: 11.05.2021.
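The MLA and IEEE templates can be sketched as two small Python formatters. This is only an illustration under a few assumptions: the function names and the `(last_name, first_name)` tuple layout are invented for the sketch, the IEEE version lists every author (the template only spells out two), and the access date uses the `dd.mm.yyyy` form seen in the example in this thread.

```python
from datetime import date


def mla_citation(authors, chapter, year, url):
    """MLA-style line per the template: only the first author is named."""
    last, first = authors[0]
    title = f"The {year} Web Almanac: {chapter}"
    return f"{last}, {first}. {title}, HTTPArchive, {year}, {url}."


def ieee_citation(authors, chapter, year, url, accessed=None):
    """IEEE-style line per the template: first-name initial for each author."""
    # Default to today's date when no access date is supplied.
    accessed = accessed or date.today()
    names = ", ".join(f"{first[0]}. {last}" for last, first in authors)
    title = f"The {year} Web Almanac: {chapter}"
    return (
        f"{names}, {title}, HTTPArchive, {year}. [Online]. "
        f"{url}, Accessed on: {accessed.strftime('%d.%m.%Y')}."
    )


if __name__ == "__main__":
    authors = [("Van Goethem", "Tom"), ("Demir", "Nurullah"), ("Pollard", "Barry")]
    url = "https://almanac.httparchive.org/en/2020/security"
    print(mla_citation(authors, "Security", 2020, url))
    print(ieee_citation(authors, "Security", 2020, url, date(2021, 5, 11)))
```

Passing the access date explicitly keeps the output reproducible; a page rendered on the server would more likely fill it in at request time or leave it for the reader to complete.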

I think APA style is irrelevant for us (see here).

tunetheweb commented 3 years ago

Nice. Any thoughts on where it should go?

After the Conclusion, before Explore the results? Right at the bottom, just before the footer?

nrllh commented 3 years ago

After the Conclusion, before Explore the results?

I think this is a good place

shantsis commented 3 years ago

Yup or right below that and above the author. Either works :)

tunetheweb commented 3 years ago

OK seems like we have the agreed approach and the design. So I've changed the title of the issue and updated the first comment.

@HTTPArchive/developers anyone want to take this one?

VictorLeP commented 2 years ago

tunetheweb commented 2 years ago

Some of them have started to show in skeleton form - but not sure if that's because of #2191 or because they were already being cited (interestingly one shows in a translated form - which suggests it's probably the latter):

image

They do say:

Keep in mind that changes that you make on your website will usually not be reflected in Google Scholar search results for some time. New papers are normally added several times a week; however, updates of papers that are already included usually take 6-9 months. Updates of papers on very large websites may take several years, because to update a site, we need to recrawl it - the time it takes to recrawl a large site is usually limited by the speed at which the target website is able to deliver content to the search robots.

Will be interesting to see if the 2021 chapters are indexed quicker since they have these. Or maybe they're just not deemed a good fit for Google Scholar (despite being cited by several of the other papers). 🤷 Either way I still think it would be good to have the human-readable citation options at the bottom of the page, as we have been asked once or twice about how to cite this officially.

VictorLeP commented 2 years ago

Some of them have started to show in skeleton form - but not sure if that's because of #2191 or because they were already being cited (interestingly one shows in a translated form - which suggests it's probably the latter):

Pretty sure it's the latter: the metadata has "The 2019 Web Almanac: JavaScript" as its title, not what is shown in the image. I'm pretty sure Google Scholar creates [CITATION] entries for resources that are referenced in academic works but that it fails to match to any known item (also for newspaper articles, for example).

Will be interesting to see if the 2021 chapters are indexed quicker since they have these.

Indeed! Though having just looked at the Privacy chapter, the publication date is weird: 2021/05/02?

Or maybe they're just not deemed a good fit for Google Scholar (despite being cited by several of the other papers).

That Google Scholar guide also mentions:

these fields must contain sufficient information to identify a reference to this paper from another document

So it might fail because it does not think there is enough information...

Either way I still think it would be good to have the human-readable citation options at the bottom of the page, as we have been asked once or twice about how to cite this officially.

Certainly, we might cite it as well at some point :)

VictorLeP commented 2 years ago

Update from the university: since this is targeted towards a non-academic audience, they think "Scientific outreach" is the best category, so they don't consider it a book or journal. (maybe they'd be happier if we had an ISBN/ISSN)

Google Scholar also doesn't appear to have picked up the 2021 chapters (yet).

nrllh commented 2 years ago

@VictorLeP I think Google Scholar will not index it either; if we get a DOI or ISBN, that would be perfect. Our 2020 version was published in Google Books[1] (and Play Store) but it still doesn't appear in Scholar.

[1] https://www.google.de/books/edition/The_2020_Web_Almanac/wqcPEAAAQBAJ?hl=de&gbpv=0

VictorLeP commented 2 years ago

An ISBN seems to cost $125 (in the US); a DOI can be derived from an ISBN.

There seem to be a number of ways to get only a DOI, possibly for free. It seems you usually do have to upload some file. One provider missing in those posts is OSF, which provides DOIs and has an option to "soft redirect" to a link (that is, you get a pop-up).

It might actually be nice if we could get one DOI per chapter instead of one for the Almanac as a whole.

tunetheweb commented 2 years ago

Also need to remember the translations. So at $125 per language, per chapter, per year that could add up! Though you can often buy them in bulk much cheaper. We discussed getting ISBNs here: https://github.com/HTTPArchive/almanac.httparchive.org/issues/1219

I'm not sure we need to get into Google Scholar. It's a nice-to-have since we are cited in so many articles in there already, and it's potentially another way of making the content available to those that might not otherwise find it. But other than that I'm not desperate to invest in an ISBN or DOI just to get cited in there.

However I do think it would be good to tell people how to cite our articles with the above suggested addition to our web pages, since we are cited a lot and we have been asked the question before.

VictorLeP commented 2 years ago

I don't think you need to/can get an ISBN per chapter, but it would still be X years times Y languages so it could indeed get expensive fast.

A standard way to cite might actually be sufficient. As I mentioned, Google Scholar picks up on these citations, so it might be an indirect way to get indexed there. I think that my submission of the chapter metadata to the KU Leuven repository will also trigger a Google Scholar entry (albeit only for the Privacy chapter).

In terms of the citation itself, I don't really see why we couldn't go for an actual (book) chapter, for example with this (BibLaTeX!) template:

@inbook{ WebAlmanac.{year}.{chapter_number},
  author = "{author1_lastname, author1_firstname} 
       { and author2_lastname, author2_firstname} 
       { and author3_lastname, author3_firstname}",
  title = "{chapter}",
  booktitle = "{year} Web Almanac",
  chapter = {chapter_number},
  pages = "{ebook_pages}",
  publisher = "HTTP Archive",
  year = "{year}",
  url = "{url}"
}

tunetheweb commented 2 years ago

BTW we have this metadata in the chapters already:

    <meta name="citation_title" content="The 2021 Web Almanac: Privacy">
    <meta name="citation_author" content="Yana Dimova">
    <meta name="citation_author" content="Victor Le Pochat">
    <meta name="citation_publication_date" content="2021/11/17">
    <meta name="citation_journal_title" content="The 2021 Web Almanac">
    <meta name="citation_volume" content="3">
    <meta name="citation_issue" content="11">
    <meta name="citation_publisher" content="HTTP Archive">
    <meta name="citation_technical_report_institution" content="HTTP Archive">
    <meta name="citation_language" content="English">
    <meta name="citation_fulltext_html_url" content="https://almanac.httparchive.org/en/2021/privacy">

This was added in May this year.
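For illustration, here is a rough Python sketch of how `citation_*` tags like the ones above might be generated from chapter data. It is not the Almanac's actual template code: the function name `scholar_meta_tags` and its parameters are hypothetical, and it emits only a subset of the tags shown above.

```python
from html import escape


def scholar_meta_tags(title, authors, published, journal, url,
                      publisher="HTTP Archive"):
    """Build Google Scholar citation_* <meta> tags for one chapter.

    authors:   list of display names; each gets its own citation_author tag.
    published: date string in YYYY/MM/DD form, as in the markup above.
    """
    pairs = [("citation_title", title)]
    pairs += [("citation_author", name) for name in authors]
    pairs += [
        ("citation_publication_date", published),
        ("citation_journal_title", journal),
        ("citation_publisher", publisher),
        ("citation_fulltext_html_url", url),
    ]
    # Escape values so quotes or ampersands cannot break the attribute.
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in pairs
    )


if __name__ == "__main__":
    print(scholar_meta_tags(
        "The 2021 Web Almanac: Privacy",
        ["Yana Dimova", "Victor Le Pochat"],
        "2021/11/17",
        "The 2021 Web Almanac",
        "https://almanac.httparchive.org/en/2021/privacy",
    ))
```

Emitting one `citation_author` tag per author, in order, mirrors both the Google Scholar example quoted earlier in the thread and the chapter markup above.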

And we've had this JSON-LD metadata in there too since the original 2019 launch:

    {
      "@context": "http://schema.org",
      "@type": "Article",
      "mainEntityOfPage": {
          "@type": "WebPage",
          "@id": "https://almanac.httparchive.org/en/2021/privacy"
      },
      "headline": "Privacy | 2021 | The Web Almanac by HTTP Archive",
      "image": {
          "@type": "ImageObject",
          "url": "https://almanac.httparchive.org/static/images/2020/privacy/hero_lg.jpg",
          "height": 433,
          "width": 866
      },
      "publisher": {
          "@type": "Organization",
          "name": "HTTP Archive",
          "logo": {
              "@type": "ImageObject",
              "url": "https://almanac.httparchive.org/static/images/ha.png",
              "height": 160,
              "width": 320
          },
          "sameAs": [
              "https://httparchive.org",
              "https://twitter.com/HTTPArchive",
              "https://github.com/HTTPArchive"
          ]
      },
      "author": [
          {
              "@type": "Person",
              "sameAs": [
                  "https://almanac.httparchive.org/en/2021/contributors#ydimova",
                  "https://github.com/ydimova"
              ],
              "name": "Yana Dimova"
          },
          {
              "@type": "Person",
              "sameAs": [
                  "https://almanac.httparchive.org/en/2021/contributors#victorlep",
                  "https://twitter.com/VictorLePochat",
                  "https://github.com/VictorLeP",
                  "https://www.linkedin.com/in/victor-le-pochat/"
              ],
              "name": "Victor Le Pochat"
          }
      ],
      "description": "Privacy chapter of the 2021 Web Almanac covering adoption and impact of online tracking, privacy preference signals and browser initiatives for a privacy-friendlier web.",
      "datePublished": "2021-11-17T00:00:00.000Z",
      "dateModified": "2021-12-04T00:00:00.000Z"
    }

thibaudcolas commented 2 years ago

If all we want is a DOI, I was recommended Zenodo. It’s a CERN project, completely free, allows 50GB per upload. Takes about 2min to upload one PDF with minimal metadata, longer if we fill in a lot of details.

Here is an upload I made of three pages from this year’s accessibility chapter, on their sandbox server: https://sandbox.zenodo.org/record/1112032. Those three pages got a pretend DOI of 10.5072/zenodo.1112032. As far as I understand, even on the real server, the DOI is generated as soon as you hit "publish" and confirm.