abstract naming is misleading and not what is meant in scholarly communication

castedo commented 9 months ago

Based on limited evidence, I'm jumping to the conclusion that the abstract Rogue Scholar is generating is a copy of the beginning of the main body text. Please correct me if I am wrong.

For scholarly communication, what is desired is that the text named "Abstract" is not just a copy of the first part of the document's main body. Wikipedia probably does a better job than I at describing what is meant by an abstract: https://en.wikipedia.org/wiki/Abstract_(summary)

Just like a title is not a copy of the first sentence of a text, an abstract is not just a copy of the first few sentences of a text.

The most concrete issue here is that Rogue Scholar is redundantly filling in the abstract section in JATS and Markdown as the first few lines of the body of the text. This generates PDFs and content where the reader is presented with a (fake) abstract and then re-presented with the EXACT same text again when they start reading the body of the article/post. I think most readers will find this annoying. I have received feedback from a reader who was irritated that my intro in the main body was just a rephrasing of the abstract. I believe most scholarly readers expect to be able to read the abstract and then not have that text repeated. Abstracts that are mere copies of the first sentences are poor abstracts.

I don't claim to know the right solution, but three come to mind:

1. ignore this issue and keep the current behavior :laughing:
2. just don't include an abstract in JATS and Markdown metadata
3. make the abstract a structured abstract (e.g., https://www.nlm.nih.gov/bsd/policy/structured_abstracts.html) with a subheading like "COPY OF START OF TEXT:\n" or something like that.

As it stands right now, it just looks really cheesy. It's like someone is trying to check an "I-have-an-abstract" box without an author actually making an effort to write a helpful abstract for readers.

I would be inclined to go with solution 2 and not autogenerate abstracts that do not serve the function of abstracts.

mfenner commented 9 months ago

@castedo I think that short summaries of science blog posts are important, the main use cases being indexes such as Rogue Scholar and posts on social media or newsletters.

Rogue Scholar tries to fetch the specific abstract when it is available and the platform supports it (WordPress, Substack and Ghost do). But not many people use that functionality, and the fallback is to use the beginning of the post. The Rogue Scholar API calls this metadata "summary", but Pandoc and Crossref prefer the term abstract. DataCite has a descriptionType with a controlled vocabulary that does not include summary.
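
A rough sketch of that fallback, for readers following along. This is not the code in `api/posts.py`; `fallback_summary` and `max_length` are hypothetical names, and the 500-character default only mirrors the limit mentioned later in this thread.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes and drop all tags (script/style text is not skipped)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def fallback_summary(html, max_length=500):
    """Hypothetical helper: plain-text summary taken from the start of a post."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse the runs of whitespace left behind by removed tags.
    text = " ".join("".join(parser.parts).split())
    if len(text) <= max_length:
        return text
    # Cut at the last word boundary that fits, leaving room for the ellipsis.
    cut = text[: max_length - 3].rsplit(" ", 1)[0]
    return cut.rstrip(" ,;:") + "..."
```

A summary built this way is by construction a copy of the opening of the body, which is the duplication the issue is about.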

My suggestion is the following:

* Address issues with summary generation. The Python source code is at https://github.com/front-matter/rogue-scholar-api/blob/main/api/posts.py, and I am currently not always using the specific summary metadata because of other issues, e.g. HTML formatting and length of summary.
* Generate a new abstract metadata property in the Rogue Scholar API that is only used when the abstract is different from the beginning of the blog post.
  * Structured abstracts are a special case of this new abstract property.
* Go with solution 1 because of the use cases I mention (where the fulltext is not included). Abstracts are important for Crossref metadata (see https://i4oa.org/), so I would continue to use them there.
* For outputs that use both the summary and fulltext (markdown, PDF, ePub, JATS), think about whether the summary can be omitted.

Commonmeta is the metadata schema that Rogue Scholar uses. It allows multiple descriptions, each with a type that can be abstract, summary, or description. A related use case that I have started to think about is descriptions in multiple languages, e.g. adding English-language abstracts to non-English blog posts.

castedo commented 9 months ago

I'm not totally confident I'm following all the details of your suggestion. But I think what I'm understanding makes sense to me.

The key detail you mention is that some outputs include the fulltext and others do not. A traditional abstract is included with full-text output whereas this other type of abstract-like summary is not.

There is one bit (boolean) of information here which the Rogue Scholar API has but JATS and Markdown flavors do not. This bit means roughly: "This short text summary/abstract is an auto-generated format-simplified truncated derivative of the beginning of an author's writing that does not include a traditional abstract".

Solution 1 does not seem OK for the PDF generated by Rogue Scholar. Rogue Scholar generates the PDF so it should use this one bit of information to be smart about including or not including a summary/abstract along with the fulltext.
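
A minimal sketch of that decision, assuming a hypothetical `Post` record in which `abstract` is only set when an author actually wrote one. The class, field and function names are illustrative and are not the Rogue Scholar API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Post:
    body: str
    summary: str                    # always present; may be auto-generated
    abstract: Optional[str] = None  # only set when an author wrote one


def front_matter_for(post, includes_fulltext):
    """Return the short text to show ahead of (or instead of) the body."""
    if includes_fulltext:
        # Avoid repeating the opening of the body: only a real abstract qualifies.
        return post.abstract
    # Index pages, social posts, newsletters: any short text is better than none.
    return post.abstract or post.summary
```

Here the "one bit" is simply whether `abstract` is `None`; an explicit boolean flag would work equally well.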

In the case of Rogue Scholar generated JATS and Markdown I guess it depends on how those get used in real-world applications. My impression is that currently zero real-world applications that ingest JATS will omit the abstract when rendering output with fulltext. This is also my impression with pandoc ingested Markdown. The abstract will be rendered along with the fulltext. So to better align with real-world applications, it makes sense to me to NOT include the auto-generated summaries as an abstract in Markdown and JATS. These auto-generated summaries do not semantically mean what abstract means in real-world Markdown and JATS.

Outside the semantic context of present day JATS and Markdown, and inside the semantic context of Rogue Scholar API metadata, this one bit of information and the auto-summarization sound like a valuable feature and idea, and very useful for readers of outputs that do not include fulltext. Whether it is called "abstract" or not isn't that important, but I definitely vote for NOT using the word "abstract", though I admit I'm being a bit pedantic. :sweat_smile:

mfenner commented 9 months ago

@castedo trying to address the issues you raised, I made the following changes:

* Full-text formats (PDF, ePub, JATS) no longer include the summary, as it is a duplication of the beginning of the body text. I kept the summary in the markdown output, because that is the canonical format for conversions and the summary is useful, and some work went into generating it (sanitizing HTML, limiting length to 500 characters).
* The Rogue Scholar backend now has a new metadata field for posts: abstract.
* The API compares the excerpt provided by some blogging platforms with the beginning of the post (using a Levenshtein algorithm), and stores the excerpt as abstract if it is clearly different from the beginning of the post.
* This abstract is used in full-text formats. An example is https://api.rogue-scholar.org/posts/10.59350/r9pph-19985.md (markdown) or https://api.rogue-scholar.org/posts/10.59350/r9pph-19985.pdf (PDF). As the markdown has both summary and abstract, you see the clear difference.
* Rogue Scholar will start encouraging bloggers to write abstracts instead of relying on auto-generated summaries.
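
For illustration, the excerpt comparison could look roughly like the sketch below. It is not the actual implementation: `is_distinct_abstract`, the whitespace normalization and the 0.75 threshold are all assumptions, and only the use of a Levenshtein distance comes from the description above.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def is_distinct_abstract(excerpt, body, threshold=0.75):
    """Hypothetical check: True when the excerpt is not just the start of the body."""
    excerpt = " ".join(excerpt.split())
    if not excerpt:
        return False
    # Compare against a prefix of the body with the same length as the excerpt.
    start = " ".join(body.split())[: len(excerpt)]
    distance = levenshtein(excerpt, start)
    similarity = 1 - distance / max(len(excerpt), len(start))
    return similarity < threshold
```

Comparing against a same-length prefix keeps the distance from being dominated by the rest of the post. Where the threshold should sit (that is, what counts as "clearly different") is a tuning question this sketch does not answer.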

castedo commented 9 months ago

Nice! That definitely addresses the most relevant scenarios of the issues raised here. I lose track of the norms on closing GitHub issues, but feel free to close.

Regarding the bit/boolean of information... can software ingesting the Markdown easily detect whether the abstract is an auto-generated summary rather than a traditional abstract?

Looking at the latest two RS posts, I'm not sure what is the intended behavior. This looks right because the abstract is null:

d20n9-rbx62

but what about this one?

j5jfg-n3k62 and

There's both a summary and an abstract.

mfenner commented 9 months ago

What you see in the two examples is the intended behavior. In the case of https://api.rogue-scholar.org/posts/10.59350/j5jfg-n3k62.md, you see in the markdown version that abstract and summary are different because of HTML sanitization. A better example might be https://api.rogue-scholar.org/posts/10.54900/6sz4q-47185.md (using the Ghost platform). The process to decide whether an excerpt is auto-generated from the beginning of the full-text needs to be optimized. Also, all Rogue Scholar blog posts need to be re-processed for abstract detection, which will take some time.

mfenner commented 9 months ago

@castedo I have tweaked the Rogue Scholar API to make it easier to see abstracts and summaries: https://api.rogue-scholar.org/posts?include_fields=summary,abstract&sort=abstract&per_page=50

The tricky part is that you can't directly filter by non-empty abstract fields. In this query, posts with a missing abstract are sorted last.
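
Since the API cannot filter on this, one workaround is to filter on the client after fetching a page. The sketch below rebuilds the query URL above and filters already-decoded results. It assumes each post is a dict with an optional `abstract` key, which the thread does not confirm; `build_query` and `with_abstract` are hypothetical helper names.

```python
from urllib.parse import urlencode

API_URL = "https://api.rogue-scholar.org/posts"


def build_query(per_page=50):
    """Build the query URL shown above; parameter names are taken from that URL."""
    params = {
        "include_fields": "summary,abstract",
        "sort": "abstract",
        "per_page": per_page,
    }
    # safe="," keeps the comma in include_fields unescaped, as in the original URL.
    return f"{API_URL}?{urlencode(params, safe=',')}"


def with_abstract(posts):
    """Keep only posts whose abstract is a non-empty string (assumed dict shape)."""
    return [p for p in posts if (p.get("abstract") or "").strip()]
```

Fetching and decoding the response is left out on purpose, because the shape of the JSON envelope is not described in this thread.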

front-matter / rogue-scholar

abstract naming is misleading and not what is meant in scholarly communication #51