citation-style-language / schema

Citation Style Language schema
https://citationstyles.org/
MIT License

Preprocessing steps #324

Open bwiernik opened 4 years ago

bwiernik commented 4 years ago

Following the discussion on Discourse, we seem to be moving toward splitting the workflow into discrete steps: rendering citations from CSL data, and wrangling data into clean CSL structures.

I think this is a reasonable conceptual distinction. It could be useful for dividing development labor, and it can speed up processors by letting them skip parsing if an item is already fully specified.

Related to my concerns in the Discourse thread, I think we should make clear that these steps must occur somewhere in the CSL workflow. Whether that happens in the calling application or the processor (either itself or by calling an external script) is open.

I've started a list of these preprocessing steps below. It would be good if someone could go through the test suite for CSL (and maybe citeproc-js, and pandoc-citeproc) to identify others.

One other question is how a processor should distinguish clean versus "messy" CSL data. We should identify what signals "clean" status for each field (e.g., a title having 'main' indicates it's been parsed; what about a name--the presence of all of the name-parts even if empty?).

bdarcus commented 4 years ago

One other question is how a processor should distinguish clean versus "messy" CSL data. We should identify what signals "clean" status for each field (e.g., a title having 'main' indicates it's been parsed; what about a name--the presence of all of the name-parts even if empty?).

Thoughts, with no firm conclusions.

What do you mean by "processor" here?

A CSL processor, or some pre-processor, as is the focus of this issue?

I'm assuming you mean the latter; as in, how is it supposed to know when it should work on the data.

I'd say a "main" title, as you suggest, would signal that.

On names, I don't think we should be requiring empty data structures to signal that kind of thing.

How about just the presence of a family name? That might be a little loose, in that for an institutional name it might lead to unnecessary parsing, but by far the most common case is personal contributors. And in your PR on names, you include details that place the priority on family names.

bwiernik commented 4 years ago

(I had meant the citation processor--is the data ready for citations or does it need to be pre-processed still?)

For titles, main sounds good.

For names, it's tricky. The modal case for unprocessed data is that it has either family+given or it has institution (currently literal). Almost no data exist naturally in 5-field format. Perhaps a "parsed" flag indicating that particles, etc. have already been extracted?
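To make that modal case concrete, a pre-processing pass along these lines could lift embedded particles and suffixes out of family/given. This is only an illustrative sketch: the `parse_name` helper, the suffix list, and the lowercase-particle heuristic are assumptions, not the rules from the spec or the test suite.

```python
# Illustrative sketch of a name pre-processing pass. The splitting
# heuristics below are assumptions, not the normative CSL rules.
SUFFIXES = {"Jr.", "Jr", "Sr.", "Sr", "II", "III", "IV"}

def parse_name(name):
    """Lift particles/suffixes still embedded in family/given."""
    parsed = dict(name)
    # Leading lowercase words in the family name become the
    # non-dropping particle: "de la Fontaine" -> "de la" + "Fontaine".
    words = parsed.get("family", "").split()
    particles = []
    while len(words) > 1 and words[0].islower():
        particles.append(words.pop(0))
    if particles:
        parsed["non-dropping-particle"] = " ".join(particles)
        parsed["family"] = " ".join(words)
    # A trailing suffix token in the given name: "John Jr." -> "John" + "Jr."
    given = parsed.get("given", "").split()
    if given and given[-1] in SUFFIXES:
        parsed["suffix"] = given.pop()
        parsed["given"] = " ".join(given)
    return parsed
```

For the example above, this would yield family "Fontaine", non-dropping-particle "de la", given "John", suffix "Jr.".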

denismaier commented 4 years ago

(I had meant the citation processor--is the data ready for citations or does it need to be pre-processed still?)

What will happen if the data isn't yet pre-processed?

I guess these questions depend on who is responsible for bringing the data into shape. E.g., can we expect Zotero to supply data ready to be processed by a citeproc? What about other tools? If the data is not in shape yet, will a citeproc call an external pre-processor, or just fail, or what?

bdarcus commented 4 years ago

OK, but it's possible this pre-process parsing will in fact precede, and be completely independent of, any CSL process.

But in any case, the reason I said family name is this description you added in your PR:

Use family, not literal, for personal mononyms, e.g. 'Socrates', 'Lady Gaga'

So it seems enough to just test for family (or institution).

Moving on:

Perhaps a "parsed" flag indicating that particles, etc. have already been extracted?

Why do we need to worry about this at this point? Perhaps let's just see what developers say?

What will happen if the data isn't yet pre-processed?

?

I think we need to start from the beginning.

These rules can be used to parse strings to objects for the following situations:

  1. converting BibTeX, RIS, etc. to CSL input 1.1, or CSL input 1.0 to 1.1
  2. converting an application's internal string representation to input CSL 1.1
  3. perhaps some CSL processors will choose to add an input pre-processor, if they're accepting titles and names as strings (say CSL input 1.0)

For the first two cases, there's effectively no unprocessed CSL input 1.1.

Only the third case, from what I can tell, will require that the pre-processor determine what to parse. But that's an internal matter for the pre-processor; isn't it?

I guess these questions depend on who is responsible for bringing the data into shape. E.g., can we expect Zotero to supply data ready to be processed by a citeproc? What about other tools? If the data is not in shape yet, will a citeproc call an external pre-processor, or just fail, or what?

I'm thinking about this, but I guess this comes down to the question of what "in shape" means, and whether we should define the input schema in such a way that conformance means it's by definition "in shape."

bwiernik commented 4 years ago

No, “given” doesn’t mean that it’s been parsed.

This is the most common format for names data to exist:

author:
  family: de la Fontaine
  given: John Jr.

In almost all cases where a name has particles or suffixes, it will arrive at the processor/pre-processor with the particles/suffixes still stored in family/given. We need to consider how to mark that the family name should be treated as just the family name.

Looking at the JSON schema, I actually see that "parse-names" is already a property. Frank has already thought of this. So that’s the indicator—after names are parsed (by the calling application, user data entry, or a preprocessor), that is set to "false". A citation processor can regard a name as parsed if parse-names is false or if the name contains dropping-particle, non-dropping-particle, or suffix elements.
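That indicator could be checked along these lines. A minimal sketch in Python; the `name_is_parsed` helper is hypothetical, and accepting both the boolean and the string `"false"` (the latter being what citeproc-js checks for) is an assumption:

```python
def name_is_parsed(name):
    """Sketch: a name counts as parsed if parse-names is false
    (citeproc-js uses the string "false"), or if any particle or
    suffix field is present -- even empty, per the discussion above."""
    if name.get("parse-names") in (False, "false"):
        return True
    return any(key in name for key in
               ("dropping-particle", "non-dropping-particle", "suffix"))
```

A processor could then skip name parsing whenever this returns true.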

denismaier commented 4 years ago

What will happen if the data isn't yet pre-processed?

?

I just meant: What happens if some calling application provides e.g. titles in the old 1.0 format? Will citeprocs just reject that, or call a preprocessor? Who will be responsible?

bdarcus commented 4 years ago

I just think that's not our concern.

A CSL processor could choose, for example, to only accept 1.0 files, but internally convert them into something equivalent to 1.1. Or they could not accept 1.0 files at all, and throw an error.

But that's their decision.

So long as we define our 1.1 input schema correctly.

For example, the name definition in the current schema has zero required properties.

https://github.com/citation-style-language/schema/blob/v1.1/schemas/input/csl-data.json#L156

We should maybe change that to match this discussion, so that by definition any 1.1 input data is in the right form.
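An internal 1.0-to-1.1 conversion of the kind mentioned above could look like the following for titles. This is only a sketch: the delimiters handled (a colon, question mark, or exclamation mark followed by a space) and the `split_title` helper are assumptions, not spec rules.

```python
import re

def split_title(title):
    """Sketch: convert a flat CSL 1.0 title string into a 1.1
    main/sub structure. The delimiter heuristics are assumptions."""
    # Split after ':', '?', or '!' when followed by whitespace.
    parts = re.split(r"(?<=[:?!])\s+", title)
    if len(parts) == 1:
        return {"main": title}
    main = parts[0]
    if main.endswith(":"):
        main = main[:-1]  # drop a trailing colon; keep '?' and '!'
    return {"main": main, "sub": parts[1:]}
```

So "Whose Music? A Sociology of Musical Language" would come out as main "Whose Music?" plus one subtitle.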

bdarcus commented 4 years ago

In almost all cases where a name has particles or suffixes, it will arrive at the processor/pre-processor with the particles/suffixes still stored in family/given. We need to consider how to mark that the family name should be treated as just the family name.

I'd like to hear from @PaulStanley and @andras-simonyi on this if they can find a bit of time.

PaulStanley commented 4 years ago

Isn't the simplest solution to say that if any particles are defined (even as empty) they won't be extracted?

bwiernik commented 4 years ago

Isn't the simplest solution to say that if any particles are defined (even as empty) they won't be extracted?

Yeah, an empty element is one way to do it. citeproc-js's current behavior is to check for "parse-names": "false".

bdarcus commented 4 years ago

So let me see if I understand this narrow issue of family name suffixes and particles:

We have properties for these in the input schema, so there are places to put these data.

Can we not simply say, consistent with the spirit of the effort on title parsing, how to parse these from the family name strings here, and instruct implementations to put them in those properties?

Because the spec does not require parsing these. In fact, it assumes the data is parsed.

So if it doesn't require that, or specify how to do it, how can we be concerned if the data is not parsed? That should be a data error; no?

Alternatively, don't we need to require family-name parsing and put it in the spec, and therefore require all compliant processors to do such parsing, add a bunch of tests to the test suite, etc.?

andras-simonyi commented 4 years ago

I think, from a practical point of view, it would indeed be very useful to have a collection of typical preprocessing tasks that arise when one tries to massage the metadata that is around in practice into CSL, along with guidelines/recipes for parsing. On the other hand, I don't necessarily see the point of standardizing the representation of unparsed data. For instance, if I want my application to process BibTeX entries, I'll surely be faced with the problem of parsing, say, family names into the parts required by the (core) CSL standard to be able to feed the entries into my processor. But the question of how to represent "clean" and "still to be parsed" names will normally be an internal affair of my application, decided exclusively on the basis of practical considerations.

bdarcus commented 4 years ago

I think, from a practical point of view, it would indeed be very useful to have a collection of typical preprocessing tasks that arise when one tries to massage the metadata that is around in practice into CSL, along with guidelines/recipes for parsing.

So what form would this take? A webpage with examples of in-the-wild string data, and how it should be converted, at least logically, to CSL JSON names, dates, and titles?

It does strike me that some of this parsing might be in the spec, and others not. But I'm just focused on the input data record angle.

bwiernik commented 4 years ago

There are quite a few tests in the CSL test-suite (starting with name_) illustrating particle and suffix parsing. e.g., https://github.com/citation-style-language/test-suite/blob/4c1e0b6635167018205d93db500b2daa233dab8e/processor-tests/humans/name_ParseNames.txt

andras-simonyi commented 4 years ago

So what form would this take? A webpage with examples of in-the-wild string data, and how it should be converted, at least logically, to CSL JSON names, dates, and titles? It does strike me that some of this parsing might be in the spec, and others not. But I'm just focused on the input data record angle.

Yes, if there is a collection of representative/useful input-output pairs for a task then I can imagine a 3-tier approach:

  1. A few examples could figure in the (text version of the) standard spec., hinting (hand-waving...) at the semantics of elements/fields, e.g., clarifying what a name suffix is;
  2. a larger number of useful examples also dealing with corner cases etc., perhaps with discussions of the rationale behind them if it's not transparent, could be available on a separate web page outside the spec.;
  3. the full list of examples would be published in a machine readable format, e.g. in JSON, but this would simply be something like an array of input-output pairs, nothing like a full-fledged test in the current test-suite. The only additional structure I can imagine is to indicate which tier an example belongs to, but I'm not sure whether this is necessary.

An important advantage of using simple, task-specific lists of input-output pairs in 3 is that the problem of somehow representing the unparsed input in (extended?) CSL-JSON would simply go away.
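A machine-readable list of such input-output pairs could be exercised by a small harness along these lines. An illustrative sketch only: the `names` key, the file layout, and the `run_examples` helper are assumptions, and any parser under test would be plugged in by the implementer.

```python
import json

def run_examples(path, parser):
    """Run a parser over a JSON file of input-output pairs and
    collect mismatches. The "names" key and file layout are
    assumptions mirroring the kind of file sketched in this thread."""
    with open(path) as f:
        examples = json.load(f)["names"]
    failures = []
    for example in examples:
        got = parser(example["input"])
        if got != example["output"]:
            failures.append((example.get("description", example["input"]),
                             got, example["output"]))
    return failures
```

An empty return value would mean every example parsed as expected; the failure tuples give the description (or raw input) plus actual and expected output for debugging.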

bdarcus commented 4 years ago

That's very helpful @andras-simonyi - thanks!

So on 3, you mean JSON something like this?

{
  "names": [
      {
         "input": "Doe, Jane",
         "output": {
             "family": "Doe",
             "given": "Jane"
          }
      }
  ]
}

Or separate files for each parsing type?

bdarcus commented 4 years ago

There are quite a few tests in the CSL test-suite (starting with name_) illustrating particle and suffix parsing

That's exactly what has frustrated some developers; it's what they mean by undocumented behavior that is nevertheless in the test suite.

I like @andras-simonyi's idea, because it basically moves that into a helpful, but non-normative, resource apart from the spec.

I'm thinking we could create a new repo for that, and encourage developers to submit the examples they come across, for inclusion, particularly in the JSON.

I'll create a repo just to see what it might look like and update here.

bdarcus commented 4 years ago

WDYT about this?

Currently, the content is mostly a placeholder, aside from the beginnings of the JSON schema (I would, however, need to hook it up to the data schema for the output representation).

So the idea is it's just a simple repo aimed at publishing both human-readable (markdown -> html) and machine-readable (json) representations.

This would be the URL for the dates json file, for example:

https://citationstyles.org/data-parsing/json/dates.json

And a start of a titles html page (it would be best to include real examples, though):

https://citationstyles.org/data-parsing/titles.html

So separate pages and json files for each data type.

Ideally, we'd move some of the examples from the test suite here, so the suite is only focused on what to do with correctly structured data.

It would allow developers to submit PRs, of course.

andras-simonyi commented 4 years ago

@bdarcus : thanks, this is exactly what I meant, I'd find this infrastructure very useful for development and less confusing regarding the standard.

So on 3, you mean JSON something like this? [...] Or separate files for each parsing type?

I was thinking in terms of separate files, but one file would also be totally fine I think.

Looking at the name parsing related tests in the CSL suite, it occurred to me that the filenames seem to encode useful information about the nature of the individual test cases, e.g., ParsedNonDroppingParticleWithApostrophe.txt. It'd be nice to have a place for similar descriptions in the JSON schema for the examples -- at least I'd find it very useful if I could get a more detailed description when a test case fails.

bdarcus commented 4 years ago

It'd be nice to have a place for similar descriptions in the JSON schema for the examples.

Like a description property; metadata?

That makes sense.

andras-simonyi commented 4 years ago

Like a description property; metadata?

Yes.

bdarcus commented 4 years ago

OK, here's an example of what I currently have.

    {
      "description": "Title and Subtitle, but Question Mark Delimeter",
      "input": "Whose Music? A Sociology of Musical Language",
      "output": {
        "main": "Whose Music?",
        "sub": ["A Sociology of Musical Language"]
      }
    }

I'm still debating about whether we want to maintain the source examples in JSON (which can be a hassle), or YAML.

https://github.com/citation-style-language/data-parsing/issues/1

If we went with YAML, we'd still publish the JSON (and I'd include a conversion script), but this would be the source:

- description: Title and Subtitle, but Question Mark Delimiter
  input: Whose Music? A Sociology of Musical Language
  output:
    main: Whose Music?
    sub:
    - A Sociology of Musical Language

Do you have any thoughts on this @bwiernik and @denismaier?