dracor-org / fredracor

French Drama Corpus

Introduce slug and dates in "ids.xml" #11

Closed lehkost closed 2 years ago

lehkost commented 3 years ago

In order to be able to add missing dates (or correct erroneous ones), we should introduce a corresponding option in "ids.xml".

The same goes for a more readable slug that divides the words in a meaningful way.

Both are demonstrated with this example of "Sermon Joyeux de Bien Boire" (DraCor ID: fre000037):

<play id="fre000037" file="ANONYME_SERMONJOYEUX.xml"/>

will become:

<play id="fre000037" file="ANONYME_SERMONJOYEUX.xml" slug="anonyme-sermon-joyeux-de-bien-boire" print="1545" premiere="" written=""/>

If available, this additional data would be written into the DraCor files when transforming the original files. Where present, these dates would also override the dates from the original files, since there are sometimes discrepancies between the first print edition (whose date we collect, i.e., the date of first printing) and the edition used by TC.

Ranges à la notBefore and notAfter are also possible; in these cases the two years are separated by an en dash (–).
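To make the proposed attribute format concrete, here is a minimal sketch of how the date values could be read. The attribute names (print, premiere, written) come from the example above; the function name `parse_year`, the returned keys (`when`, `notBefore`, `notAfter`, borrowed from TEI naming) and the range value in the last line are assumptions for illustration, not the actual transformation script.

```python
import xml.etree.ElementTree as ET


def parse_year(value):
    """Turn an ids.xml date value into a small dict (hypothetical helper).

    ""           -> None (no override given)
    "1545"       -> {"when": 1545}
    "1540–1545"  -> {"notBefore": 1540, "notAfter": 1545} (en dash as separator)
    """
    value = value.strip()
    if not value:
        return None
    if "–" in value:
        start, end = (int(part) for part in value.split("–", 1))
        return {"notBefore": start, "notAfter": end}
    return {"when": int(value)}


# The <play> element from the example above.
play = ET.fromstring(
    '<play id="fre000037" file="ANONYME_SERMONJOYEUX.xml" '
    'slug="anonyme-sermon-joyeux-de-bien-boire" '
    'print="1545" premiere="" written=""/>'
)
dates = {name: parse_year(play.get(name, "")) for name in ("print", "premiere", "written")}

# A made-up range value, only to show the en dash case.
example_range = parse_year("1540–1545")
```

Empty attributes map to `None` here, so a caller could fall back to the date found in the original file whenever no override is given.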

cmil commented 3 years ago

Before implementing the slug, I'd like to make sure we have a common understanding of what we want to achieve. The idea (at least mine) of using names instead of IDs for filenames and URL slugs was to make those easy to recognise and memorable. IMO the example slugs added in 4ba115719a1eb9b2901a97a24fe7b5f406c0672e tend to deviate from that idea. Some of them are way too long to be memorable and it takes quite some visual parsing to distinguish the different farces nouvelles from each other. I think we should stick to shorter slugs, also considering that they would end up as file names where an excessive length would be rather impractical for various reasons.

Also, underscores in the original filenames are already replaced by dashes when adding the files to fredracor, so slug overrides like anonyme-adam or most of the racine-* ones wouldn't actually be necessary.
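A rough sketch of that last point, assuming the default slug is simply the TC filename without its extension, lowercased, with underscores replaced by dashes. The function names and the filename `ANONYME_ADAM.xml` are hypothetical; only the slug `anonyme-adam` and the Sermon Joyeux example appear in this thread.

```python
import os


def default_slug(filename):
    """Slug derived from the TC filename alone (assumed default behaviour)."""
    stem, _ext = os.path.splitext(filename)
    return stem.lower().replace("_", "-")


def needs_override(filename, slug):
    """True only if an explicit slug differs from what the default would yield."""
    return bool(slug) and slug != default_slug(filename)


# Redundant: the default already produces "anonyme-adam" (filename is assumed).
redundant = needs_override("ANONYME_ADAM.xml", "anonyme-adam")

# Needed: the default would only give "anonyme-sermonjoyeux".
needed = needs_override(
    "ANONYME_SERMONJOYEUX.xml", "anonyme-sermon-joyeux-de-bien-boire"
)
```

A check like this could be used to list the slug attributes in "ids.xml" that add nothing beyond the default.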

lehkost commented 3 years ago

All very good points, and I agree we should put purpose first. So we might abandon the slug idea in general. Or we could inherit the filename from TC, lowercase and dash it as we already do, but additionally separate the words in our slugs.

E.g. "anonyme-resurrectionjeninlandore" → "anonyme-resurrection-jenin-landore"

If we were a commercial website, I'd argue that we do this for SEO reasons. 😊 But since we aren't, I'd argue that this leads to better readability (also since URLs/slugs will land in our data files and spreadsheets for analyses).

So, if we decide to do it this way, should we strive for an automatic solution that uses the title pattern to divide slugs, or should we divide them manually?
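For the automatic variant, one possible approach is a greedy split that looks for the words of the play title inside the run-together filename part. This is only a sketch under assumptions: the function names, the `min_len` cutoff, and the title string passed in for the Jenin Landore example are made up for illustration, and nothing like this exists in the repository as far as this thread shows.

```python
import unicodedata


def normalize(word):
    """Lowercase a title word and drop accents and non-letters."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(c for c in decomposed if c.isascii() and c.isalpha()).lower()


def split_by_title(run, title, min_len=3):
    """Split a run-together lowercase string at words taken from the title.

    Title words shorter than min_len are ignored to avoid spurious splits
    on short words such as "de" or "la". Text that matches no title word
    is kept together as its own segment.
    """
    words = {normalize(w) for w in title.split()}
    # Try longer words first so that a long word wins over a shorter prefix.
    words = sorted((w for w in words if len(w) >= min_len), key=len, reverse=True)
    parts, pending, i = [], "", 0
    while i < len(run):
        match = next((w for w in words if run.startswith(w, i)), None)
        if match:
            if pending:
                parts.append(pending)
                pending = ""
            parts.append(match)
            i += len(match)
        else:
            pending += run[i]
            i += 1
    if pending:
        parts.append(pending)
    return "-".join(parts)


# The title string here is an assumed input for illustration.
slug_title_part = split_by_title(
    "resurrectionjeninlandore", "La Résurrection de Jenin Landore"
)
```

Abbreviated or misspelled filename parts would not match any title word and stay unsplit, so a manual override would still be needed for those cases.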

lehkost commented 3 years ago

I just committed the compromise proposed above (https://github.com/dracor-org/fredracor/commit/28edf108e4d9ed11de18d67b53d894fd051cf343).

Slugs now feature the exact same letters as in the original TC, but individual words are separated with dashes (both authors and abbreviated titles). I kept the underscore between author(s) and titles, in case we need it later. Other than that, underscores can still be transformed to dashes in the URL.

Examples:

I did not remove obvious spelling mistakes (there are just a few), like this:

cmil commented 3 years ago

@lehkost please have a look at the above PR. I replaced all underscores with hyphens for simplicity. I also removed all slug attributes that are not strictly necessary because they affect the performance of the transformation script. I also updated the staging database with the new versions.