denoland / std

The Deno Standard Library
https://jsr.io/@std
MIT License

front-matter: allow YAML parsing to be reconfigured (or re-expose generic extraction) #5677

Closed mdekstrand closed 2 months ago

mdekstrand commented 2 months ago

Pre-1.0, front-matter exposed a createExtractor function to create extractors with custom parsers. I relied on this to plug in a YAML parser configured with a different schema. Because front-matter 1.0 exposes only the extractor functions, which take no configuration, the public API offers no way to change the parser options (or to use a different parser).

Given that the extractAndParse function requires a format-specific regex, the best solution for my particular use case (configuring the YAML parser) looks like adding an options: ParserOptions parameter to extractYaml. Re-exposing generic front-matter extraction that returns plain text for the client to parse would also be useful for other extensions, but it looks more difficult from my understanding of the current code and is not necessary for my immediate problem.

The alternatives I have considered are sticking with the last 0.2XX release of front-matter or importing extractAndParse directly from _shared.ts, but the latter looks impractical. I have tried to get by with the standard schema, but because of legacy content, the standard YAML schema parses the dates in my data in the wrong time zone.

kt3k commented 2 months ago

What yaml options do you use?

mdekstrand commented 2 months ago

I'm using {schema: "json"}.

iuioiua commented 2 months ago

The default schema that extractYaml() uses extends the json schema, so I don't yet understand how such issues arise. Can you please provide a minimal reproducible code snippet, with instructions, that worked before these changes but not after the upgrade?

mdekstrand commented 2 months ago

The extensions are exactly the problem: the default schema detects ISO dates and parses them as JavaScript Date objects, whereas the JSON schema leaves them as strings for the application to deal with later. The following code shows that the parsed value has type object and has filled in the time information with midnight UTC, while I need to be able to handle it as a string (and that passing schema: "json" to yaml.parse does so):

import { extractYaml } from "@std/front-matter";
import * as yaml from "@std/yaml";

const meta = "date: 2024-10-24";

const text = `---
${meta}
---

text`;

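// Default front-matter extraction: the date is parsed into a Date object.
const parsed = extractYaml(text);
console.log("front-matter: parsed %s: %o", typeof parsed.attrs.date, parsed.attrs.date);

// yaml.parse returns unknown, so narrow it before reading `date`.
let yparse = yaml.parse(meta) as { date: unknown };
console.log("yaml defaults: parsed %s: %o", typeof yparse.date, yparse.date);
// With the JSON schema, the date stays a plain string.
yparse = yaml.parse(meta, { schema: "json" }) as { date: unknown };
console.log("yaml json schema: parsed %s: %o", typeof yparse.date, yparse.date);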

produces the output:

front-matter: parsed object: 2024-10-24T00:00:00.000Z
yaml defaults: parsed object: 2024-10-24T00:00:00.000Z
yaml json schema: parsed string: "2024-10-24"
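
The midnight-UTC value is consistent with how JavaScript itself treats date-only ISO strings: they are interpreted as UTC, so the calendar day can shift once the value is viewed in a zone west of UTC. The sketch below uses only the built-in Date (no @std imports). The fixed UTC-4 offset is an illustrative assumption, and it assumes the YAML timestamp type yields an equivalent Date, as the output above suggests.

```typescript
// A date-only ISO string is interpreted as midnight UTC, not local midnight.
const asDate = new Date("2024-10-24");
console.log(asDate.toISOString()); // 2024-10-24T00:00:00.000Z

// Simulate a wall clock at UTC-4 (illustrative offset) by shifting the instant.
const shifted = new Date(asDate.getTime() - 4 * 60 * 60 * 1000);
const dayAtUtcMinus4 = shifted.toISOString().slice(0, 10);
console.log(dayAtUtcMinus4); // 2024-10-23

// Kept as a string, the value carries no time zone and cannot drift.
const asString = "2024-10-24";
console.log(asString === dayAtUtcMinus4); // false
```

Keeping the value as a string, as the JSON schema does, leaves the choice of time zone to the application.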

iuioiua commented 2 months ago

Ah, I see. Yep, let's add an option for configurability. PRs are welcome to add ParseOptions from @std/yaml to extractYaml().

timreichen commented 2 months ago

Instead of adding an option, how about we split extract and parse into two separate public functions? We can always provide a function that does both, as we have now, but splitting would make the whole module more flexible for custom use cases.

kt3k commented 2 months ago

Accepting parser as the 2nd argument might be another option:

export function extract<T>(
  text: string,
  parse_: Parser = parse as Parser,
): Extract<T> {
  return extractAndParse(text, EXTRACT_YAML_REGEXP, parse_);
}

You can specify your own parser (including a 3rd-party one):

import { parse } from "@std/yaml";

extract(markdown, (text) => parse(text, { schema: "json" }));
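
To make the parser-as-argument idea concrete without depending on library internals, here is a self-contained sketch. Everything in it is a hypothetical stand-in: extractWith, parseFlat, and FRONT_MATTER_RE are illustrative names, the regex is a simplified guess at `---` delimited front matter, and the toy parser only handles flat `key: value` lines. It is not the @std/front-matter implementation.

```typescript
// Hypothetical stand-ins; not the @std/front-matter implementation.
type Parser = (raw: string) => Record<string, unknown>;

interface Extracted {
  frontMatter: string;
  body: string;
  attrs: Record<string, unknown>;
}

// Simplified pattern: a leading block delimited by `---` lines.
const FRONT_MATTER_RE = /^---\r?\n([\s\S]*?)\r?\n---(?:\r?\n|$)/;

// Toy default parser: flat `key: value` lines, every value kept as a string.
const parseFlat: Parser = (raw) => {
  const attrs: Record<string, unknown> = {};
  for (const line of raw.split(/\r?\n/)) {
    const idx = line.indexOf(":");
    if (idx === -1) continue; // skip lines without a key/value separator
    attrs[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return attrs;
};

// The parser is an optional second argument, so callers can swap it out.
function extractWith(input: string, parse: Parser = parseFlat): Extracted {
  const match = FRONT_MATTER_RE.exec(input);
  if (match === null) throw new TypeError("No front matter found");
  const frontMatter = match[1] ?? "";
  return {
    frontMatter,
    body: input.slice(match[0].length),
    attrs: parse(frontMatter),
  };
}

const doc = "---\ndate: 2024-10-24\n---\n\ntext";
const viaDefault = extractWith(doc);
// Caller-supplied parser: this one converts `date` into a Date object.
const viaCustom = extractWith(doc, (raw) => {
  const attrs = parseFlat(raw);
  return { ...attrs, date: new Date(String(attrs.date)) };
});
console.log(typeof viaDefault.attrs.date, viaCustom.attrs.date instanceof Date);
```

With this shape the extraction regex stays fixed per format, while the caller decides how the captured text becomes attributes.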

mdekstrand commented 2 months ago

@timreichen @kt3k that's exactly what the old version of front-matter did, and I'd be happy for that solution as well. The current version of front-matter has things more entangled, though, in ways I haven't taken time to fully understand, so it may no longer be practical (specifically, the underlying extract functions require a regex in addition to the parser).

mdekstrand commented 2 months ago

@iuioiua I submitted #5748 to add this.

timreichen commented 2 months ago

(Quoting kt3k's proposal above: accept a parser as the 2nd argument of extract.)

I would strongly advocate for splitting the functionality instead of passing the parser through. It disentangles the two steps and enables an additional use case: extracting only the raw front matter, for example to forward it to another API.

const string = "...";
const { frontMatter } = extract(string);
api.handleFrontMatter(frontMatter);

Custom parsing would also look straightforward:

const string = "...";
const { frontMatter } = extract(string);
const attrs = customParse(frontMatter);
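
As a rough illustration of the split proposed here, the sketch below separates extraction from parsing. splitFrontMatter and this customParse are hypothetical names written for the example, the regex is a simplified assumption about `---` delimited front matter, and the parser is a toy that only reads flat `key: value` lines. None of this is the library's actual API.

```typescript
// Hypothetical helper: separates raw front matter from the body, no parsing.
function splitFrontMatter(input: string): { frontMatter: string; body: string } {
  const match = /^---\r?\n([\s\S]*?)\r?\n---(?:\r?\n|$)/.exec(input);
  // Without a front-matter block, the whole input is treated as body.
  if (match === null) return { frontMatter: "", body: input };
  return { frontMatter: match[1] ?? "", body: input.slice(match[0].length) };
}

// Stand-in for whichever parser the caller prefers (YAML, TOML, JSON, ...).
function customParse(raw: string): Record<string, string> {
  const attrs: Record<string, string> = {};
  for (const line of raw.split(/\r?\n/)) {
    const idx = line.indexOf(":");
    if (idx === -1) continue; // ignore lines that are not `key: value`
    attrs[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return attrs;
}

const source = "---\ntitle: Hello\ndate: 2024-10-24\n---\nBody text";
const { frontMatter, body: rest } = splitFrontMatter(source);
const attrs = customParse(frontMatter);
console.log(attrs.title, attrs.date, rest);
```

Because the extraction step returns plain text, the same helper serves both use cases in this comment: forwarding the raw front matter elsewhere, or parsing it with any parser.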