bountonw / translate


Decide on main filter language for exporting to txt, html, pdf #11

Closed bountonw closed 5 months ago

bountonw commented 5 months ago

@mattleff

Currently we have one validation test written in bash. That seems too clunky for heavy text filtering.

(Before we proceed: Perl is off the table at the moment.)

Here are some options for filters.

  * Haskell - elegant filters, learning curve
  * Elixir - elegant filters, learning curve
  * Python - easier, clunky filters
  * something other than Perl

I haven't purchased a domain yet, but we may want to use the same language for our filters as for the backend of the website.

I think the txt filter should be the easiest of the three exports.

Our filter language does not necessarily need to be the same as the script running the filters. Something to think about.

bountonw commented 5 months ago

  * Go (golang) - efficient, scales well for the web
  * continue using bash

Python is slower than some of the options.

It would be nice to have something with a straightforward way to write tests, so that when we change something we know whether we broke anything.

I have more filters that I would like to implement so as to filter out common mistakes. As we add more books and include export actions, it would be nice if some of the tests could run concurrently. Some of the languages are easier than others when it comes to concurrency.

My gut feeling is that it would probably be easier to port the old Haskell filters to the language of choice than to try to write new filters in Haskell. Scripting pandoc is the key.

mattleff commented 5 months ago

@bountonw I may not be totally understanding the question you're asking, but I guess I see two different pieces here:

  1. There will be validation (linting) that we will want to apply universally (and probably book-specifically also) to all the markdown files at each stage of translation. For this I would recommend existing tooling such as https://github.com/DavidAnson/markdownlint. This could replace validate-md.sh and would use both off-the-shelf rules and custom rules for some of our unique things (like #18).
  2. There will be filters that apply to the content as we process it for export to whatever format. These filters could do things like reformat refcodes, strip comments, apply section styles, etc. I'm not sure what language/solution is best for that yet. If we continue to use pandoc, it appears that pandoc filters can be written in Haskell, Python, Ruby, and TypeScript/JavaScript (among others). A minimal sketch of one such filter follows this list.
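
To make item 2 concrete, here is a minimal sketch of a comment-stripping filter using pandoc's Lua filter support (Lua comes up below as a candidate). The file name and the focus on HTML-style comments are illustrative assumptions, not settled conventions:

```lua
-- strip-comments.lua (illustrative sketch)
-- Pandoc parses raw HTML, including <!-- comments -->, into RawBlock
-- and RawInline elements; returning an empty list from a filter
-- function removes the element from the document.

local function is_html_comment(el)
  return el.format == "html" and el.text:match("^<!%-%-") ~= nil
end

function RawBlock(el)
  if is_html_comment(el) then
    return {} -- drop block-level HTML comments
  end
end

function RawInline(el)
  if is_html_comment(el) then
    return {} -- drop inline HTML comments
  end
end
```

A filter like this would be attached at conversion time, e.g. `pandoc chapter.md --lua-filter=strip-comments.lua -o chapter.html` (file names hypothetical).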

I'll admit that I'm partial to TypeScript/JavaScript, but it might not be the right language for this use case. I have significant experience with JavaScript/TypeScript and PHP, so those are the languages I'm most comfortable with. But I can figure out whatever language we decide is best. What languages are you partial to?

bountonw commented 5 months ago

@mattleff

  1. Thank you for the proper word, linting! Yes, that is what we need to do. If there are existing linters we can adapt for our purposes, that makes sense. It also makes sense to me to use linters that are not GitHub-dependent, in order to be future-proof.
  2. Filtering for pandoc exports: Perl and PHP are out for me, and I would prefer not to use JavaScript, though I am not adamant about that. Lua is natively supported by pandoc. Haskell also has really good support. Between Ruby and Python, I would go with Python, although I am worried that it might be too slow if we scale. When going to TeX, we have an intermediate step that could be written in any language; that is, .md to .tex could be done in any language we want, and then .tex to .pdf would use an actual filter that talks to pandoc. So: Lua first, then Haskell/Python, with JavaScript a distant fourth. Speed is nice, and depending on whether we are paying for bandwidth in the future, minimizing resources is best.
  3. The third area is the actual website that hosts the final product (backend stuff, plus any scripting needed beyond the pandoc-generated HTML files). What language do we want for that? Is it different from the other languages?

There are three types of speed: 1. speed to write, 2. speed to understand code already written, and 3. processor speed. Currently number 3 is the cheapest; however, that may not always be the case. We may be running things on old used machines and may need to process the PDFs and other formats locally.

bountonw commented 5 months ago

Of course, if our linter is in Node.js, that may tip the scales. Hmm.

bountonw commented 5 months ago

More research later. Since Lua is native to pandoc, and since we will also be using LuaTeX for PDF creation, using Lua for all file-transformation scripts makes sense. This is number 2. Linting and the website backend are separate issues.
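
As a concrete illustration of what those transformation scripts could look like, here is a hypothetical sketch of a Lua filter that applies a section style only on the TeX path. The `poetry` class and the LaTeX environment are invented for the example; `FORMAT` is the global that pandoc sets to the name of the output format:

```lua
-- section-style.lua (hypothetical sketch)
-- One filter can serve both the HTML and TeX pipelines by branching
-- on FORMAT, the output-format name that pandoc provides globally.
function Div(el)
  -- assumes a fenced div such as ::: {.poetry} ... ::: in the markdown
  if el.classes:includes("poetry") then
    if FORMAT:match("latex") then
      -- wrap the block in a custom LaTeX environment for the LuaTeX run
      return {
        pandoc.RawBlock("latex", "\\begin{poetry}"),
        el,
        pandoc.RawBlock("latex", "\\end{poetry}"),
      }
    end
    -- for HTML output the class passes through and CSS can style it
  end
end
```

The two-step PDF route mentioned above would then be something like `pandoc book.md --lua-filter=section-style.lua -o book.tex` followed by `lualatex book.tex` (names hypothetical).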

The reason that I want to deal with this early is that I don't want to end up with a dozen languages to maintain, each contribution written in its author's language of choice.

Are you comfortable with the choice of Lua as the answer to the original post, which corresponds to number 2 of the language uses: the language used to transform markdown into the output of choice (via pandoc)?

mattleff commented 5 months ago

@bountonw I've spent some time this afternoon looking at how pandoc Lua filters work and I think I'm convinced that Lua is a good option for export filters (need number 2). I have no experience with Lua, so this will be a learning process, but should be doable.

For linting, I'll try to put together a PR with markdownlint for us to try and see how we like it.

For the website, I'm hopeful that we can use an existing tool to handle some/most of what we need. I've worked with a number of static site builders before (such as Jekyll, Gatsby, Next.js, etc.). We could also look at more focused documentation tools, like Docusaurus. We'll have to scope out the requirements for the website to know what's the best option.

bountonw commented 5 months ago

Thank you. I'll close this and raise an issue to update the README file.