Investigate better internationalisation format/system/workflow

anki-geo / ultimate-geography

Geography flashcard deck for Anki

https://ankiweb.net/shared/info/2109889812

Other

828 stars 82 forks source link

Investigate better internationalisation format/system/workflow #143

Closed axelboc closed 3 years ago

axelboc commented 5 years ago

The current CSV format is not adequate for managing translations:

The diffs are unreadable.
Changes to translations conflict with content edits.
The growing number of columns makes data.csv difficult to work with.

CSV was good to start with, but I think we need to investigate more advanced internationalisation formats/systems/workflows if we want to encourage more contribution.

Vages commented 5 years ago

Could not agree more with your description of the problem.

Anki Deck Manager (anki-dm) seems to have one source of truth: data.csv. It is possible (and beneficial) to solve the problem without modifying anki-dm. I may be missing something, but I think the simplest way to do this on our own is to split translations over several files and merging them with a script. If you agree with the suggested solution, only two questions remain: (1) Which format should be used to store the split data? (2) How will the merge script work (algorithm/language)?

Here are my suggestions for answers to these two questions: (1) CSV should be understandable for less techy users and is the result file format. So I think it's the best file format contender. Only other format that's as widely read by different libraries and users is JSON, but I don't think nested data structures (such as JSON) are well suited for tabular data. So CSV wins that head-to-head fight. (2) Merging the files should be a simple join operation, as seen in every Databases 101 subject. Each row already has a unique ID, which could be used to refer to each item across files. Merging the files should be solvable in most high-level languages. PHP is the strongest contender, as it is already used in the project. If we want to avoid PHP (which I don't think we should want to, but anyway), my suggestion is to use Javascript (Node) or Python, as they are likely to already be installed on the user's computer.

Do you see any holes in this logic?

Vages commented 5 years ago

... (2) cont'd: A bash script could also be a contender: https://stackoverflow.com/questions/6301059/how-to-join-2-csv-files-with-a-shell-script However, I don't know how compatible this is with Windows.

Vages commented 5 years ago

Furthermore: If we go for a PHP/CSV combination, our solution is more likely to be transferred into anki-dm than if we choose some other combination.

Vages commented 5 years ago

This Stack Overflow answer seems to be of some help (to PHP noobs like myself): https://stackoverflow.com/questions/15490465/how-to-combine-rows-with-the-same-id-from-a-csv-with-php

ohare93 commented 5 years ago

Perhaps simply having one csv for each language is good enough. Of course, this is not supported by AnkiDM, however the creator of AnkiDM @OnkelTem is working on an updated version, in which multiple files are supported. I will ping him elsewhere to get his feedback on whether his new system could help with this.

P.S: Python > PHP 😉

Vages commented 5 years ago

Great, @ohare93. Perhaps he should make some public issue about it as well? I'd be glad to help out with implementing it. My Norwegian translation of Ultimate geography kind of depends on it.

Erim24 commented 5 years ago

Perhaps simply having one csv for each language is good enough.

No matter if it is supported by AnkiDM or not. I think that one file per language will make it easier to focus on the language you want to translate. On the other side it might make it a little harder to coordinate all languages and e.g. find missing translations or translate new stuff that was added in another language. Nevertheless, I would vote for one file per language

axelboc commented 5 years ago

On the other side it might make it a little harder to coordinate all languages and e.g. find missing translations or translate new stuff that was added in another language.

That's exactly right, I think we should go much further than splitting data.csv into multiple CSVs. What about setting up a proper translations management system? I have no experience in this area whatsoever, but I just came across Pontoon by Mozilla, which we could install on a free Heroku instance.

If I understand the docs correctly, the workflow would look something like this:

With a script of some sort, extract every piece of content to translate into a source locale file (in whichever format -- perhaps Fluent?)
Use Pontoon (in a browser) to translate the content into a new locale. (Pontoon also shows which new or modified pieces of content need to be translated or re-translated into existing locales.)
Upon saving, Pontoon creates/modifies the target locale files and commits the changes to the repository.
Use the target locale files to generate the translated decks somehow.

The last step could probably be done by modifying AnkiDM's build script or by running a pre-build script to merge the translations back into data.csv.

The first step is a bit trickier, as we'd have to run a command to extract any new or modified field every time we change the English deck. But if data.csv gets modified on build, would we still want to commit it in the repository? Would it make more sense to edit the English deck in a separate file? This would mean changing AnkiDM's index command...

We could also move away from AnkiDM altogether and re-implement it in a way that integrates with a translations management system better (and in any language we please). IMO, this would be the optimal, although more time-consuming, option.

@Vages, please do open a PR with your translations (in data.csv). This issue won't be resolved tomorrow, and it'd be a shame to delay releasing the new Norwegian deck. We can work with more columns in data.csv for the time being.

ohare93 commented 5 years ago

@axelboc for these types of automated build steps Github Actions is now a thing, if you opt into the beta. If possible to do this in there, it may be better as it could all be done in this repo, rather than relying on another program, especially for contributors.

aplaice commented 5 years ago

Two more things that could be improved regarding internationalisation are:

Translating the templates themselves (mainly the headings "Capital", "Location", "Flag").
Translating the deck descriptions.

(I only noticed these problems while looking at the French version of the deck (#157), though they must have bitten everyone using any of the translated decks and everyone participating in the conversation is probably aware of them, but it might be useful for reference.)

axelboc commented 4 years ago

Little update on this: I was not very successful at deploying Pontoon onto a free Heroku instance. The documentation isn't great and I ran into configuration issues. I don't think I have the skills to put a solution like this in place, unfortunately.

On further thought, I agree with you @ohare93 that it might not be such a good idea to depend on another program to maintain the deck (it could break, it would have to be updated, other people would need access and training to ensure long-term continuity, etc.)

Using a tool like Pontoon was appealing because it would have split the translations over multiple files in the repository, thus reducing conflicts, while providing us with an intuitive localisation interface, where it would have been easy to see existing translations for a given string, for instance.

If we split data.csv into multiple files without such an intuitive interface, I'm afraid it will be difficult to spot missing translations, inconsistencies between languages, broken identifiers, etc. (as @Erim24 first noted in her https://github.com/axelboc/anki-ultimate-geography/issues/143#issuecomment-531919514)

File format

Let's imagine going back to a single file for now. CSV is clearly the main issue, so which structured data format could we use instead?

JSON is great, but it's heavy on double quotes and doesn't allow comments in its standard form.
YAML is a lot lighter and allows comments. It relies on strict indentation, but we could check this on push.
TOML is interesting too. Indentation is optional, but strings do require quotes.
INI, XML ... let's skip these ones, shall we?

Unlike CSV ... JSON, YAML and TOML are all supported by Prettier (cf. #248), so ensuring proper formatting and preventing indentation errors would not be much trouble.

My vote goes to YAML, just so we don't have to worry about double quotes... and I feel like the structure would be more concise than with TOML (see below).

Content structure

Here is how I picture the content of data.yml:

- id: crr.AfnVRi
  country:
    en: England
    de: England
    es: Inglaterra
    fr: Angleterre
    nb: England
  countryInfo:
    en: Constituent country of the United Kingdom.
    ...
  capital:
    en: London
    ...
  capitalInfo:
    - en: ~ # shorthand for `null` - i.e. no value
    ...
  capitalHint:
    en: Not a sovereign country
    ...
  flag: ug-flag-england.svg # no need to pass the whole HTML
  flagSimilarity: ~ # we can make the whole field `null` instead of each language separtely
  map: ug-map-england.png

- id: ...
  ...

The translations are grouped together, which makes localisation easier, and each string is on a separate line, which allows for better diffing in git.

Since introducing a new file structure on top of data.csv requires a build step for conversion, we could adapt the structure to our needs (cf. my comments in the example above). For blurred flags, notably, the following options come to mind:

- ...
  flag: ug-flag-guam.svg
  flagBlurred: true

- ...
  flag:
    filename: ug-flag-guam.svg
    blur: true

- ...
  flag:
    default: ug-flag-guam.svg
    blurred: ug-flag-guam-blur.svg

Back to multiple files

Obviously, with the structure above, data.yml would be ginormous, but now we can imagine a different kind of split: instead of splitting the translations into multiple files, how about splitting the notes themselves? We could do this by group (e.g. by continent, by subregion, whatever), or we could even split each note into its own YAML file, a bit like in the media folder! The latter would even simplify the structure further, eliminating the need for an array structure.

So... what do you all think?

ohare93 commented 4 years ago

@axelboc I've been working on and off (mostly off) for the past half year on a little data manager program to take care of this exact problem, as I need it for other things in my life. I'm calling it 'Brain Brew', it's entirely in Python and it's on my Github (as Vocab-Flashcard-Manager right now), but dear god I need another few weeks to do some refactoring before mortal eyes may look upon it :sob: :sweat_smile:

The general premise:

Bi-directional conversion of Csv(s) and CrowdAnki exports by using an intermediate layer "Deck Parts" (Deck Headers, Note Models, and Notes). If you want to add in Markdown/any other format people want then you only need to add a converter for Markdown <-> DeckParts and suddenly Csv and CrowdAnki can convert to and from Markdown.

Highly configurable and simple Yaml config files which state how csv headers match to Note Model Fields. Multiple csvs can be merged into one as a build task, allowing for separation of unique parts. CrowdAnki exports can be imported in, told which files they associate with, and all the relevant Csv(s) and their derivitates are updates.

The goal is to be able to make a change in either Csv or Anki, and be able to pull that change into the other. I want the sheer power of scripting my own helper functions to automate some data entry, while having the flexibility to tweek the values in Anki (or just use already existing Anki plugins!)

What I have now

Csv <--> CrowdAnki is fully working, but clunkily and can be cleaned up a lot. Csvs can not yet split parts of the same note, only separate notes (I will be implementing derivatives soon). Also missing is Guid generation and media handling, but that should be straightforward. After I get these minor things done over the next few weeks (I have a programming holiday coming up, just for this :wink:) I will implement Brain Brew in my Ultimate Geography fork as a demonstration :+1:

aplaice commented 4 years ago

Using existing internationalisation frameworks

One advantage of using/adapting an existing internationalisation solution is that they (hopefully...) must have standardised solutions for, say, marking a translation as requiring an update or requiring checking. Also, they will presumably have ready-made tooling for internationalising an HTML file (the templates). With Mozilla's fluent it would require having something like this (assuming that Anki's special syntax wouldn't collide with Mozilla's tooling) for a fragment of Country - Capital.html:

<div class="type" data-l10n-id="capital"></div>
<div class="value">{{Capital}}</div>

and in, say, the German localisation file:

capital = Hauptstadt

which would be cleaner than the most easily implementable approach of having Country - Capital.en.html, Country - Capital.de.html etc.

OTOH fluent's (I'm not sure about other solutions) approach of splitting the translations by language is sub-optimal. Also, I'm not sure how well-suited it is to "intensive" translations (where effectively all of the content has to be translated).

CSV

I wouldn't completely throw CSV out of the running. Its main advantage (and simultaneously disadvantage) is that it directly reflects the "2D"-ish* nature of the data. (countries (notes) × fields)

* on reflection, with translations, it's more 3D (countries × fields × languages)...

I'm not much of a "spreadsheet person", but I find it convenient to be able to quickly navigate both by column and by row when looking at and editing the data.

The disadvantage obviously is that git isn't configured out-of-the-box to deal with non-line-based changes, while the tooling for dealing with CSVs correctly (from the point of view of diffing**, merging, and blaming) doesn't really exist or even if it did it, one couldn't rely on everybody to have it installed (and anyway GitHub wouldn't use it).

** I've managed to avoid going completely crazy by using the following script for CSV diffs:

git diff --word-diff-regex="[^[:space:],]+" "$@"`

but it's still very, very far from perfect.

Other formats

If we are to preserve the current property of two-way conversion with Anki, via CrowdAnki then the lack of "native" comments is not an issue since they'd be overwritten by an import from Anki. ("Non-native" comments can obviously be implemented as a custom field in any format.)

That said, I like the lightness of YAML compared to JSON, and I quite like the suggested formatting!

Obviously, with the structure above, data.yml would be ginormous, but now we can imagine a different kind of split: instead of splitting the translations into multiple files, how about splitting the notes themselves? We could do this by group (e.g. by continent, by subregion, whatever), or we could even split each note into its own YAML file, a bit like in the media folder! The latter would even simplify the structure further, eliminating the need for an array structure.

We'd still need a specification on how to map the YAML onto whatever intermediate format we'd be using and vice-versa (though if it were one note per file, it'd be trivial).

All of that said, I'm also looking (not too closely at the moment, since, as a mere mortal, I don't want to doom myself :p) with interest and enthusiasm at @ohare93's solution!

axelboc commented 4 years ago

@ohare93, Brain Brew (cool name btw 🍺) sounds awesome! Looking forward to seeing it work to get a better idea of what it will look like. I'm really excited about Brain Brew being extensible in terms of supported file formats!

Just so you know, I don't see a lot of value in being able to re-export the deck from Anki (I'm talking about our deck specifically). Mind you, I'm not against re-export, I just don't care if it's supported or not ... if that makes sense. Anki isn't a great tool for editing content (no search and replace, etc.), and you could only ever modify and re-export a single variant of the deck anyway.

@aplaice, internationalising the templates is not a huge priority in my opinion, though I'm now hoping Brain Brew could help with that eventually! 🎉

I do agree that using a translations management system would be ideal, especially when it comes to adding new locales (thanks to a Deepl integration, for instance...), which is why I had high hopes for Pontoon. Unfortunately, I haven't found any that is open source (or at least free) and that each of us could just download and install. Can you think of any?

On the topic of CSV, I think what I hate the most about it is having to scroll horizontally. Sure, you get an overview of all the notes and all, but unless you have a MacBook touchpad, horizontal scrolling really sucks... 😂 and the columns don't align in a text editor so you never know where to edit... and when you open it with a spreadsheet software, if you expand some of the columns then scrolling becomes clunky as it snaps to each column... I just don't think long strings of text are not meant to be organised like this.

ohare93 commented 4 years ago

@ohare93, Brain Brew (cool name btw ) sounds awesome! Looking forward to seeing it work to get a better idea of what it will look like. I'm really excited about Brain Brew being extensible in terms of supported file formats!

:grin: :+1:

Just so you know, I don't see a lot of value in being able to re-export the deck from Anki (I'm talking about our deck specifically). Mind you, I'm not against re-export, I just don't care if it's supported or not ... if that makes sense. Anki isn't a great tool for editing content (no search and replace, etc.), and you could only ever modify and re-export a single variant of the deck anyway.

That's fair, and of course it's different for each user and deck. However my best argument for why re-importing a deck back from Anki is important is all about user control and flexibility. If I am on mobile and notice a word is wrong I can fix it there, rather than make a note to do so, or just forget :sweat_smile: also while I agree Anki is not a great tool for editing flashcards there are plenty of Anki add-ons that already exist that are amazingly useful. E.g: Image Occlusion, Morphman, AwesomeTTS. The ability to keep using such great tools and not have pick between them and the structure of an autogenerated system in source control is what I want :wink: But as you said, is much much less relevant for this deck! :smile:

On the topic of CSV, I think what I hate the most about it is having to scroll horizontally. Sure, you get an overview of all the notes and all, but unless you have a MacBook touchpad, horizontal scrolling really sucks... and the columns don't align in a text editor so you never know where to edit... and when you open it with a spreadsheet software, if you expand some of the columns then scrolling becomes clunky as it snaps to each column... I just don't think long strings of text are not meant to be organised like this.

My concept for this is to split the csvs into one for each language, and have the current one with all the languages just be autogenerated by Brain Brew in some build step somewhere, ignored from source control. That would make the languages separate in terms of file structure, while still allowing people to see the big picture, should they want to. Hell, people can have any autogenerated csv they want: one with only all the capitals from each language side by side, one with all the country info, etc, etc. That way we can navigate the deck as we currently do (or better!) while keeping things more structured and separate.

That's just my idea for now, we'll see what actually functions as a workflow in practice :grin:

ohare93 commented 4 years ago

Fyi, I have Brain Brew setup in my move-to-brain-brew branch on my UG fork, with everything apart from Media moving over.

(Though I need to fix the card guids, as Anki-DM seems to double encode them for some reason :thinking: so the values in the csv do not match the cards directly, but only after they are re-coded :sweat: just planning on taking the card values and updating the csv with them!)

Just have a few tidy up things to do (like media!) and it achieves the same result as Anki-DM :grin: then it's time for the csv splitting functionality, but don't know if I'll get to that this weekend or not :sweat_smile:

If anyone wishes to play with it feel free and pester me to actually write some documentation... :cold_sweat:

Edit: I should mention, I have left in the files needed for Anki-DM, and it still works :+1: so one can run one, then the other and compare the results. Though BB only does the main deck's file right now (just haven't made the config yaml for the others) so that's the only one that will change. Also you should only need a pipenv to get BB working in this repo :+1:

ohare93 commented 4 years ago

My fork's Brain Brew branch now has now achieved the equivalent result Anki-DM :tada: Conversion generates an equal CrowdAnki file (though the note keys are in a different order though, but I'm hoping to fix that on CrowdAnki's end as it is pretty random).

Though the Note Model is pretty much hard coded just now, and I hope to improve the generation there in the near future. Note: changing the Card Templates, Deck description, etc in Anki and then running CrowdAnki to Csv on that exported file will indeed save that in the "Deck Parts" :+1:

One can run brain-brew src/build_config.yaml to convert Csv to CrowdAnki. By changing the "reverse" key in the build_config.yaml then running you can go from CrowdAnki to Csv :+1:

Media files are found recursively inside the src/deck_parts/media folder, and so they can be placed in any arbitrary folder location inside there. I have separated them into Flags and Maps for now, just for demonstration, but one can change it to literally anything. The media will will still receive updates when the conversion is done in either direction.

And I still have the main feature of Csv splitting to accomplish, so that the different languages can have their entirely unique csv :+1: Will update again once that is complete, and update the demo to write to all the language files too.

axelboc commented 4 years ago

That's amazing, congrats! I haven't had a look yet, but I will asap.

What were your thoughts on the idea of keeping translation strings together in a different format than CSV, as I suggested a few days ago, instead of splitting the languages into multiple CSVs?

ohare93 commented 4 years ago

That's amazing, congrats! I haven't had a look yet, but I will asap.

No rush, it's still a work in progress after all :grin: but any feedback will help guide me focus on the right things.

What were your thoughts on the idea of keeping translation strings together in a different format than CSV, as I suggested a few days ago, instead of splitting the languages into multiple CSVs?

I love the general structure you proposed, it would be pretty damn nice if it worked :+1: That said I can see some complexities in getting to there :sweat_smile: though the more I look at it, the more I want to take a crack at it just now haha

axelboc commented 4 years ago

I've had a first look at your branch. Here are a few questions to help me understand it all, and some initial observations:

I assume you kept most of the existing files for testing purposes, but as a result, I'm not quite sure what will stay and what will go 😄 Will the HTML template files remain, for instance?
The deck_parts folder seems to contain the output JSON split over multiple files. Is this folder meant to be committed to source control? It probably shouldn't , in my opinion.
The configuration files are quite numerous and verbose. It's probably by design, for maximum flexibility, but it's a little tricky to understand and might be prone to errors. I'm sure it can be simplified, though, and I think some documentation would help understand it better.

It definitely looks impressive and powerful. I'm looking forward to learning more about it!

ohare93 commented 4 years ago

Update: Csv splitting is now a go, and build perfectly Csv -> CrowdAnki :100: :tada: (opposite way pending, also only in my dev branch :tm:) say goodbye to the days of endless horizontal scrolling, and hello to the modern age! :grin:

You can find the examples of the current split I decided in my move-to-brain-brew branch, and they are displayable in the browser (since I added quotes to all lines, just for now :sweat:)

Here's a general example of how the info is laid out:

data-main.csv

guid	country	flag	map	tags
"e+/O]%*qfk	England	<img src=""ug-flag-england.svg"" />	<img src=""ug-map-england.png"" />	UG::Europe

data-capital.csv

country	capital	capital de	capital es	capital fr	capital nb
England	London	London	Londres	Londres	London

Brain Brew knows to join these together as it says so in the build config that data-capital.csv is a Derivative of data-main-csv

      csv_file_mappings:
        - csv: src/data/data-main.csv
          note_model: Ultimate Geography

          derivatives:
            - csv: src/data/data-country.csv
              note_model: Ultimate Geography

From that it looks for shared columns inside the derivative and finds only country. From there it finds a match for each row in main for country and adds the other columns to that dataset (capital and all the other language versions). This means that all the languages can be right next to each other for their respective column without having a completely bogged down and unusable data.csv :tada:

So in total there is now:

data-main.csv
data-country-csv
data-country-info-csv
data-capital.csv
data-capital-info.csv
data-capital-hint.csv
data-flag-similarity.csv

Have a look at each inside my move-to-brain-brew branch to get a better idea of the general layout.

Caveats

This is just a structure I thought was good, especially for demonstration, but I am open to whatever :smile: Also Brain Brew is very flexible and supports any csv joining you desire :+1:
The country column is inside all of the files, as they need something to match on. I thought this was better than guid, as it is more readable and use friendly. Though in the future if a country changes then it needs to be changed on all csvs. If it is not changed then Brain Brew will throw an error about a derivative row that has no matching parent, and the build will fail. So there is no chance of a data issue occurring from user error, in this regard.
There are many almost completely empty rows inside the csvs (Example below). That is just the nature of splitting up the data like this, but I have hopes in the future to add an override to create empty values should they be missing, as field in the config_yaml. This would remove the need for these empty rows.

country	capital	capital de	capital es	capital fr	capital nb
Coral Sea

This build of Brain Brew with Derivative Csv working is not yet published to PyPi, so if anyone wishes to test it themselves you can just pull Brain Brew install the CsvRestructure branch locally using pip install -e <path to brain brew dev>. I'd like to get Csv <- CrowdAnki functioning again (and fix all my now broken unit tests due to this massive restructuring :cold_sweat:) before publishing again.

Also I think it goes without saying but I'll say it anyways: I do not expect a PR soon! :sweat_smile: :smile: Brain Brew is still a work in progress, and things are changing rapidly with the structure. I'd like to iron out the design decisions for a while longer before I'd be comfortable even suggesting a PR. But that's where you all come in :wink: Your feedback on some fundamental questions would be fantastic:

What is the desired structure? Of course there's always a "better" way to do things just over the rainbow, but is this specific structure change to split csvs better than we have now, and therefore worth the hassle?
Is Brain Brew usable? I do not want to create a system that only I could come in and debug or change. To that end I hope to clarify the usage with some documentation soon (especially around the yaml config structure and general workflow) and make the error messages more clear.
What other features/automation could help? I want to keep developing Brain Brew, and need to know what to focus on.

Essentially I am asking will Brain Brew be the new data manager for Ultimate Geography at some point in the future? If so, what else do we think is needed? I am working on this open source tool for my own projects so development will continue regardless :grin: UG is a perfect example of something that can benefit from this, so I took it upon myself to try and implement the features needed here first :+1:

Alas, my programming holiday comes to an end on Monday :sob: so I will have less time to devote to this project (still some though!). Anyone is welcome to suggest and/or make changes though :wink:

axelboc commented 4 years ago

Haha, oh well, this answers a few of my questions, thanks! 😄

Content splinting

I do find the split by field very interesting! It seems more practical to me than a split by language. That being said, I remain convinced that a split by entity in a format like YAML would be even more beneficial for Ultimate Geography. I think it's also a safer splitting technique as a whole, as referencing the same entity across multiple files is error prone.

The ideal would be a split by language in a format like Fluent, combined with a translation management system for editing... But, as mentioned before, I don't think this is achievable.

I don't see the split as a blocker for moving to Brain Brew. I'd be happy to even keep the current unsplit CSV until we agree on (and implement) a split that we like.

Will UG move to Brain Brew?

I sure hope so!! 😂

Joke aside, Anki Deck Manager is great at what it does, but remains very limited. Translating templates, adding fields for the extended deck, etc. We've had quite a few issues that could be solved if we had a more powerful deck management system.

What I like about Anki Deck Manager, though, is that it relies on a very clear and straightforward file structure, as well as on a single, concise configuration file. If Brain Brew could achieve that too, moving to it would be a no-brainer.

I don't understand Brain Brew well enough yet to give you specific advice on how to get to the above, but I'm sure it's doable, and I'm looking forward to helping!

ohare93 commented 4 years ago

@axelboc you commented as I was already writing one, just saw yours now :sweat_smile:

I've had a first look at your branch. Here are a few questions to help me understand it all, and some initial observations:

I assume you kept most of the existing files for testing purposes, but as a result, I'm not quite sure what will stay and what will go smile Will the HTML template files remain, for instance?

90% of the files in src are not used by Brain Brew, and I only left them there so that people can still run anki-dm for comparison purposes :+1: If you look in the branch I mentioned, in src only the data and deck_parts folders are currently used, as well as the src/build_config.yaml file.

The HTML templates are not used, instead that data can be found inside deck_parts/note_models as a horrible hard coded value :disappointed: but I hope to add that in as a feature in Brain Brew to generate them in a similar way.

The deck_parts folder seems to contain the output JSON split over multiple files. Is this folder meant to be committed to source control? It probably shouldn't , in my opinion.

I have committed it to source so that others do not have to fuss about making it, as it is needed just now. I hope to make Brain Brew simply make them, should the files be missing. And then, of course, this would not be synced into source control :+1:

The configuration files are quite numerous and verbose. It's probably by design, for maximum flexibility, but it's a little tricky to understand and might be prone to errors. I'm sure it can be simplified, though, and I think some documentation would help understand it better.

There are only 2 necessary config files:

src/build_config.yaml (the one that is passed to brain_brew in the terminal command) this is the steps to take to build your thing
_config.yaml a global config setting which changes how Brain Brew operates, and points to where certain file types are so that most pathing can be eliminated from the build files.

The other two config inside src/transformers files are merely "subconfig" files, which can be used multiple times. So instead of writing:

deck_parts_to_crowdanki:
    headers: default
    notes: English.json

    file: build/Ultimate Geography/
    note_sort_order: []
    media: no
    useless_note_keys:
        __type__: Note
        data:
        flags: 0

  - deck_parts_to_crowdanki:
      headers: default
      notes: English.json

      subconfig: src/transformers/CrowdAnki-English.yaml

And include that bottom part in CrowdAnki-English.yaml so that it can be reused in multiple places/builds. Those key values are just copied into the location where subconfig is, and it is entirely optional to use a subconfig.

Again, excuse my lack of documentation on this, I hope you understand it is still very early and things are changing! :sweat_smile: Any questions are welcomed

It definitely looks impressive and powerful. I'm looking forward to learning more about it!

:100: :+1:

ohare93 commented 4 years ago

@axelboc I see there have been some interest in making more translations of the deck. Just an fyi, I have come round to your idea about custom Yaml files, and have been slowly implementing it :wink: no timeline yet, but I'll keep you informed :+1:

ohare93 commented 4 years ago

Suppose I should post here, instead of in #358 :thinking:

I have updated my Move-To-Brain-Brew branch in my own UG fork, and have it successfully transforming the standard deck into each langauge, exactly as it was working beforehand in Anki-DM. Though I hope people may find the structure itself more pleasing and usable :+1: especially having the files separated so that the csv is not a monster! :sweat_smile:

Quick notes then I have to shoot off:

Some files are already in the gitignore (such as the entire "brain_brew_build" folder (which is only named differently so that others may keep their current build folder to compare")) but other files can still go in there, like the notes yaml files. Those are just there for an example, and can even just be deleted and disabled from generating.
Install pip and pipenv, then run pipenv install to install the latest version of Brain Brew in your own local environment file and be ready to run it.
- Run the transformation with brain-brew builder_files/source_to_anki.yaml
- The transformation the other way, Anki to Source, I have not yet configured, but it will be quick to do, when I have the time :+1:
There is a folder src/anki-dm-leftovers with some files pertaining to the Extended deck, which I have left in the repo since I have not yet made the extended deck functional. I will delete them after I do that. All other files related to anki-dm have been deleted.
The combined csv is still in the repo, until the time at which the full switch occurs. I have left in the builder file combined_data_to_source.yaml which takes the data from the old one and splits it into the rest of the csvs, so we may use it later. Feel free to play around with it, ofc.
The builder file, that states how to do the transformation, is source_to_anki.yaml and may look a bit funky if you've never seen Yaml Anchors and Aliases before. They are fully Yaml specific, nothing to do with Brain Brew, and see this page to see just how fucking great they are: https://medium.com/@kinghuang/docker-compose-anchors-aliases-extensions-a1e4105d70bd DRY principle in preserved! :grin: Example:

 - generate crowd anki:
    <<: &default_crowd_anki_gen
      headers: default header
      note_models:
        deck_parts:
          - deck_part: Ultimate Geography
      media:
        from_notes: true
        from_note_models: true

    folder: brain_brew_build/Ultimate Geography
    notes:
      deck_part: english
- generate crowd anki:
    <<: *default_crowd_anki_gen
    folder: brain_brew_build/Ultimate Geography_de
    notes:
      deck_part: german

If anyone has any questions, please do shoot :+1:

axelboc commented 4 years ago

Looks awesome!! 👏 😍

I think I understand most of it and I like the CSVs split by columns, it's already such a huge improvement! The config is also a lot more understandable now.

Looking at combined_data_to_source.yaml, how do you tell Brain Brew which columns to put in each derivative? I don't quite understand that part.
The German capital hints in src/data/split/capital_hint.csv are not output properly. There must be a typo somewhere in the builder files, but I couldn't find it.

Overall, I have a much clearer picture of Brain Brew and a lot of feedback! I'll open an issue on your repo.

ohare93 commented 4 years ago

Looks awesome!! 👏 😍

I think I understand most of it and I like the CSVs split by columns, it's already such a huge improvement! The config is also a lot more understandable now.

I am very happy to hear that! 😁

Looking at combined_data_to_source.yaml, how do you tell Brain Brew which columns to put in each derivative? I don't quite understand that part.

Csv transformations are incredibly flexible, and the answer is that it takes it from the headers of the csv itself. Any that match the known data, that is. So if a csv has a column header, and that header is a key in the data, then it will get those rows. Derivatives are a little different, in that they are only valid as derivatives if they share one or more columns with the parent, and then they will only get the rows that match all of those overlapping columns. However, currently one must manually add in that row for the rest of the data to flow in in a transformation. So one must add a row with the Country 'England' to the Country derivative, in order to have the data flow into there. I hope to add in an option to automatically add in values into here, if there is data that exists that matches any of the other columns, but there is a limitation that stops me for now.

The German capital hints in src/data/split/capital_hint.csv are not output properly. There must be a typo somewhere in the builder files, but I couldn't find it.

Hmm, will look into this 🤔

Overall, I have a much clearer picture of Brain Brew and a lot of feedback!

👏

axelboc commented 4 years ago

Quoting @aplaice from https://github.com/axelboc/anki-ultimate-geography/pull/364#issuecomment-705226706

I have doubts whether per-field CSVs (capital.csv etc.) are better than per-language ones (french.csv etc.)

I have trouble making up my mind on this as well.

Translation management systems always keep translations separate from one another, in separate files. I guess the benefit of this is that each language can be developed in isolation, at its own pace and by different contributors ... but this doesn't seem to be the workflow we've been following so far: we've been trying to keep translations in sync as much as possible, and we contribute to and review (to various extents) all of the translations ourselves.

So I would tend more towards keeping the translations together like in the per-field split. Another thing I like about this split is that, for instance, countries that don't have capitals don't show up in the CSV for the capital field -- it's nice to not have as much clutter when reviewing a field's wording consistency across notes, notably.

On the topic of YAML, in https://github.com/axelboc/anki-ultimate-geography/issues/143#issuecomment-608562343 I had suggested a format of the sort:

- id: crr.AfnVRi
  country:
    en: England
    de: England
    es: Inglaterra
    fr: Angleterre
    nb: England
  countryInfo:
    en: Constituent country of the United Kingdom.
    ...
  ...

Obviously, this could still become quite large if we ended up with fifty translations... but so would a CSV file for a single field... the difference is vertical vs horizontal, and I think vertical is better for people (like me) who don't have a MacBook touchpad :D

In https://github.com/axelboc/anki-ultimate-geography/issues/143#issuecomment-608562343, I had also suggested that we could split the YAML file by note (one file per note like in the media folder), or by group (i.e. continent, dependent territories, or whatever).

YAML, and more so one YAML file per note, sure has some downsides -- the main one that comes to mind is checking wording consistency for a specific field across notes. That being said, my feeling is that it might be the solution that is the most flexible and scalable for us in the long term.

One thing I haven't considered yet is which format and split would be best for when we start having "extension" repositories. Should translations be managed in such repositories, for instance? If an extension repository were to include a new field for a country's languages, for instance, where would translations for this extension be stored? A lot of challenges ahead... :D Fortunately, as you said @aplaice, whatever we decide now can always be changed later!

ohare93 commented 3 years ago

Issue resolved :tada: Other improvement such as Yaml source files or federated decks can have their own issues/discussions :+1: