ankitects / anki

Anki's shared backend and web components, and the Qt frontend
https://apps.ankiweb.net
Other
18.72k stars 2.13k forks source link

refactor import/export code #1018

Open dae opened 3 years ago

dae commented 3 years ago

I'd like to refactor the import/export code soon, moving some or most of it to Rust. This will probably come after the scheduling rework.

When working on this:

hgiesel commented 3 years ago

I did work on some SpecialFields PRs, and knowing that you intended to refactor the export/import logic soon, I took some notes. While SpecialFields makes the importing algorithm much more powerful, I don't think the abstractions it provides are necessarily easy to understand. So the following are some ideas of how to better abstract the functionality SpecialFields provides:

Changing Note Type upon import

Private Fields:

Private Tags:

Update even if note type or note has older mod value

Update deck description

Private vs Public export

Of the aforementioned improvements I think "Changing Note Type upon import" generally makes sense to implement.

But for the other changes, they particularly make sense for sharing decks from the "deck maintainer" to the "deck user". However the current Anki export/import behavior would still make sense when the sharing happens between "user" and "user", or if a user wants to back-up his deck/progress.

I think two export/import behaviors could be grouped as follows:

  1. A public / shared export - import:

    • Don't export scheduling
    • Don't export private fields like described above
    • Don't export private tags like described above
    • Force update even if the mod is older
    • Update deck description (even if older) (?)
  2. A private / backup export - import:

    • Export scheduling
    • Export all fields, including private
    • Export all tags, including private
    • Don't import if mod is older (You can trust yourself, that the most recent change you made is the one you want)
    • Update deck description only if newer (?) For that it would also need a mod value, probably on deck? I'm not sure that's the case right now.

While it might still might make sense to allow for individual turning on-off of those features (?), I think abstracting them that way, would prevent overwhelming the user at first. (EDIT: Talking to Nick, he confirmed, that indeed all individual settings would be necessary)

AnKingMed commented 3 years ago

Happy to answer any questions in regards to this. @hgiesel has a great outline here

dae commented 3 years ago

Thanks Henrik, that looks really helpful. The scheduler work will likely come first (I've already done some initial work on it), but will come back to this.

dae commented 3 years ago

For reference: https://forums.ankiweb.net/t/import-problems-suggestions/5856

dae commented 3 years ago

Also for reference: https://forums.ankiweb.net/t/importing-rewrite-note-sort-order/13449

dae commented 3 years ago

Leading/trailing whitespace causing issues: https://forums.ankiweb.net/t/import-export-causes-unwanted-duplicates/14060

dae commented 2 years ago

Another one: https://forums.ankiweb.net/t/stats-ruined-when-importing-decks-from-others/15901/4

dae commented 2 years ago

Specifying deck: https://forums.ankiweb.net/t/deck-field-option-for-file-import-csv/18275/3

abdnh commented 2 years ago

Reminder to add checks for invalid card IDs: https://forums.ankiweb.net/t/issue-with-huge-card-ids/12165

dae commented 2 years ago

Just a brief update on this:

Rumo has implemented .colpkg import/export, and is currently looking into apkg exporting.

The plan is to approach the update in two steps:

1) porting the existing built-in functionality, without trying to tackle all the points mentioned above 2) once we've got a solid base, incrementally adding the new functionality/features mentioned above

While we work on the two steps, the old Python code can be kept around, so that we don't break existing workflows using the SpecialFields add-on.

dae commented 2 years ago

New apkg and csv import/export implementations are now in main, with functionality roughly equivalent to the old Python code. Now we can turn our attention to the new functionality suggested above.

@hgiesel gave us a great summary. Some thoughts/discussion starters on the various features:

Private fields:

Henrik's suggested approach sounds good here - we could use a checkbox like "[x] Exclude in shared decks" in the Fields dialog.

Private tags:

Private vs public export:

I quite like this idea - we already use the scheduling option to control whether flags should be stripped, and the two use cases are quite different. @AnKingMed, could you explain in more detail which of the lumped settings you want to be able to control independently, and why?

Updates when notetype has changed:

This is probably the hardest part. I think it gets even more complicated if we try allow the user to merge fields, or worry about preserving the content of fields that no longer exist.

Brainstorming here - how does the following sound?

AnKingMed commented 2 years ago

Private fields:

Private tags:

Private vs public export:

image

Note that the special fields add-on is just for importing. Ideally, exporting would also be considered and users could choose to not export certain fields. We've also recently made an add-on for "export single tag for sharing" so that users can export just a single tag and strip all the other tags out. This is really useful when you tag cards by a lecture, export just that tag, and the person you share with can then import that file with "all fields are special" and "combine tagging" so that it essentially just imports that one tag without affecting any field content or any of the existing tags https://ankiweb.net/shared/info/960563361

hgiesel commented 2 years ago

Private tags:

I think you suggested to make that list of prefixes configurable, right? I'd argue against that. Tags are something that is used across decks and notetypes. Having configurable tags could make sharing more complicated than it needs to be.

Example: User A exports a deck with notes that are tagged abc. User B now imports this deck, but has abc configured as private. Will the tag be imported or not? And if it is imported, is immediately recognized as private in B's collection, which would make passing it on to user C more complicated?

Instead, what do you think about not making it configurable, but use a prefix or infix of #? Of course marked and leech would also be private for a legacy reason.

image

E.g. this subtree would show the tags medicine::tag and medicine::#::todo. So any tags within #:: are inlined into their parent. And the private tag would mark it explicitly for the user.

Any repeated #:: would then be treated normally again. The "magic behavior" would only pertain to the first use of a # subtag.

Private vs public export:

AnKingMed commented 2 years ago

There are times when a user has 100 tags and user B has the same 100 tags (using the same shared deck). User B chooses to alter some of those 100 tags, but also adds a new tag. They want to share that new tag with user A, but aren't ready to share the other changes yet. In this case, there needs to be a way to share just that singular tag without updating the other tags

For deck description, there are many instances in sharing updates with each other to compile things that I've needed to protect the deck description (i.e. I as the deck creator had updated it, but I was importing files from other people for the new update and they had the old description).

For 99% of people, they need just a couple functions, the other functions are for the deck creators and maintainers. Although it's significantly less people using those functions, they're absolutely necessary. I've used every combination of the settings I outlined above in creating and maintaining decks and have a few years of experience with it :)

dae commented 2 years ago

User A exports a deck with notes that are tagged abc. User B now imports this deck, but has abc configured as private. Will the tag be imported or not? And if it is imported, is immediately recognized as private in B's collection, which would make passing it on to user C more complicated?

It's a good point, and having a hard-coded list of private tags would allow us to simplify the UI. But if deck authors want asymmetric handling of private tags (eg excluding some on import that they'd like included on export), I'm not sure such an approach would work (unless they renamed the tag prior to import/export, which is impractical). I presume that's what you are doing with the MissedQ tag @AnKingMed?

Private vs public export:

Note that the special fields add-on is just for importing.

I think we may have had our wires crossed here. I was asking whether a single "for public/for private" checkbox could work on deck export. All the options you're describing seem to be related to importing, which is a separate matter.

AnKingMed commented 2 years ago

@dae the MissedQ tag is just my personal tag that I use when I miss a practice question and decide a flashcard is relevant to that.

Sorry I didn't understand this was just exporting, however it's probably still worth considering the whole picture when addressing exports and imports.

I think the most ideal way to handle exporting tags is to select exactly which tags you want to export. For example, a dialog that has "select all" or "select none" and then you can choose

dae commented 2 years ago

I'd prefer to avoid a list of all tags if possible, as it adds a fair bit of extra implementation complexity.

I'm still trying to understand the different use cases that are covered here. Does the following sound correct? Are there other cases I've missed?

1) There are some tags you'll always, or almost always want excluded 2) Sometimes you want to share a single tag with others. By that, I presume you mean a) "only notes with that tag", and b) "preserve any other tags the notes happen to have when importing"?

Could 2b) perhaps be handled when importing, instead of when exporting? Eg if the importer allowed you to only import specific tags? You could accomplish 2a) by searching in the browse screen, then exporting from there.

AnKingMed commented 2 years ago
  1. true
  2. I mean a) only notes with that tag, but I only want to export the notes with that tag and I want all other tags to be stripped (note that if I choose a head tag, I want it to include all subtags and notes with subtags as well)

2b is definitely an importing issue.

There are also instances, however, where I want to export multiple tags. The solution right now would be to export a single tag multiple times and then import each file into the new profile. I think the list of tags is ideal if at all possible if you're going for the best solution

dae commented 2 years ago

Does the single-tag filtered need to be done on the exporting side though? Couldn't the filtering be done on the importing side, if the importing screen provided a way to include or exclude specific tags?

AnKingMed commented 2 years ago

I think that's much more appropriate on the exporting side. I generally share tags with others and I only want to share certain ones. It would be an extra step to have to share everything and then tell them what to do

RumovZ commented 2 years ago

For 99% of people, they need just a couple functions, the other functions are for the deck creators and maintainers.

Is the goal really to support all the use cases mentioned here? Call me Bernie Sanders, but I think we should focus on the 99% who'd have a tough time figuring out the correct settings for their use case. Also, I think apkg is the wrong format to solve these problems. As it's simply a collection, it can't represent diffs and has to rely on external information to work as a a half-decent update. This places a burden not only on the exporter (e.g. deck author), but also on the importer to get a lot of settings right. Instead, I propose to implement these things with the new anki-json format. JSON is self-describing, so all the relevant settings could be set on the exporter side, requiring the end user only to click import. It truly can represent a deck update, because missing data is filled in with Anki defaults or existing data. No need to ship a whole new deck just to update a single tag. And lastly, it's plaintext, meaning you can version-control it.

Changing Notetypes

When importing .apkg files, the import dialog provides a dropdown "On schema change", with the options ["Preserve", "Update (full sync)"].

I think you're right that embedding a change-notetype feature is not the right way, @dae. It would be quite powerful, but easy to mess up and would still not solve all the problems.

When we encounter a note that we'd like to update where the notetype has a remapped id, we assign the new notetype id to the note, and update its fields.

This would mean that subsequent updates would continue to cause conflicts. Let's say you have a notetype with id 1 and schema A and the import contains one with also id 1, but schema B. Then a notetype with schema B and id 2 is added to your collection. Then you do another import with the same notetype, but there's still no notetype with matching id and schema, so B3 is added.

What about merging the notetypes? The new notetype would contain the union of fields and card types. The user may decide if card types with the same name, but different templates would be updated or preserved. In case the imported notetype only contains a subset of fields and card types with the same templates (which should be the case on subsequent updates), this would not need to be treated as a conflict and it could be trivially mapped into the existing notetype.

Private vs. Public

@hgiesel's proposals sound good to me.

But if deck authors want asymmetric handling of private tags (eg excluding some on import that they'd like included on export)

If there's only one toggle for importing/exporting private fields or not, it seems reasonable to me to treat private tags the same.

AnKingMed commented 2 years ago

Is the goal really to support all the use cases mentioned here? Call me Bernie Sanders, but I think we should focus on the 99% who'd have a tough time figuring out the correct settings for their use case.

I disagree with this. The 99% strongly rely on the 1% making high quality decks in these instances and focusing energy on high quality deck making utilities is important.

dae commented 2 years ago

You both have valid points. Deck authors do bring a lot of value to the ecosystem, so we can't ignore their needs. At the same time, there is a higher bar for entry into the core code than in add-ons, and our goal here is to identify the important features and try to add them in a way that is maintainable, and fits in with the rest of the UI (the current special fields add-on UI is not particularly understandable).

@AnKingMed could you give us some "user stories" about why you're exporting individual tags, so we better understand your workflow? What sort of information do these individual tags represent, and why do you need to share them with others? Would an "only add, don't replace tags" option on import work in some/all of such cases?

@RumovZ re using JSON for this: that is a good point, and JSON may well be more appropriate here. With colpkg/apkg/csv importing mostly done at this point, and fallback to the old importers with the new setting enabled, I feel like we're in a good place now. So long as we continue to support the legacy codepaths, there's no need to rush to add these features to our apkg importer, and perhaps it's worth pursuing the JSON importer further first.

While features focusing on shuffling data between collaborating deck authors may make more sense for the JSON exporter, some of the other proposals like private fields and tags probably still make sense for the apkg exporter, as they would be useful in the more common case of deck author distributing changes to end users. But presumably in that use case, we don't need to worry about things like single-tag imports.

This would mean that subsequent updates would continue to cause conflicts. Let's say you have a notetype with id 1 and schema A and the import contains one with also id 1, but schema B. Then a notetype with schema B and id 2 is added to your collection. Then you do another import with the same notetype, but there's still no notetype with matching id and schema, so B3 is added.

Yep, good point. The way the original import code addressed this case is to add 1 to the old notetype id when adding a new one, and keep doing so until it found a free id or found a notetype with a matching schema.

What about merging the notetypes? The new notetype would contain the union of fields and card types.

That feels more complex than the approach above - WDYT?

Re: private tags, a few thoughts:

RumovZ commented 2 years ago

I'm not saying we should ignore deck creators, just address their needs in some way that doesn't aggravate the experience of the average user. E.g. by putting highly specialised features in an add-on or a part of Anki, that's only for advanced usage anyway (JSON export).

Changing Notetypes

That feels more complex than the approach above - WDYT?

Sure, but it's also much more useful. Say you add a private field to a notetype from a shared deck. Then you receive an update for its notes. With Preserve you'd miss out on the updates, with Update you'd lose your private field data. Merge would offer the best of the two worlds. The user would only need to tell Anki whether to update the card templates.

Private Tags

A separate node would better fit the hereditary nature of marking tags as private, but it would introduce the concept of meta tags. No idea what would be less complex to implement in the end. I'd prefer adding a label to each private tag, too, but to avoid clutter, maybe an alternative icon for private tags would be more appropriate. As for sorting, my understanding was that @hgiesel wanted to strip the prefix/parent beforehand.

And with a prefix match, either '#foo' or '#::foo' would be valid, so users could nest them if they wanted to.

Hmm, then # as a tag would get special-cased? That would seem to add the complexity of a separate node as well. As I understand, the idea is to map prefixed tags to flagged tags in the GUI (#foofoo (private)). This might get more cumbersome if there were two different ways to map foo (private): #foo or #::foo.

Now that I think about it I see some more issues with both a prefix and a parent tag:

Maybe adding a dedicated flag to the tag config would be better after all?

AnKingMed commented 2 years ago

@AnKingMed could you give us some "user stories" about why you're exporting individual tags, so we better understand your workflow? What sort of information do these individual tags represent, and why do you need to share them with others? Would an "only add, don't replace tags" option on import work in some/all of such cases?

With these big premade decks, often one student will go through a lecture and tag all relevant notes. They can then share that individual tag and the other user can import that individual tag. User #2 may, however, have changed some of the premade deck tags and importing all of User #1's tags into their database would either make it really messy or overwrite all of that

dae commented 2 years ago

Sure, but it's also much more useful. Say you add a private field to a notetype from a shared deck. Then you receive an update for its notes. With Preserve you'd miss out on the updates, with Update you'd lose your private field data. Merge would offer the best of the two worlds. The user would only need to tell Anki whether to update the card templates.

Good point about the private field case. Would this work in other cases like renames though? For example, a deck author publishes a deck with "field 1" and "field 2", which the user imports. Then they rename the fields to "field a" and "field b", and reshare the deck with a modified template. If we were to take a union of the notetypes, you'd end up with two copies of all the field content, and two templates, one showing field 1/2 and one showing field a/b. I suspect this is the most common workflow, where users consume shared decks and want to fetch new versions from the deck author, without maintaining any local changes.

You can basically have the same tag twice in your collection. Say you have both #foo and foo in your collection. What happens if you import a deck with foo? Is this tag private or not?

My thinking is that #foo and foo are two distinct tags, just as foo and Foo are two distinct symbols/variables in most programming languages. The Go language uses case to denote public/private methods, so a class can have a private "foo()" and a public "Foo()" at the same time if it wants. All our code would have to do is check if a tag started with a # (or optionally, its ancestor) to decide if it's private. The browser could optionally add a (private) label to such tags to make the meaning of the symbol more clear, but I do not think it should hide the # symbol, as that would be confusing if there also happened to be an unprefixed tag of the same name.

Toggling privateness of a tag would need to modify all notes having that tag.

True, but I'm not sure that's a big deal for tags that the user would permanently want private. It also might play better with mtime-based update checks - consider the case where a note with mtime 123 and tag foo is exported, and imported into a user's collection. If it had been marked as private separately, and the user toggled it to public and re-exported, the importing side would not see any mtime change.

For ephemeral exclusions it would be awkward, but perhaps we could handle such cases a different way?

RumovZ commented 2 years ago

Would this work in other cases like renames though?

No, it's meant as an alternative beside Preserve and Update, or possibly as a replacement for Preserve.

but I do not think it should hide the # symbol, as that would be confusing if there also happened to be an unprefixed tag of the same name.

Yes, the suggestion to have the GUI strip the prefix seemed reasonable to me at first, but I've realised the problems you describe. However, wasn't the idea to not only skip private tags when exporting, but also when importing? This at least seems impractical with the prefix approach for the reason I've mentioned above.

If it had been marked as private separately, and the user toggled it to public and re-exported, the importing side would not see any mtime change.

I gather the problem is that the tags table lacks an mtime column? Sorry for diverging, but as tags have continued to become more stand-alone and powerful, have you ever considered giving them ids (and mtimes)? For one thing, that would make renaming and reordering them a much smoother experience.

dae commented 2 years ago

wasn't the idea to not only skip private tags when exporting, but also when importing? This at least seems impractical with the prefix approach for the reason I've mentioned above.

Sorry, I couldn't locate this - could you quote or restate the issue here? Wouldn't it be as simple as the following?

existing_note.tags = remove_all_except_private(existing_note.tags)
for tag in incoming_note.tags:
  if not has_private_char(tag):
     existing_note.tags.add(tag)

I gather the problem is that the tags table lacks an mtime column?

maybe_update_note() compares note modification times. If privateness were a property stored in the tags table, toggling it would not update the mtime on notes that use the tag, and thus the change in available tags on import would not be known to the importer, as the mtime would still match.

Sorry for diverging, but as tags have continued to become more stand-alone and powerful, have you ever considered giving them ids (and mtimes)?

Very old Anki versions stored the tags in a separate table, instead of embedding them in a note. I seem to recall it not performing well at the time, as the tags table could grow into the millions of entries. That may have been poor table design though; I do not recall the details off the top of my head.

Recently I've been working a bit on the AnkiDroid code to see if I can give them a bit of a push towards moving to the new backend. Until they've fully bought into it and can update to new versions promptly, we're somewhat limited in the breaking changes we can make, as I don't want clients to have to go through expensive upgrade/downgrade steps just to talk to each other.

RumovZ commented 2 years ago

Sorry, I couldn't locate this - could you quote or restate the issue here?

My bad! I was referring to the first answer here, but upon reading it again, I realise it doesn't say what I thought it did. I'm still not sure I understand the problem with making privateness a flag. Do you mean the following scenario?

  1. User exports note with tag foo and mtime 123.
  2. User makes foo private.
  3. User reimports note (privately).

Wouldn't we just update the tags separately from the notes, like we currently do with most other objects?

Until they've fully bought into it and can update to new versions promptly, we're somewhat limited in the breaking changes we can make, as I don't want clients to have to go through expensive upgrade/downgrade steps just to talk to each other.

I understand, but do you think this may happen in the long term? I wonder if I should think of tags more like full Anki objects or rather plain strings with some additional related metadata.

dae commented 2 years ago

I understand, but do you think this may happen in the long term?

It will require rewriting the majority of our tag handling code, so the gains we get from doing so would need to be fairly large to justify the effort. Operations that modify the text of tags are the hierarchy would become faster, but accessing the tags of a note would require looking them up in the DB and potentially caching the results. The syncing protocol would need adjusting, and importing would need to deal with the tags as strings, as the tag ids in the source and target decks may not match.

I wonder whether it's worth the effort?

RumovZ commented 2 years ago

Thanks for explaining, I finally understand. This would indeed make updating a bit more cumbersome. Don't we have the same problem with private fields? There the mtime bump happens in the notetypes table.

Currently, an exported note is an exact clone of its source note, so it rightfully has the same mtime. With privateness, we create a divergent ad-hoc note, so the mtime may or may not make sense. Maybe we need to invariably bump the mtime on partially exported notes?

dae commented 2 years ago

You're right, it would be a problem there too.

Bumping the mtime each time could cause newer changes to be overwritten with older ones. One alternative I can think of is to check for source.mtime >= target.mtime instead of source.mtime > target.mtime, ie updating in the matching mtime case. Ideally we'd only add them to the list of updated notes if the tag or field content had actually changed, so we'd need to compare the two in the identical case before applying/counting the changes.

RumovZ commented 2 years ago

Right. I think we should check for identity in the source.mtime > target.mtime case as well, because the newer mtime may result from changes to private fields alone. In which case we also might overwrite newer changes with older ones, by the way.

I think this once more shows the inadequacy of using apkgs for updates. Would be great if deck authors could generate a JSON file only containing the diff between their originally shipped deck and current collection.

dae commented 2 years ago

I suspect JSON imports will improve on some issues and introduce others, and diff-based updates rely on the target collection having the same or similar base point as the collection that was used to generate the diff.

In any case, it sounds like you're more enthusiastic about the JSON side of things at the moment, and we don't currently have a fully-fleshed-out implementation to compare to the existing functionality of that add-on. So maybe it would be worth investing some more time on that side of things first, and then circling back to the apkg side of things later?

RumovZ commented 2 years ago

Sounds good!

and diff-based updates rely on the target collection having the same or similar base point as the collection that was used to generate the diff.

I imagine deck authors would share an apkg initially and then offer JSON-based updates. These would be much smaller in size and less likely to conflict with any changes endusers have made. If you tried to import an update without the base deck, probably nothing would happen, as all the referenced ids would resolve to nothing. For this workflow to work, we'd need a way to include media, probably reusing the archive tooling for apkgs. This reminds me, we should also add some meta info with a version number we can bump when making breaking changes.

Does this sound like a direction in which you'd like to go? (Of course, I'd try to find some low-hanging fruit first and leave media and so on for a later iteration.)

dae commented 2 years ago

Agreed on dealing with the media later; one possibility would be to allow a .ankijson file inside an .apkg container instead of an .anki2 (we'd need to decide whether that would be better than a separate extension or not)

Deck authors tend to both add new cards and modify existing ones, as well as sometimes modifying the notetype structure or changing notes to a different notetype, so ideally we will handle all of those cases

Does this sound like a direction in which you'd like to go?

Sounding good. Apologies I haven't been as involved recently; I've been diving into the AnkiDroid code lately, and have also had other non-Anki related things going on.

RumovZ commented 2 years ago

Repurposing apkgs is an interesting idea. I myself was taking a bit of a break and didn't perceive any lack of involvement. And I'm very excited to see AnkiDroid's backend integration coming along!