Hanzi Graph Improvements

Nick3C commented 14 years ago

1) split the area of the graph into the 4 different hanzi hsk levels in the same way "due" splits young and mature cards.

2) have a cropped axis so that the graph looks sensible when the numbers are large (perhaps the highest number minus 10 characters, or "0" if the number would otherwise be negative). This must be done, or long-term users (like me) get a solid block and can't see the relatively small increases at larger scales.

3) correct the x axis, which shows: today+1 for "7 days", today+5 for 1 month, today+20 for 3 months, and today+50 for 6 months and beyond [note: these figures differ with different decks, as my vocab deck has much longer times]. It should only show today, as it is a backward-looking graph (even tomorrow is irrelevant [and certainly is not 0]).

4) deal with a bug that causes cards to be shown before the creation of the deck. My vocab deck shows characters 1500 days ago (about 1100). I suspect this is caused by the randomisation algorithm in Anki, which causes cards to be made due in the past. The obvious solution is to stop looking for new cards when we reach the creation date of the deck (a stored value) and show 0 before this.

Nick3C commented 14 years ago

5) With the new version that picks up Mandarin, the graph shows cards with this tag, not facts. I have 27,000 cards, which makes my graph a bit rubbish :p [1,000 incorrect before deck creation, then straight up to 27,000 at -130 days (which is wrong, by the way!)]

batterseapower commented 14 years ago

I actually implemented 1) before I read this. Check the fork :-)

2) This SHOULD be happening already.... but there was a bug. Fixed.

3) As 2). Fixed.

4) Hmmmmmmmmmmm. Have you got a small example deck?

5) OK, hang about: so the problem is that the Mandarin tag is checked against the card tag rather than the model tag?

Nick3C commented 14 years ago

1-3) hahaha, great minds think alike :)

3) your new fixes are a big improvement, but you appear to be out by a day in relation to the left side of the x-axis. For example, 7 days shows 6 then -7 is blank. The same is true for the other graphs.

4) not as yet but I can send a screen-shot if that is any good (sharing my big deck is a nightmare) it may be linked to 5... let's see...

5) I guess so...

Nick3C commented 14 years ago

6) almost all of my cards show as non-hsk. I rather suspect that the problem is that all hanzi are being counted for the graphs and not unique hanzi [as even if cards were being scanned only the earliest one should be placed on the graph]. I think the uniqueness check is only applied for the count towards hsk (which is why the numbers are so low)
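
A minimal sketch of the uniqueness rule described in 6), assuming each card is reduced to a (text, day) pair where the day is an offset from today (negative means in the past). The function names and the CJK range check are illustrative assumptions, not the toolkit's actual code.

```python
from collections import Counter
from itertools import accumulate
from typing import Dict, Iterable, List, Tuple


def is_hanzi(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def first_seen_days(cards: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Map each unique hanzi to the earliest day on which a card containing it was answered."""
    earliest: Dict[str, int] = {}
    for text, day in cards:
        for ch in text:
            if is_hanzi(ch) and (ch not in earliest or day < earliest[ch]):
                earliest[ch] = day
    return earliest


def cumulative_unique(cards: Iterable[Tuple[str, int]]) -> List[Tuple[int, int]]:
    """Return (day, running total of unique hanzi) pairs in chronological order."""
    per_day = Counter(first_seen_days(cards).values())
    days = sorted(per_day)
    return list(zip(days, accumulate(per_day[d] for d in days)))


# Hypothetical cards: 中 appears twice but is only counted on its earliest day.
cards = [("中国", -10), ("中央", -30), ("电视台", -5)]
print(cumulative_unique(cards))
```

Counting each character once, on its earliest day, keeps the cumulative line from jumping every time a known character turns up on a later card.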

Nick3C commented 14 years ago

Only about 2500 are shown as having an HSK level (which is the approximate number of unique hanzi in my deck).

Nick3C commented 14 years ago

7) I am not convinced by the colours yet (but as the size means I can't see them I withhold my judgement) :)

8) Suggest shortening the labels to Basic, Inter., Ele., Adv., non-HSK (and putting them in that order). The graph title can be changed to "Unique Hanzi (Cumulative, by HSK Level)" [I know it only shows initiated ones, but I think it is fine to skip the "learned"].

batterseapower commented 14 years ago

3) Fixed

4) I don't understand this problem. The code already filters out cards which have been repeated <= 0 times. Maybe your deck is a copy of someone else's, who had already repeated the cards, and Anki didn't blank them?

5) As above. I think I'm only including cards in a Mandarin model as expected

6) I don't think this is due to uniqueness, but rather a bug in my color labelling. Try again now.

7) Yes, I'm not sure about them either. Quite hard to distinguish.

8) Done.

Nick3C commented 14 years ago

2) This works now, but the numbers on the y-axis are wrong and indicate that the [minimum minus 10] is 0.

4) No, my deck is all self-generated (although I have imported chunks it has come from plain text files).

The answer to this problem is that I have a model that is interfering. I am not quite sure how yet. I went through my models, deleting one at a time, until I found the cause. The model is the one containing the 214 Chinese radicals. Once this is deleted the deck is fine. I found that deleting the other models made no difference, nor did deleting the card templates in my main vocab model.

I tried exporting the facts and importing them into a new deck but couldn't reproduce the problem. So... I was forced to do the ultimately evil task of sifting through the whole deck to find out... this was a bit time consuming but it is done now. iT IS SOME

I have managed to isolate many cards that are acting crazily. For example, search for 中央电视台 and delete it: the hanzi count on the graph drops from 1200 to 1185, even though the card has only 4 hanzi on it. MANY of the radicals do similar things. The non-HSK characters seem to be counted many times. Deleting about 30 cards reduces the character count by several hundred.

In my vocab deck many more cards are like this.

The deck passes a database scan.

Not sure about the other problem with early dates yet. Check your email for the deck.

Nick3C commented 14 years ago

I experimented with colours and tried a blue/red gradient instead of red into black. The 5 stages work a bit better. I'm not sure about this one either, but it achieves a similar effect to the other colours while still making it possible to distinguish each layer. (pushed to git)

batterseapower commented 14 years ago

2) The minimum -10 is capped at 0 and so I think this is correct behaviour. I don't really want to show negative card numbers.

4) Thanks for the example. I'll take a look.

batterseapower commented 14 years ago

Hmm, well I think I can see part of the problem. The radical cards contain (in the Examples field) almost all of the characters they form a part of, and the HanziGraph code looks in any field for Hanzi to consider 'learnt'. (This is copied from the design of the KanjiGraph, BTW). You probably expect that only the Expression field is considered, however...

Not sure what's going on with characters appearing in the distant past, though!

batterseapower commented 14 years ago

Well - give my fork a go now. I've restricted it to just values in Expression fields, and now your example deck reports only 27 or so unique Hanzi, of which half are HSK.
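
The difference between scanning every field and scanning only Expression can be sketched as follows. This is an illustration under assumptions: a fact is modelled as a plain dict of field name to value, and `hanzi_in_fact` is a hypothetical helper, not the function in the fork.

```python
from typing import Dict, Iterable, Optional, Set


def hanzi_in_fact(fields: Dict[str, str],
                  allowed: Optional[Iterable[str]] = ("Expression",)) -> Set[str]:
    """Collect hanzi from the allowed fields of one fact.

    Passing allowed=None scans every field, which mimics the old behaviour.
    """
    found: Set[str] = set()
    for name, value in fields.items():
        if allowed is not None and name not in allowed:
            continue
        found.update(ch for ch in value if "\u4e00" <= ch <= "\u9fff")
    return found


# Hypothetical radical fact: one character to learn, many in the examples.
radical = {"Expression": "水", "Examples": "江河湖海"}
print(len(hanzi_in_fact(radical, allowed=None)))  # scans Examples as well
print(len(hanzi_in_fact(radical)))                # Expression only
```

With a model of 214 radicals, each carrying a long Examples field, the all-fields scan would inflate the count in the way described earlier in the thread.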

batterseapower commented 14 years ago

Like the new colours by the way :-)

Nick3C commented 14 years ago

oops, that was seriously dumb of me. I had just presumed it would only be expression... forgot about all the extra characters there.... sorry to make such a fuss and be wrong :$

Ah, axis is working perfectly now too...

Not sure about those characters in the past. There are about 1200 of them (I can't see any of the basic-level HSK in this graph; I switched to another deck to look at it). There's no way to search for them using the editor... I will see what I can manage with sqlite...

I will look into the characters in the past.

Nick3C commented 14 years ago

Hmm, any idea how these strings are converted into dates? eg first answered: 1222425551.359

batterseapower commented 14 years ago

Yes: the units are seconds. Probably it's the number of seconds since the Unix epoch (1970). You can work out how long ago it was by subtracting it from the time now (e.g. 1244500161.9510961 is 23:30 on 8th June 2009 - look at time.time() in the Python prompt) and dividing by 86400 (the number of seconds in one day).
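
The arithmetic above, as a short Python sketch. It assumes the stored values really are seconds since the Unix epoch and prints in UTC, so the clock time will differ from a local-time reading; `days_ago` is a hypothetical helper name.

```python
import time
from datetime import datetime, timezone
from typing import Optional

SECONDS_PER_DAY = 86400


def days_ago(timestamp: float, now: Optional[float] = None) -> float:
    """Days elapsed between a Unix timestamp and now (or a supplied reference time)."""
    if now is None:
        now = time.time()
    return (now - timestamp) / SECONDS_PER_DAY


first_answered = 1222425551.359
# Interpret the value as seconds since 1970-01-01 00:00 UTC.
print(datetime.fromtimestamp(first_answered, tz=timezone.utc).date())
# Age relative to the example "now" quoted above.
print(days_ago(first_answered, now=1244500161.9510961))
```

A value of 0 run through the same conversion lands on 1 January 1970, which is why a missing timestamp shows up as a card answered in the distant past.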

Nick3C commented 14 years ago

I have asked on the forum to see if Damien has a comment. I think it is an issue with Anki. The graph is working fine (and works fine with my other decks, or if I remove material). The only issue is why Anki has recorded ancient dates in the deck (which is an Anki issue, not a PyKit one). Closing issue.

Nick3C commented 14 years ago

I can't work out the problem with this. I used an SQL browser to find the earliest card, looked it up in Anki, and the date was fine.

Can we fudge this by not looking any further back than the age of the deck, please?

batterseapower commented 14 years ago

Well, we CAN. However, the broader issue is that this might be symptomatic of cards being attributed dates earlier than their "real" ages more generally, which would make the stats completely wrong. In that case, dropping cards older than the deck would just mask the problem without really addressing it.

I assume you haven't been able to build a deck reproducing this behaviour, then?

Nick3C commented 14 years ago

Ok, I've managed to isolate two separate bugs with this.

The first (the old-age bug) seems to be an Anki bug. For some reason the first-answered data is being lost after a card has been answered, setting the value to zero. It could possibly be something to do with the website (but who knows). Anyway, as you explained, setting this to zero means it is treated as zero relative to a certain date (be it the Unix epoch or whatever). This zero date is in the distant past, thus the cards appear to have been answered in the distant past.

I will raise this with Damien, but we are never going to be able to correct this and get the missing data back. This means we need to work around it, and now that we know what the problem is, we can. It is not many errors, but simply the result of data loss. We can correct for it by replacing "0" with the creation date of the deck before passing the data to the graph plotter. This is much better than wiping it out, because we lose the bottom of the graph when the data is scaled into the distant past. I know this makes the graph imperfect, but the deck data is imperfect and this is the best we can do with imperfect data.

Second bug: wrong number of facts reported on graph. If you look at the fact browser you will see there are 4 and 1 cards in the two decks, respectively. However, these are reported on the graph as 8 and 7 respectively. I checked the SQL and it reported the correct number of facts and cards, which suggests this is our bug, not Anki's :)

This was pretty tricky to nail down. Sorry it took a while :)

Nick3C commented 14 years ago

Ahhhh, I have found the bug. When you use the reschedule card feature it increments the number of times answered but doesn't record a first-answered date if none is present. The result is cards that lack this data. If you later answer a card (when it becomes due) it will have this data added. In the interim it has no first-answered data and so is treated as being answered at date = 0. I will report the bug to Damien for correction.

My proposed solution will correct this perfectly in the interim, and the card will get new first-answered data (albeit slightly wrong) when it is next reviewed.

phew!

batterseapower commented 14 years ago

Presumably, rescheduled but not answered cards are not "learned", so it would be better to simply exclude things with 0 as the first answered time?

Nick3C commented 14 years ago

No, there are two ways to reschedule: 1) reschedule as new (it is as you say but the card is still marked as new so this problem doesn't exist anyway) 2) reschedule between two dates in the future (i.e. push ahead when you know them really well and want to skip a slow build-up of initial interval).

I used feature 2 on a whole load of imported stuff (from non-SRS flashcard software). I knew it already and didn't want to slowly build it up over time, so I pushed the interval to between half a year and a year. It is thus marked as not-new and in every other way treated as a review, but no first-answered time is added, which is a bug. This is the main purpose of the feature, so such materials are necessarily known (or the user is misusing it, but we can't budget for that :p ).

I have alerted Damien to the issue and he has replied and seems inclined towards fixing it for the next release.

Nick3C commented 14 years ago

Oh, I'll tell you what, how about we use the "date-added" for any card with a zero value? That should be much closer to the real value than the deck creation time.
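
A minimal sketch of this fallback, assuming each card carries a first-answered timestamp (0 when it was never recorded) and a creation timestamp. The `Card` type, its field names, and the creation values are hypothetical; only the expressions and the non-zero first-answered value come from the thread.

```python
from typing import NamedTuple


class Card(NamedTuple):
    expression: str
    first_answered: float  # Unix seconds; 0.0 when never recorded
    created: float         # Unix seconds; invented values for illustration


def effective_first_answered(card: Card) -> float:
    """Fall back to the card's creation time when first_answered is missing."""
    return card.first_answered if card.first_answered > 0 else card.created


cards = [
    Card("往右拐", 0.0, 1225000000.0),
    Card("材料", 1225281937.796, 1225000000.0),
]
for card in cards:
    print(card.expression, effective_first_answered(card))
```

The substituted date is only an approximation, but it keeps rescheduled cards on the graph near the time they entered the deck instead of at the epoch.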

Nick3C commented 14 years ago

ok, it's now fixed in dev: http://code.google.com/p/anki/issues/detail?id=1104

Gosh, he's so efficient :)

batterseapower commented 14 years ago

I've added a change in my fork to exclude such cards.

hanzigraph8cards's graph shows 2 cards answered as expected:

sqlite> SELECT fieldModels.name, fields.value, cards.firstAnswered FROM cards, fields, fieldModels WHERE cards.factId = fields.factId AND fields.fieldModelId = fieldModels.id AND fieldModels.name = "Chinese";
Chinese|往右拐|0.0
Chinese|材料|1225281937.796
Chinese|晴天|0.0
Chinese|演|0.0

bug-1cardsas8's graph shows 5 cards, again as expected because CCTV's name is 5 characters long.
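
The exclusion can be reproduced from Python with the standard `sqlite3` module. The sketch below is an assumption-heavy stand-in: it builds a tiny in-memory database holding only the tables and columns named in the query above, loaded with the four rows shown, rather than opening a real Anki deck, and `answered_expressions` is a hypothetical helper.

```python
import sqlite3

# Reduced, assumed schema: only the columns the query above refers to.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fieldModels (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fields (factId INTEGER, fieldModelId INTEGER, value TEXT);
    CREATE TABLE cards (id INTEGER PRIMARY KEY, factId INTEGER, firstAnswered REAL);
""")
conn.execute("INSERT INTO fieldModels VALUES (1, 'Chinese')")
sample = [(1, "往右拐", 0.0), (2, "材料", 1225281937.796), (3, "晴天", 0.0), (4, "演", 0.0)]
for fact_id, value, first_answered in sample:
    conn.execute("INSERT INTO fields VALUES (?, 1, ?)", (fact_id, value))
    conn.execute("INSERT INTO cards (factId, firstAnswered) VALUES (?, ?)",
                 (fact_id, first_answered))


def answered_expressions(conn, field_name="Chinese"):
    """Return (value, firstAnswered) rows, skipping cards with no first-answered time."""
    query = """
        SELECT fields.value, cards.firstAnswered
        FROM cards
        JOIN fields ON cards.factId = fields.factId
        JOIN fieldModels ON fields.fieldModelId = fieldModels.id
        WHERE fieldModels.name = ? AND cards.firstAnswered > 0
        ORDER BY cards.firstAnswered
    """
    return conn.execute(query, (field_name,)).fetchall()


rows = answered_expressions(conn)
print(rows, sum(len(value) for value, _ in rows))
```

Only 材料 survives the filter, and its two characters match the count of 2 reported for hanzigraph8cards.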

batterseapower commented 14 years ago

OK, I made the change before I saw your comments above. Will need to modify it to use the created time instead.

Nick3C commented 14 years ago

"bug-1cardsas8's graph shows 5 cards, again as expected because CCTV's name is 5 characters long."

ooops, sometimes I hunt so hard for bugs I can even catch imaginary ones :)

batterseapower commented 14 years ago

Good bug hunting :-)

I've pushed the changed fix to my fork.

Nick3C commented 14 years ago

Great, working now. I have another improvement to make the graph work better. At the moment it works perfectly for "small graphs" where we want to cut the axis (otherwise the proportional differences are too small to be noticeable on the scale). At some point, however, we want to switch the graph to see the whole picture. In my graph (1800 cards) I can't see more than a tiny part of the basic HSK cards (so I can't see the pattern at all).

We basically want two types of graphs, local micro-scale ones for small changes in card numbers (relative to size of decks) and macro-scale graphs where the change is large.

I want to suggest we use a ratio-based approach: 100 * (TopPlotFigure - BottomPlotFigure) / TopPlotFigure

I did some modelling on this [cut down version]:

time      top   bottom  gap   percentage
7 days    1854  1825    29    1.56%
1 month   1854  1812    42    2.27%
3 months  1854  1700    154   8.31%
1 year    1854  780     1074  57.93%
etc

The first 3 are perfect but the 4th (1 year) isn't right. It cuts off too much of the scale when it should be showing an overall trend.

Thus we can estimate that a figure of 40% is probably the figure at which we want to display the entire graph data instead of a BottomPlotFigure minus 10.

We can treat the exact number later if necessary but this will make the graph quite a lot more useful. Doable?
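
The proposal can be sketched in a few lines. The 40% threshold and the margin of 10 are the figures suggested above; the function names are hypothetical and this is not the toolkit's implementation.

```python
def gap_percentage(top: float, bottom: float) -> float:
    """100 * (TopPlotFigure - BottomPlotFigure) / TopPlotFigure."""
    return 100 * (top - bottom) / top


def y_axis_minimum(top: float, bottom: float,
                   threshold: float = 40.0, margin: float = 10) -> float:
    """Crop the axis for small changes; show the whole graph for large ones."""
    if top <= 0:
        return 0
    if gap_percentage(top, bottom) >= threshold:
        return 0                    # macro scale: start the axis at zero
    return max(0, bottom - margin)  # micro scale: crop just below the lowest value


# The figures from the table above.
for label, bottom in [("7 days", 1825), ("1 month", 1812),
                      ("3 months", 1700), ("1 year", 780)]:
    print(label, round(gap_percentage(1854, bottom), 2), y_axis_minimum(1854, bottom))
```

With these inputs the first three ranges stay cropped and the 1 year range falls back to a zero-based axis.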

Nick3C commented 14 years ago

This is beautiful by the way. Works perfectly and looks really good.

Nick3C commented 14 years ago

Oh, the resize was working last night but isn't doing this anymore... what happened?

batterseapower commented 14 years ago

Well, I haven't implemented your suggestion above yet, so I'm not too surprised that it doesn't work :-)

I'm not really sure why you want the behaviour you outline above given that you can just increase the scale to e.g. 1 year in order to see the entire change over time. Or perhaps I've misunderstood what you are after.

Resize still seems to work as expected for me (shows the trend in the time range you have selected).

Nick3C commented 14 years ago

Oh, could have sworn it was working yesterday. Funny. Perhaps I was looking at another deck.

Anyway, the reason it is important is that if you have many characters, as I do, then the graph is capped. I can't see anything below around 790 cards no matter what scale I go to. This means that I can't see the growth of basic level cards at all. The cap is caused by showing 10 less than the lowest number, but because my first cards were imported (and I guess I imported around 800 characters originally) this means that data is totally inaccessible to me.

Come on, I wouldn't suggest it without a reason :)

batterseapower commented 14 years ago

I'm sure you have a reason for suggesting it, but I still don't understand what it is :-)

Are you saying you added cards more than 5 years ago? Because if you change the scale to 5 years you should certainly be able to see the whole trend (because 5 years ago you would have known 0 cards). If 5 years is not long enough, perhaps there should be longer periods available in the combobox?

Nick3C commented 14 years ago

My deck is only 9 months old. Check out the screenshot I just sent you and you'll see what I mean. It's set to 5 years but you don't see the whole graph, because the first action was an import (a jump from 0 to 780 in one day). Thus it cuts the graph at 780 (10 less than the import).

batterseapower commented 14 years ago

Ahhhhhh, NOW I understand :-). Thanks.

So the fix is to include "0" in the minimum/maximum calculations when the graph axis goes back further than the earliest thing answered, rather than just the first day when something was answered.
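
A rough sketch of that fix, assuming the plotted data is a mapping from day offset (negative means in the past) to the cumulative count on that day. The function name and the sample figures are illustrative, with the counts borrowed from the table earlier in the thread.

```python
from typing import Dict, Tuple


def y_axis_bounds(cumulative: Dict[int, int], range_start_day: int,
                  margin: int = 10) -> Tuple[int, int]:
    """Return (minimum, maximum) for the y axis over the visible range."""
    visible = [count for day, count in cumulative.items() if day >= range_start_day]
    if not visible:
        return (0, 0)
    if range_start_day < min(cumulative):
        # The axis reaches back before anything was answered, when the count was 0.
        visible.append(0)
    return (max(0, min(visible) - margin), max(visible))


# Hypothetical deck whose first action was an import of 780 characters.
cumulative = {-270: 780, -90: 1700, -30: 1812, -7: 1825, 0: 1854}
print(y_axis_bounds(cumulative, range_start_day=-7))     # cropped view
print(y_axis_bounds(cumulative, range_start_day=-1825))  # 5 years: starts at 0
```

Without the appended 0, the 5 year view would still be cut just below 780, which is the behaviour in the screenshot.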

I'm glad you have all these weird decks to stress test the features with :-)

Nick3C commented 14 years ago

lol. I'm fairly sure I submit 75% of the Anki bug reports. Now you can see why :)

Yes, the fix is to deal with that kind of situation, but it actually works as a "zoom" feature on decks. We are not accounting for the import itself, but for the effect, e.g. several imports near each other or a massive cram over a week will also be covered.

batterseapower / pinyin-toolkit

Hanzi Graph Improvements #48