First letter of words in user models are missing; feature_count_reduced seems incorrect

Joeran commented 10 years ago

For the recommendation with user_models.id=178687, stop_word_removal=0 and the user model looks as follows:

dcr_doc_id_207760 dcr_doc_id_3972 dcr_doc_id_2785643 dcr_doc_id_4301209 dcr_doc_id_624872 dcr_doc_id_1616038 dcr_doc_id_4301227 dcr_doc_id_4301589 dcr_doc_id_17387 dcr_doc_id_4193866 dcr_doc_id_354100 dcr_doc_id_527590 dcr_doc_id_558288 dcr_doc_id_4301197 dcr_doc_id_839720 dcr_doc_id_12140 dcr_doc_id_3167241 dcr_doc_id_5733078 dcr_doc_id_712701 dcr_doc_id_1167876 dcr_doc_id_619080 dcr_doc_id_516771 dcr_doc_id_1365590 dcr_doc_id_1885504 dcr_doc_id_4301470 dcr_doc_id_4301216 dcr_doc_id_4300680 dcr_doc_id_777228 dcr_doc_id_2743977 dcr_doc_id_1956050 dcr_doc_id_1025 dcr_doc_id_4300538 dcr_doc_id_4301597 dcr_doc_id_4193869 dcr_doc_id_2474771 dcr_doc_id_4300626 dcr_doc_id_4300984 dcr_doc_id_2967827 dcr_doc_id_521456 dcr_doc_id_3454222 dcr_doc_id_9476566 dcr_doc_id_3489831 dcr_doc_id_1119 for the and research based recommendation recommender systems user paper papers citation system ecommendation with ecommending filtering collaborative recommendations peer digital web framework information ser approach ntological profiling retrieval review analysis related aper document recommending citations independent distance scientific library earning itation ontext ware etwork igging ocial riendship etworks model libraries mapping mind source measure good using multi support what survey tagging models impact measures tag you ollaborative from generation building reading urvey modeling content cientific ersonalized ystem into daptive nvironments semantic articles between methods scholarly usage mechanism papits etrieval nformation nhancing cademic aware esearch data concept tree automatic large scale application search similarity google access records indexing ranking personalized profile study lists utomatically apping ind oftware work top ast ibliographic use avoiding look on't pitfalls stupid when elated imilarity odel ontent opic robabilistic rticles utomatic references translating translation service ealizing lement ommunication omprehensive ore rofile ystems novel algorithm 
find clustering hybrid any art state comparison social random graphs multiple criteria ntroduction profiles journals classification networks compendium issues personalised ech ens irection nalysis performance can tags journal apers itations knowledge edit problems book science eer ite becomes mahout mendeley researcher article inding predictive cience valuating can’t structure user's cited introduction strategies against

The following is odd about it:

1. Looking at the user model, most of the words in it look very odd (e.g. "etwork igging ocial riendship etworks"). Very often the first letter seems to be missing. I could imagine this is related to algorithms.node_info_source, which is 6 in this case (references & PDF titles). Could it be that either the references or the PDF titles are read incorrectly, so that the first letter is dropped each time?
2. In this example user_models.feature_count_expanded=1207 and feature_count_reduced=1202, although no stop-word removal should be applied at all. So where does the difference of 5 come from? Were stop words removed even though they should not have been (and if so, why are they still in the user model)? Were other features removed? Could this be why the click-through rates for stop-word removal on/off were contradictory in the past? Ideally, feature_count_reduced should be NULL when stop words are not removed. Or have I overlooked something, and we remove other features as well?

Independently of that, it is odd that both references and words are used although node_info_source=6. In that case only words should have been used (see #85). Could #85 be the cause of all these problems?

stlanger commented 10 years ago

Re 2.:

        origCounter = new DocearTermCounter(matchVersion, stream);
        stream = origCounter;

        if (stopwordRemoval) {
            stream = new StopFilter(matchVersion, stream, stopwords, true);
            stream = new StopFilter(matchVersion, stream, loadStopwords(getClass().getResourceAsStream("/germanStopwords.txt")), true);
            stream = new StopFilter(matchVersion, stream, loadStopwords(getClass().getResourceAsStream("/stopwords.txt")), true);
        }

        if (stemming) {
            // stream = new StemmingFilter(matchVersion, stream, true);
        }

        stream = new DocearFilter(matchVersion, stream, true);

        reducedCounter = new DocearTermCounter(matchVersion, stream); 
        stream = reducedCounter;

I.e. before the "reduced counter" is computed, the DocearFilter runs and modifies the model.

            if(ignoreCase) {
                term = term.toLowerCase();
            }

            // number only filter
            if(numberOnlyPattern.matcher(term).find()) {
                return false;
            }

ignoreCase is always true, i.e. capitalized and lowercase words are now merged
Pure numbers are filtered out by private final Pattern numberOnlyPattern = Pattern.compile("^[\\d\\.,]+$");

Unfortunately it is not possible to tell afterwards what exactly was filtered, but filtering does take place in any case.
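
To illustrate how the two rules above can shrink the count even with stop-word removal switched off, here is a minimal sketch. It does not use Lucene or the Docear classes: `FilterCountSketch`, `countExpanded` and `countReduced` are hypothetical names, the sample terms are made up, and it assumes that both counters count distinct terms, which the snippets above do not confirm.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

class FilterCountSketch {
    // Same pattern as quoted above: terms made up only of digits, dots and commas.
    private static final Pattern NUMBER_ONLY = Pattern.compile("^[\\d\\.,]+$");

    /** Distinct terms before filtering (hypothetical stand-in for feature_count_expanded). */
    static int countExpanded(List<String> terms) {
        return new LinkedHashSet<>(terms).size();
    }

    /** Distinct terms after lower-casing and the number-only filter (stand-in for feature_count_reduced). */
    static int countReduced(List<String> terms) {
        Set<String> kept = new LinkedHashSet<>();
        for (String term : terms) {
            String lower = term.toLowerCase(Locale.ROOT);
            if (NUMBER_ONLY.matcher(lower).find()) {
                continue; // dropped, like "return false" in the filter
            }
            kept.add(lower); // "Web" and "web" collapse into one entry here
        }
        return kept.size();
    }

    public static void main(String[] args) {
        // Made-up sample terms, not taken from the real user model.
        List<String> terms = Arrays.asList("Web", "web", "2010", "3.5", "research", "Research", "paper");
        // prints "expanded=7, reduced=3"
        System.out.println("expanded=" + countExpanded(terms) + ", reduced=" + countReduced(terms));
    }
}
```

Under these assumptions, a handful of numeric tokens or case variants would already be enough to explain a difference such as 1207 vs. 1202.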

stlanger commented 10 years ago

Re 1.: I was sure we had had this discussion before, but could not remember how it ended. I could not find anything in the code, but just now it clicked: some of your maps contain "junk":

e.g.:

literature_and_annotations.mm:<node TEXT="In these s ystems, users explicitly express t heir preferences by giving either [...]
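
A small sketch of how such a node text could turn into terms like "ystems" and "heir". This is not the Docear tokenizer: `SplitWordSketch` and `tokens` are hypothetical names, and the sketch assumes simple whitespace splitting plus dropping one-character tokens, which would explain why the stray "s" and "t" do not show up in the user model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SplitWordSketch {
    /** Splits on whitespace, trims punctuation, drops one-character tokens (assumed behaviour). */
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            // Strip punctuation at both ends, e.g. "ystems," becomes "ystems".
            String cleaned = token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (cleaned.length() > 1) {
                result.add(cleaned);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Beginning of the node text quoted above, with the stray spaces from the PDF extraction.
        String nodeText = "In these s ystems, users explicitly express t heir preferences";
        // prints [in, these, ystems, users, explicitly, express, heir, preferences]
        System.out.println(tokens(nodeText));
    }
}
```

If this is what happens, the missing first letters come from the annotation text in the maps themselves rather than from the reference or PDF title extraction.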

BeelGroup / Docear-API, issue #88