is feature weighting really correct? (tf-idf, tf only, ...)

Joeran commented 10 years ago

This chart shows CTRs and runtimes for weighting terms and citations with TF, TF-IDF (based on the PDF corpus), and TF-IDF (based on the user's mind maps). All data is from August 2013 onward.

1. The average runtime of TF is higher than that of TF-IDF (Corpus). That is not plausible. Both TF and TF-IDF require calculating the term frequency (TF); only TF-IDF additionally requires calculating IDF. So calculating TF-IDF should take at least as long as calculating TF alone.
2. TF-IDF (MM) has not been applied to citations since around mid-2013. Why? (Before this can be answered, #78 should probably be fixed.)
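
The runtime argument in the first point can be illustrated with a minimal sketch. This is not the Docear code: the class `WeightingSketch`, its method names, and the `log(N / df)` form of IDF are assumptions chosen for illustration. It only shows that a TF-IDF weighting does everything a TF weighting does, plus one extra pass.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; names and the IDF formula are assumptions.
final class WeightingSketch {

    /** Raw term frequency: how often each term occurs in the input. */
    static Map<String, Double> tf(List<String> terms) {
        Map<String, Double> weights = new HashMap<>();
        for (String term : terms) {
            weights.merge(term, 1.0, Double::sum);
        }
        return weights;
    }

    /** TF-IDF: computes TF first, then scales each weight by log(N / df). */
    static Map<String, Double> tfIdf(List<String> terms,
                                     Map<String, Integer> docFreq,
                                     int numDocs) {
        Map<String, Double> weights = tf(terms); // same work as TF only
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            // Unknown terms are treated as occurring in one document.
            int df = Math.max(1, docFreq.getOrDefault(e.getKey(), 1));
            e.setValue(e.getValue() * Math.log((double) numDocs / df));
        }
        return weights;
    }
}
```

In this sketch `tfIdf` calls `tf` and then does additional work, so model creation with TF-IDF cannot be cheaper than with TF. Any measured difference in the other direction would have to come from a later stage, such as the query itself.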

stlanger commented 10 years ago

select A.weighting_scheme, avg(S.computation_time), avg(U.execution_time) from recommendations_documents_set S 
JOIN user_models U ON (S.user_model_id = U.id)
JOIN algorithms A ON (A.id = U.algorithm_id)
WHERE S.created > '2014-01-01' AND A.approach<>2
GROUP BY A.weighting_scheme

stlanger commented 10 years ago

Very strange: Lucene query time is much worse for TF-only user models when calculated over the data since 2013-08-01:

select A.weighting_scheme, round(avg(S.computation_time)) AS lucene_query_time, round(avg(U.execution_time)) AS model_creation_time from recommendations_documents_set S 
JOIN user_models U ON (S.user_model_id = U.id)
JOIN algorithms A ON (A.id = U.algorithm_id)
WHERE S.created > '2013-08-01' AND A.approach<>2
GROUP BY A.weighting_scheme

stlanger commented 10 years ago

For term-only recommendations, user model creation time differs more:

stlanger commented 10 years ago

Update: just a coincidence.

Without feature boosting, and for models of the same size, the time is about equal:

select A.weighting_scheme, round(avg(S.computation_time)) AS lucene_query_time, round(avg(U.execution_time)) AS model_creation_time from recommendations_documents_set S 
JOIN user_models U ON (S.user_model_id = U.id)
JOIN algorithms A ON (A.id = U.algorithm_id)
WHERE S.created > '2013-08-01' AND A.approach<>2
AND A.data_element_type=1
AND A.data_element_type_weighting='1'
AND A.feature_weight_submission=0
AND U.feature_count_reduced_unique between 590 AND 600
AND (A.default_algorithm <> 1 OR A.default_algorithm IS NULL)
GROUP BY A.weighting_scheme

with boosting there is a huge difference:

stlanger commented 10 years ago

:) :) The difference in Lucene query time between TF-generated and TF-IDF-generated terms is simply that terms favored by TF are generally found in many more documents than terms favored by TF-IDF.

--> With TF, many more results need to be scored and merged by Lucene for every term.
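
A rough way to picture this, as a sketch only: for an OR query, the number of (term, document) matches that have to be scored grows with the sum of the document frequencies of the query terms. The class `QueryCostSketch` and the document frequencies in the usage note are made up for illustration, and the model ignores any optimizations Lucene applies internally.

```java
import java.util.List;
import java.util.Map;

// Hypothetical cost model; not a Lucene API.
final class QueryCostSketch {

    /** Sums the document frequencies of all query terms (0 if unknown). */
    static long postingsToScore(List<String> queryTerms,
                                Map<String, Integer> docFreq) {
        long total = 0;
        for (String term : queryTerms) {
            total += docFreq.getOrDefault(term, 0);
        }
        return total;
    }
}
```

With made-up document frequencies such as `{"system": 50000, "data": 40000, "docear": 120, "mindmap": 300}`, a TF-style model containing "system" and "data" touches 90000 matches, while a TF-IDF-style model containing "docear" and "mindmap" touches 420.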

stlanger commented 10 years ago

Regarding the 2nd question:

Recommendations created for mind maps (MM) with TF (weighting_scheme 1) vs TF-IDF (weighting_scheme 2):

select A.weighting_scheme, count(*) AS count
from recommendations_documents_set S 
JOIN user_models U ON (S.user_model_id = U.id)
JOIN algorithms A ON (U.algorithm_id = A.id)
WHERE S.created BETWEEN '2013-04-01' AND '2013-05-01'
AND A.data_element=1
AND A.default_algorithm IS NULL
GROUP BY A.weighting_scheme

[Query results for Apr 2013, Aug 2013, Oct 2013, Feb 2014, and Dec 2013; output not preserved]

Joeran commented 10 years ago

I think we misunderstood each other. Try this:

select S.id, A.weighting_scheme, A.weight_idf, S.created
from recommendations_documents_set S
JOIN user_models U ON (S.user_model_id = U.id)
JOIN algorithms A ON (U.algorithm_id = A.id)
WHERE S.created BETWEEN '2013-02-01' AND '2013-12-01'
AND A.data_element_type=2
AND A.weighting_scheme=2
AND A.weight_idf=1
ORDER BY S.created

Since May 15 there have been no recommendations with these settings.

By the way, if you change "AND A.weight_idf=1" to "AND A.weight_idf=2", considerably more recommendations are shown. I am not sure whether it is correct that, even before May 15, considerably more recommendations were shown with weight_idf=2 than with weight_idf=1.

stlanger commented 10 years ago

Re: "even before May 15, considerably more recommendations were shown with weight_idf=2 than with weight_idf=1": that fits. The split is 1/3 TF, 1/3 TF-IDF on mind maps, 1/3 TF-IDF on texts.

I am still checking the rest.

stlanger commented 10 years ago

Regarding the query that has returned no results since May 15:

I explicitly set it up so that, for weighting scheme 2 and data_element_type != 1, weight_idf is also set to 2:

        //citations are used
        if (alg.getDataElementType() != 1) {
            s += "," + (r.nextInt(1000)+1);

            // IDF for citations needs to be based on fulltexts
            if (alg.getWeightingScheme() == 2) {
                alg.setWeightIDF(2);
            }

        }
        alg.setDataElementTypeWeighting(s);

If that is not wanted, I can remove it. Should I?

Joeran commented 10 years ago

Yes, please remove it again, at least provided it works. Your comment "// IDF for citations needs to be based on fulltexts" sounds to me as if it necessarily has to be that way.

stlanger commented 10 years ago

As discussed by phone:

For citations, IDF can currently only be calculated on the full-text corpus, because Lucene already contains the document IDs (dcr_xxxxxxxx).
This is not yet possible on the mind maps, but it can be implemented with the help of the table mindmaps_pdfhash (not that important at the moment).

BeelGroup / Docear-API

is feature weighting really correct? (tf-idf, tf only, ...) #98