BeelGroup / Docear-API

Web-based components (recommender, database, hibernate, ...)

is feature weighting really correct? (tf-idf, tf only, ...) #98

Closed Joeran closed 10 years ago

Joeran commented 10 years ago

This chart shows CTRs and runtimes for weighting terms and citations with TF, TF-IDF (based on the PDF corpus), and TF-IDF (based on the user's mind maps). All data is from August 2013 onward.

image

  1. The average runtime of TF is higher than that of TF-IDF (Corpus). That is not plausible. Both TF and TF-IDF require calculating the term frequency (TF); only TF-IDF additionally calculates IDF. So calculating TF-IDF should take more time than calculating TF alone.
  2. TF-IDF (MM) has not been applied to citations since around mid-2013. Why? (Before this can be answered, #78 should probably be fixed.)
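
To make the argument in point 1 concrete, here is a minimal sketch of the two weighting schemes. It is not the Docear implementation: the class name `WeightingSketch`, the method names, the fallback for unseen terms, and the sample data are all assumptions chosen for illustration. It only shows that a TF-IDF computation contains the TF computation as its first step.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WeightingSketch {

    /** Counts how often each term occurs in the user model (raw term frequency). */
    static Map<String, Double> tf(List<String> terms) {
        Map<String, Double> weights = new HashMap<>();
        for (String t : terms) {
            weights.merge(t, 1.0, Double::sum);
        }
        return weights;
    }

    /** Reuses tf() and then multiplies each weight by log(N / df), so it does strictly more work. */
    static Map<String, Double> tfIdf(List<String> terms, Map<String, Integer> docFreq, int corpusSize) {
        Map<String, Double> weights = tf(terms);
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            // Assumption: terms missing from the corpus statistics are treated as df = 1.
            int df = docFreq.getOrDefault(e.getKey(), 1);
            e.setValue(e.getValue() * Math.log((double) corpusSize / df));
        }
        return weights;
    }

    public static void main(String[] args) {
        // Hypothetical user model and corpus statistics.
        List<String> terms = List.of("recommender", "system", "system", "mindmap");
        Map<String, Integer> docFreq = Map.of("recommender", 10, "system", 500, "mindmap", 2);
        System.out.println("TF:     " + tf(terms));
        System.out.println("TF-IDF: " + tfIdf(terms, docFreq, 1000));
    }
}
```

Under these assumptions, model creation with TF-IDF cannot be cheaper than with TF, which is why the measured difference has to come from somewhere else in the pipeline.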

stlanger commented 10 years ago

select A.weighting_scheme, avg(S.computation_time), avg(U.execution_time) from recommendations_documents_set S 
JOIN user_models U ON (S.user_model_id = U.id)
JOIN algorithms A ON (A.id = U.algorithm_id)
WHERE S.created > '2014-01-01' AND A.approach<>2
GROUP BY A.weighting_scheme

image

stlanger commented 10 years ago

Very strange: the Lucene query time is much worse for TF-only user models when calculated over the data since 2013-08-01:

select A.weighting_scheme, round(avg(S.computation_time)) AS lucene_query_time, round(avg(U.execution_time)) AS model_creation_time from recommendations_documents_set S 
JOIN user_models U ON (S.user_model_id = U.id)
JOIN algorithms A ON (A.id = U.algorithm_id)
WHERE S.created > '2013-08-01' AND A.approach<>2
GROUP BY A.weighting_scheme

image

stlanger commented 10 years ago

For term-only recommendations, the user model creation time differs even more: image

stlanger commented 10 years ago

Update: this was just a coincidence.

Without feature boosting, the times for models of the same size are about equal:

select A.weighting_scheme, round(avg(S.computation_time)) AS lucene_query_time, round(avg(U.execution_time)) AS model_creation_time from recommendations_documents_set S 
JOIN user_models U ON (S.user_model_id = U.id)
JOIN algorithms A ON (A.id = U.algorithm_id)
WHERE S.created > '2013-08-01' AND A.approach<>2
AND A.data_element_type=1
AND A.data_element_type_weighting='1'
AND A.feature_weight_submission=0
AND U.feature_count_reduced_unique between 590 AND 600
AND (A.default_algorithm <> 1 OR A.default_algorithm IS NULL)
GROUP BY A.weighting_scheme

image

With boosting, there is a huge difference: image

stlanger commented 10 years ago

:) :) The difference in Lucene query time between TF-generated and TF-IDF-generated terms comes down to this: terms favored by TF are generally found in many more documents than terms favored by TF-IDF.

--> With TF, many more results need to be scored and merged by Lucene for every term.
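
The effect can be illustrated with a rough cost proxy: the sum of the document frequencies of the query terms, i.e. how many postings a search engine has to score and merge. This is a simplified sketch, not a measurement of Lucene itself; the class name `PostingCostSketch`, the method name, and all document frequencies are hypothetical.

```java
import java.util.List;
import java.util.Map;

class PostingCostSketch {

    /** Sums the document frequencies of the query terms as a rough proxy for query cost. */
    static long postingsTouched(List<String> queryTerms, Map<String, Integer> docFreq) {
        long total = 0;
        for (String t : queryTerms) {
            total += docFreq.getOrDefault(t, 0);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical document frequencies in a large PDF corpus.
        Map<String, Integer> docFreq = Map.of(
                "system", 400_000, "data", 350_000, "docear", 120, "mindmap", 900);
        List<String> tfModel = List.of("system", "data");       // frequent in the user's data and in the corpus
        List<String> tfIdfModel = List.of("docear", "mindmap"); // rare in the corpus, so favored by IDF
        System.out.println("TF model postings:     " + postingsTouched(tfModel, docFreq));
        System.out.println("TF-IDF model postings: " + postingsTouched(tfIdfModel, docFreq));
    }
}
```

With two query terms each, the TF-style model touches several hundred times more postings in this toy example, even though building the model itself is cheaper.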

stlanger commented 10 years ago

Regarding the 2nd question:

Recommendations created for mind maps (MM) with TF (weighting_scheme 1) vs. TF-IDF (weighting_scheme 2):

select A.weighting_scheme, count(*) AS count
from recommendations_documents_set S 
JOIN user_models U ON (S.user_model_id = U.id)
JOIN algorithms A ON (U.algorithm_id = A.id)
WHERE S.created BETWEEN '2013-04-01' AND '2013-05-01'
AND A.data_element=1
AND A.default_algorithm IS NULL
GROUP BY A.weighting_scheme

Apr 2013 image

Aug 2013 image

Oct 2013 image

Feb 2014 image

Dec 2013 image

Joeran commented 10 years ago

I think we misunderstood each other. Try this:

select S.id, A.weighting_scheme, A.weight_idf, S.created
from recommendations_documents_set S 
JOIN user_models U ON (S.user_model_id = U.id)
JOIN algorithms A ON (U.algorithm_id = A.id)
WHERE S.created BETWEEN '2013-02-01' AND '2013-12-01'
AND A.data_element_type=2
AND A.weighting_scheme=2
AND A.weight_idf=1
ORDER BY S.created

Since May 15 there have been no recommendations with these settings.

By the way, if you change "AND A.weight_idf=1" to "AND A.weight_idf=2", considerably more recommendations are returned. I am not sure whether it is correct that, even before May 15, considerably more recommendations were shown with weight_idf=2 than with weight_idf=1.

stlanger commented 10 years ago

Re: "that, even before May 15, considerably more recommendations were shown with weight_idf=2 than with weight_idf=1": that fits. 1/3: TF, 1/3: TF-IDF on mind maps, 1/3: TF-IDF on texts.

I will still check the rest.

stlanger commented 10 years ago

Regarding the query that has returned no results since May 15:

I explicitly set it up so that, for weighting scheme 2 and data_element_type != 1, weight_idf is also set to 2:

        //citations are used
        if (alg.getDataElementType() != 1) {
            s += "," + (r.nextInt(1000)+1);

            // IDF for citations needs to be based on fulltexts
            if (alg.getWeightingScheme() == 2) {
                alg.setWeightIDF(2);
            }

        }
        alg.setDataElementTypeWeighting(s);

If that is not intended, I can remove it. Should I?
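
The following sketch shows why such an override makes the queried combination disappear. It is not the project's code: the class `WeightIdfOverrideSketch`, the reduced `Algorithm` type, the value ranges of the random draws, and the reading of weight_idf 1 as "IDF from mind maps" are assumptions inferred from this thread.

```java
import java.util.Random;

class WeightIdfOverrideSketch {

    /** Hypothetical stand-in for the algorithm settings; only the fields discussed here. */
    static final class Algorithm {
        int dataElementType; // 1 = terms only; other values involve citations (per the excerpt above)
        int weightingScheme; // 1 = TF, 2 = TF-IDF
        int weightIdf;       // assumed: 1 = IDF from mind maps, 2 = IDF from full texts
    }

    /** Draws random settings and optionally applies the override from the excerpt above. */
    static Algorithm randomAlgorithm(Random r, boolean applyOverride) {
        Algorithm alg = new Algorithm();
        alg.dataElementType = r.nextInt(3) + 1; // assumption: three data element types, 1..3
        alg.weightingScheme = r.nextInt(2) + 1;
        alg.weightIdf = r.nextInt(2) + 1;
        if (applyOverride && alg.dataElementType != 1 && alg.weightingScheme == 2) {
            alg.weightIdf = 2; // citations with TF-IDF always end up with full-text IDF
        }
        return alg;
    }

    /** Counts draws that match the query: data_element_type=2, weighting_scheme=2, weight_idf=1. */
    static int countMindMapIdfForCitations(long seed, int draws, boolean applyOverride) {
        Random r = new Random(seed);
        int count = 0;
        for (int i = 0; i < draws; i++) {
            Algorithm alg = randomAlgorithm(r, applyOverride);
            if (alg.dataElementType == 2 && alg.weightingScheme == 2 && alg.weightIdf == 1) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("with override:    " + countMindMapIdfForCitations(42L, 10_000, true));
        System.out.println("without override: " + countMindMapIdfForCitations(42L, 10_000, false));
    }
}
```

With the override enabled the count is always zero, which matches the empty result of the query for the period after the change.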

Joeran commented 10 years ago

Yes, please remove it again, at least provided it still works. Your comment "// IDF for citations needs to be based on fulltexts" sounds to me as if it necessarily had to be that way.

stlanger commented 10 years ago

As discussed by phone: