BeelGroup / Docear-API

Web-based components (recommender, database, hibernate, ...)

tmp_user_person is faulty (duplicate entries & wrong numbers?) #82

Closed. Joeran closed this issue 10 years ago.

Joeran commented 10 years ago

The table tmp_user_person contains duplicate entries (each user should appear only once). At first glance it looks as if the duplicate rows differ only in the T_id column; all other columns for the duplicated user IDs appear to be identical.

[screenshot]

Original Message https://mail.google.com/mail/u/0/#inbox/144732b5c0fe8cd8
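
A quick way to list the affected users would be a grouping query along these lines (a sketch; it assumes U_user is the user-id column of tmp_user_person, as in the queries further down):

    -- list users that appear more than once in tmp_user_person
    SELECT U_user, COUNT(*) AS rows_per_user
    FROM tmp_user_person
    GROUP BY U_user
    HAVING COUNT(*) > 1;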

Joeran commented 10 years ago

Once this ticket is resolved, #57 will probably be obsolete as well.

Joeran commented 10 years ago

By the way, the table tmp_rec_users also contains duplicate users. That table is somewhat older, so if you copied data for tmp_user_person from tmp_rec_users, that could be where the error comes from.

Joeran commented 10 years ago

Please also extend the tmp_user_person table with a join on recommendations_users_settings (that information has been missing so far).
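
A minimal sketch of what that join could look like (the settings columns are taken from the export query below; U_user as the user column of tmp_user_person is an assumption):

    -- pull the per-user settings in via a join on the user id
    SELECT T.*, US.use_prefix, US.highlight, US.recommendations_labels_id
    FROM tmp_user_person T
    LEFT JOIN recommendations_users_settings US ON (US.user_id = T.U_user);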

Joeran commented 10 years ago

And the number of papers per user still needs to be added to the table. The query is

select M.user, COUNT(distinct pdfhash) as papers from mindmaps_pdfhash P join mindmaps M on (P.mindmap_id = M.id) group by M.user
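
One way to merge that count into the table would be an UPDATE with a joined subquery (a sketch only; the target column papers in tmp_user_person and the user column U_user are assumptions):

    -- write the per-user paper count into tmp_user_person
    UPDATE tmp_user_person T
    JOIN (
        SELECT M.user, COUNT(DISTINCT P.pdfhash) AS papers
        FROM mindmaps_pdfhash P
        JOIN mindmaps M ON (P.mindmap_id = M.id)
        GROUP BY M.user
    ) X ON (X.user = T.U_user)
    SET T.papers = X.papers;  -- "papers" as target column is hypothetical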

Ticket #42 should also be taken into account.

Joeran commented 10 years ago

There is an error in there somewhere. Here is what I noticed:

According to this query

select * from tmp_user_person where U_user = 258

the user JamesRoughton has received 433 recommendation sets and 3871 individual recommendations.

According to this query (the one I always use for exporting), however, James has received over 4000 recommendations; a count-only version of it is sketched right after the full query below.

select 
    S.id AS set_id, S.created AS set_created, S.delivered,D.clicked,
    D.id AS rec_id, D.fulltext_url_id AS fulltext_id, D.original_rank, D.presentation_rank, D.relevance,
    S.user_model_id,
    UM.model AS user_model_content,
    A.id AS algorithm_id, A.data_source, A.data_source_limitation, A.data_element, A.data_element_type, A.data_element_type_weighting, A.element_selection_method, A.element_amount, A.root_path, A.time_frame, A.child_nodes, A.sibling_nodes, A.stop_word_removal, A.stemming, A.result_amount, A.approach, A.weighting_scheme, A.weight_tf, A.weight_idf, A.feature_weight_submission,
    S.trigger_type AS set_trigger,S.application_id, S.auto, S.`old`, 
    L.id AS label_id, L.`type` AS label_type, L.value AS label_text, 
    US.use_prefix AS sponsored_prefix, US.highlight,           
    S.user_id,
    U.username, U.`type` AS user_type, U.lang AS user_language, U.registrationdate, U.allow_backup, U.allow_content_research, U.allow_information_retrieval, U.allow_usage_research, U.allow_recommendations, U.remote_address, 
    P.dob AS day_of_birth, P.gender, 
    UM.execution_time AS user_model_creation_time,
    S.computation_time AS set_computation_time, S.delivery_time AS set_delivery_time,
    S.rec_amount_potential, S.rec_amount_should, S.rec_amount_current, S.rec_original_rank_max, S.rec_original_rank_min, S.rec_selected_from_top_x, 
    UM.ratio_keywords, UM.ratio_references, UM.entity_total_count, UM.mindmap_count_total, UM.node_count_total, UM.paper_count_total, UM.node_count_before_expanded, UM.node_count_expanded, UM.feature_count_expanded, UM.feature_count_expanded_unique, UM.feature_count_reduced, UM.feature_count_reduced_unique, UM.um_size_relative, UM.um_feature_weight_max, UM.um_feature_weight_min, UM.um_feature_weight_avg, UM.um_feature_weight_top3_avg, UM.um_feature_weight_top5_avg, UM.um_feature_weight_top10_avg, UM.um_feature_weight_last3_avg, UM.um_feature_weight_last5_avg, UM.um_feature_weight_last10_avg,
    S.rec_clicked_count, S.rec_clicked_ctr, S.user_days_started, S.user_days_since_registered, S.user_sets_delivered, S.offline_evaluator,
    A.node_depth, A.node_depth_metric, A.no_children, A.no_children_metric, A.no_siblings, A.no_siblings_metric, A.word_count, A.word_count_metric, A.node_weight_combo_scheme, A.node_weight_normalization,
    A.node_visibility, A.no_days_since_max, A.no_days_since_chosen, A.no_children_level, A.default_algorithm, S.received, S.offline_evaluator_paper_position, S.user_rating, D.shown_before, UM.link_count_total, A.node_info_source
from recommendations_documents_set S
JOIN recommendations_users_settings US ON (S.user_id = US.user_id)
JOIN recommendations_labels L ON (L.id = US.recommendations_labels_id)
JOIN recommendations_documents D ON (S.id = D.recommendations_documents_set_id)
JOIN users U ON (U.id = S.user_id)
JOIN user_models UM ON (UM.id = S.user_model_id)
JOIN algorithms A ON (A.id = UM.algorithm_id)
LEFT JOIN persons P ON (P.id = U.person_id)
WHERE US.user_id =258 AND S.delivered IS NOT NULL 
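
For comparison, a stripped-down count-only version of this export query might look like this (a sketch; it drops the detail joins that are only needed for the exported columns and just counts delivered sets and individual recommendations for user 258):

    -- count delivered sets and individual recommendations for user 258
    SELECT COUNT(DISTINCT S.id) AS sets, COUNT(D.id) AS recommendations
    FROM recommendations_documents_set S
    JOIN recommendations_documents D ON (S.id = D.recommendations_documents_set_id)
    WHERE S.user_id = 258 AND S.delivered IS NOT NULL;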

So what is wrong: the query I use for the export, or tmp_user_person?

stlanger commented 10 years ago

found why there are duplicates: it's due to the join with tmp_rec_users [screenshot]

the reason for duplicates in tmp_rec_users is in stats_recommendations_users.py:

    query = """INSERT INTO tmp_rec_users(user_id) SELECT id FROM users WHERE id NOT IN (1, 2, 27)""";
    db.query(query)

    #insert users and sets_total    
    query = """INSERT INTO tmp_rec_users(user_id, sets_total) 
            SELECT user_id, count(*) AS counter FROM recommendations_documents_set S 
            WHERE user_id NOT IN (1, 2, 27) AND S.delivered IS NOT NULL AND (offline_evaluator IS NULL || offline_evaluator = 0) GROUP BY user_id""";
    db.query(query)
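
A possible way to avoid the two INSERTs producing two rows per user would be to keep the first INSERT and turn the second statement into an UPDATE over the freshly inserted rows (a sketch only, not the actual fix applied in the repo):

    -- fill sets_total on the existing rows instead of inserting a second row per user
    UPDATE tmp_rec_users R
    JOIN (
        SELECT user_id, COUNT(*) AS counter
        FROM recommendations_documents_set S
        WHERE user_id NOT IN (1, 2, 27)
          AND S.delivered IS NOT NULL
          AND (offline_evaluator IS NULL OR offline_evaluator = 0)
        GROUP BY user_id
    ) C ON (R.user_id = C.user_id)
    SET R.sets_total = C.counter;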

trying to figure out which data exactly I need in this table: really all users, or just the ones with recommendations?

Joeran commented 10 years ago

tmp_user_person should include data from all registered users, including those without activated recommendations.
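
So the fill query would have to start from users and LEFT JOIN the recommendation data, roughly like this (a sketch with assumed column names, not the actual statement from the repo):

    -- users without recommendation data still get a row; their sets_total is NULL
    SELECT U.id AS user_id, R.sets_total
    FROM users U
    LEFT JOIN tmp_rec_users R ON (R.user_id = U.id)
    WHERE U.id NOT IN (1, 2, 27);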

stlanger commented 10 years ago

duplicates resolved

stlanger commented 10 years ago

Joeran: both queries for the user_id 258 now return the same numbers - could you please verify?

stlanger commented 10 years ago

T_stated_docear_on_x_days is computed using the table "users_applications":

            SELECT A.user_id, COUNT(DISTINCT(DATE(A.time))) AS counter 
            FROM users_applications A 
            WHERE user_id IS NOT NULL 
            GROUP BY A.user_id

this table is filled whenever Docear searches for updates (on every start by default)

Joeran commented 10 years ago

Stefan has fixed everything.