Closed Joeran closed 10 years ago
wenn dieses ticket gelöst ist, ist wahrscheinlich auch #57 obsolet
die tabelle tmp_rec_users enthält übrigens auch doppelte user. diese tabelle ist ja schon etwas älter. falls du für tmp_user_person daten von tmp_rec_users übernommen hast, könnte der fehler daran liegen.
Bitte erweitere die Tabelle tmp_user_person auch noch mit einem Join auf recommendations_users_settings (die infos fehlen bisher)
Und Die Anzahl der Paper pro user müsste noch in die Tabelle. Die Query ist
select user, COUNT(distinct pdfhash) as papers from mindmaps_pdfhash P join mindmaps M on (P.mindmap_id = M.id) group by M.user
ticket #42 sollte ebenfalls beachtet werden.
da steckt irgendwo ein fehler drin. folgendes ist mir aufgefallen:
laut dieser query
select * from tmp_user_person where U_user=258
hat der user JamesRoughton 433 recommendation sets und 3871 einzelne Empfehlungen erhalten.
Laut dieser Query (die ich immer zum exportieren nutze) hat james aber über 4000 empfehlungen erhalten
select
S.id AS set_id, S.created AS set_created, S.delivered,D.clicked,
D.id AS rec_id, D.fulltext_url_id AS fulltext_id, D.original_rank, D.presentation_rank, D.relevance,
S.user_model_id,
UM.model AS user_model_content,
A.id AS algorithm_id, A.data_source, A.data_source_limitation, A.data_element, A.data_element_type, A.data_element_type_weighting, A.element_selection_method, A.element_amount, A.root_path, A.time_frame, A.child_nodes, A.sibling_nodes, A.stop_word_removal, A.stemming, A.result_amount, A.approach, A.weighting_scheme, A.weight_tf, A.weight_idf, A.feature_weight_submission,
S.trigger_type AS set_trigger,S.application_id, S.auto, S.`old`,
L.id AS label_id, L.`type` AS label_type, L.value AS label_text,
US.use_prefix AS sponsored_prefix, US.highlight,
S.user_id,
U.username, U.`type` AS user_type, U.lang AS user_language, U.registrationdate, U.allow_backup, U.allow_content_research, U.allow_information_retrieval, U.allow_usage_research, U.allow_recommendations, U.remote_address,
P.dob AS day_of_birth, P.gender,
UM.execution_time AS user_model_creation_time,
S.computation_time AS set_computation_time, S.delivery_time AS set_delivery_time,
S.rec_amount_potential, S.rec_amount_should, S.rec_amount_current, S.rec_original_rank_max, S.rec_original_rank_min, S.rec_selected_from_top_x,
UM.ratio_keywords, UM.ratio_references, UM.entity_total_count, UM.mindmap_count_total, UM.node_count_total, UM.paper_count_total, UM.node_count_before_expanded, UM.node_count_expanded, UM.feature_count_expanded, UM.feature_count_expanded_unique, UM.feature_count_reduced, UM.feature_count_reduced_unique, UM.um_size_relative, UM.um_feature_weight_max, UM.um_feature_weight_min, UM.um_feature_weight_avg, UM.um_feature_weight_top3_avg, UM.um_feature_weight_top5_avg, UM.um_feature_weight_top10_avg, UM.um_feature_weight_last3_avg, UM.um_feature_weight_last5_avg, UM.um_feature_weight_last10_avg,
S.rec_clicked_count, S.rec_clicked_ctr, S.user_days_started, S.user_days_since_registered, S.user_sets_delivered, S.offline_evaluator,
A.node_depth, A.node_depth_metric, A.no_children, A.no_children_metric, A.no_siblings, A.no_siblings_metric, A.word_count, A.word_count_metric, A.node_weight_combo_scheme, A.node_weight_normalization,
A.node_visibility, A.no_days_since_max, A.no_days_since_chosen, A.no_children_level, A.default_algorithm, S.received, S.offline_evaluator_paper_position, S.user_rating, D.shown_before, UM.link_count_total, A.node_info_source
from recommendations_documents_set S
JOIN recommendations_users_settings US ON (S.user_id = US.user_id)
JOIN recommendations_labels L ON (L.id = US.recommendations_labels_id)
JOIN recommendations_documents D ON (S.id = D.recommendations_documents_set_id)
JOIN users U ON (U.id = S.user_id)
JOIN user_models UM ON (UM.id = S.user_model_id)
JOIN algorithms A ON (A.id = UM.algorithm_id)
LEFT JOIN persons P ON (P.id = U.person_id)
WHERE US.user_id =258 AND S.delivered IS NOT NULL
was stimmt nicht? die query die ich zum export nutze? oder tmp_user_person?
found why there are duplicates: it's due to the join with tmp_rec_users
the reason for duplicates in tmp_rec_users is in stats_recommendations_users.py:
query = """INSERT INTO tmp_rec_users(user_id) SELECT id FROM users WHERE id NOT IN (1, 2, 27)""";
db.query(query)
#insert users and sets_total
query = """INSERT INTO tmp_rec_users(user_id, sets_total)
SELECT user_id, count(*) AS counter FROM recommendations_documents_set S
WHERE user_id NOT IN (1, 2, 27) AND S.delivered IS NOT NULL AND (offline_evaluator IS NULL || offline_evaluator = 0) GROUP BY user_id""";
db.query(query)
trying to figure out which data exactly I need in this table - really all users or just the ones with recommendations
tmp_user_person should include data from all registered users, also those without activated recommendations
duplicates resolved
Joeran: both queries for the user_id 258 now return the same numbers - could you please verify?
T_stated_docear_on_x_days is computed using the table "users_applications":
SELECT A.user_id, COUNT(DISTINCT(DATE(A.time))) AS counter
FROM users_applications A
WHERE user_id IS NOT NULL
GROUP BY A.user_id
this table is filled whenever Docear searches for updates (on every start by default)
stefan hat alles gefixt
die tabelle tmp_user_person enthält doppelte einträge (es sollte jeder user aber nur einmal vorkommen. auf den ersten blick sieht es so aus als würden die einträge sich bei der spalte T_id unterscheiden. Alle anderen Spalten für doppelte NutzerIDs scheinen identisch zu sein.
Original Message https://mail.google.com/mail/u/0/#inbox/144732b5c0fe8cd8