BeelGroup / Mr.-DLib-Server


Store statistics about the recommendations themselves #21

Closed Joeran closed 8 years ago

Joeran commented 8 years ago

when creating and delivering recommendations, store as many statistics as possible

Joeran commented 8 years ago

For Docear, we stored the following statistics (certainly, not all of them are relevant for Mr. DLib). @MillahMau I suggest that we go through the variables in detail tomorrow.

recommendations_sets.csv
###############################
set_id                  ID of the recommendation set
set_created             Date when the set was created
delivered               Date when the set was delivered
data_source_limitation          We distinguished between mind-maps that were in the users' libraries (data_source_limitation=1) and all mind-maps, i.e. those in the libraries and those stored somewhere else (data_source_limitation=0)
data_element                1 = entire mind-maps were utilized; 2=nodes were utilized
data_element_type           0 = text and citations; 1 = text only; 2 = citations only
citation_weight             If both citations and text were utilized, 'citation_weight' expresses how much stronger citations were weighted than text
element_selection_method        0 = all nodes were utilized; 1=only edited nodes were utilized; 2 = only newly created nodes were utilized; 3 = only moved nodes were utilized
element_amount              Number of mind-maps or nodes that were utilized
root_path               0 = parent nodes were not added to the original selection; 1= parent nodes were added
child_nodes             0 = child nodes were not added to the original selection; 1 = child nodes were added
sibling_nodes               0 = sibling nodes were not added to the original selection; 1 = sibling nodes were added
result_amount               Number of terms or keywords that should be taken from the previously chosen nodes
weighting_scheme            1 = TF only; 2 = TF*IDF
weight_idf              1 = TF*IDuF; 2 = classic TF*IDF
feature_weight_submission       1 = Weight of the features is stored (and later sent to Lucene); 0 =  Weight is not stored
set_trigger             1 = new recommendations were created because a new mind-map was uploaded; 2 = new recs were created because the user had requested recommendations; 3 = new recs were created because Docear automatically displayed recommendations
application_id              Unique ID of the Docear version (higher number means later version)
auto                    1 = Docear requested recommendations automatically; 0 = user requested recommendations explicitly
old                 Is '1' if the set was "old", i.e. a more recent set already existed but had also been delivered already. This could happen if a user requests recommendations very often.
label_id                ID of the label that users see when they receive recommendations (e.g. "Free Research Papers")
label_type              1 = organic; 2 = commercial; 3 = none
label_text              Text of the label
sponsored_prefix            If '1' then the first recommendation was labeled "[Sponsored] <Title of Recommendation>"
highlight               If '1' then the sponsored prefix was highlighted with a red background
user_id_(anonymized)            ID of the user receiving the set of recommendations
user_type               2 = registered user; 3 = anonymous user
registrationdate            Date of registration
year_of_birth               Year the user was born
gender                  0 = female; 1 = male
user_model_creation_time        Time in milliseconds to create the user model
set_computation_time            Time to match user model and recommendations
set_delivery_time           Time from the recommendation request to delivering the recommendations
rec_amount_potential            Number of recommendation candidates (typically 1000+ for term based)
rec_amount_current          Number of actually delivered recommendations (typically 10; or less if rec_amount_potential<10)
rec_original_rank_max           Docear randomly selects ten recommendations from the top 50 candidates. rec_original_rank_max expresses the maximum rank of the ten selected recommendations
rec_original_rank_min           The minimum rank of the selected recommendations
ratio_keywords              If both terms and citations were used, ratio_keywords expresses the percentage of terms (e.g. 0.8 means that out of 10 features, eight were terms)
ratio_references            The ratio of references
mindmap_count_total         Total number of mind-maps the user created
node_count_total            Total number of nodes the user created
paper_count_total           Total number of papers the user has linked
node_count_before_expanded*     The number of nodes that were originally selected for the user modeling process
node_count_expanded*            Number of nodes after siblings, children, etc. were added
feature_count_expanded*         Number of features (terms and/or citations) of the nodes counted in node_count_expanded
feature_count_expanded_unique*      Number of unique features
feature_count_reduced*          Number of features after stop words were removed
feature_count_reduced_unique*       Number of unique features after stop words were removed
um_size_relative*           Relative user model size (e.g. 0.4 if the user model is based on 40 nodes but the user had created 100 nodes)
rec_clicked_count           Number of recommendations that the user had clicked in this set
rec_clicked_ctr             CTR of this set
user_days_started           Number of days on which the user has started Docear at least once
user_days_since_registered      Number of days since the user registered
node_depth              Weighting nodes based on their depth. 0 = off; 1 = the deeper the more weight; 2 = the deeper the less weight
node_depth_metric           0 = absolute value; 1 = natural logarithm; 2 = logarithm to base 10; 3 = square root
no_children             Weighting nodes based on the number of children. 0 = off; 1 = the more children the more weight; 2 = the more children the less weight
no_children_metric          0 = absolute value; 1 = natural logarithm; 2 = logarithm to base 10; 3 = square root
no_siblings             Weighting nodes based on the number of siblings. 0 = off; 1 = the more siblings the more weight; 2 = the more siblings the less weight
no_siblings_metric          0 = absolute value; 1 = natural logarithm; 2 = logarithm to base 10; 3 = square root
word_count              Weighting nodes based on the number of words in that node. 0 = off; 1 = the more words the more weight; 2 = the more words the less weight
word_count_metric           0 = absolute value; 1 = natural logarithm; 2 = logarithm to base 10; 3 = square root
node_weight_combo_scheme        Scheme for combining the individual node weights. 0 = sum; 1 = multiply values; 2 = use maximum value only; 3 = average
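The node-weighting fields above (node_depth, no_children, no_siblings, word_count, their *_metric variants, and node_weight_combo_scheme) could be interpreted roughly as follows. This is a minimal Python sketch of the encoding, not Docear's actual implementation; the function and parameter names are hypothetical.

```python
import math

# Metric transforms as encoded in the *_metric columns:
# 0 = absolute value; 1 = natural logarithm; 2 = log base 10; 3 = square root
METRICS = {
    0: abs,
    1: lambda x: math.log(x) if x > 0 else 0.0,
    2: lambda x: math.log10(x) if x > 0 else 0.0,
    3: math.sqrt,
}

def combine(weights, scheme):
    """Combine partial weights per node_weight_combo_scheme:
    0 = sum; 1 = multiply; 2 = maximum; 3 = average."""
    if scheme == 0:
        return sum(weights)
    if scheme == 1:
        product = 1.0
        for w in weights:
            product *= w
        return product
    if scheme == 2:
        return max(weights)
    if scheme == 3:
        return sum(weights) / len(weights)
    raise ValueError(f"unknown combo scheme {scheme}")

def node_weight(depth, n_children, n_siblings, n_words,
                metric=1, combo_scheme=0):
    """Hypothetical node weight: apply the chosen metric to each raw
    count and combine the partial weights with the chosen scheme."""
    raw = [depth, n_children, n_siblings, n_words]
    partial = [METRICS[metric](x) for x in raw]
    return combine(partial, combo_scheme)
```

For example, with metric 0 (absolute value) and combo scheme 0 (sum), a node at depth 3 with 2 children, 5 siblings, and 10 words would get weight 20. Whether the "more weight"/"less weight" direction (values 1 vs. 2 of each weighting field) inverts the metric is not specified in the listing, so it is omitted here.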

recommendations.csv
###################

set_id                  ID of the recommendation set
set_created             Date when the set was created
delivered               Date when the set was delivered
clicked                 Date when the recommendation was clicked
document_id_(anonymized)        ID of the recommended document
original_rank               Original Lucene rank
presentation_rank           Rank at which the recommendation was displayed
relevance               Relevance score of Lucene
data_source_limitation          We distinguished between mind-maps that were in the users' libraries (data_source_limitation=1) and all mind-maps, i.e. those in the libraries and those stored somewhere else (data_source_limitation=0)
data_element                1 = entire mind-maps were utilized; 2=nodes were utilized
data_element_type           0 = text and citations; 1 = text only; 2 = citations only
citation_weight             If both citations and text were utilized, 'citation_weight' expresses how much stronger citations were weighted than text
element_selection_method        0 = all nodes were utilized; 1=only edited nodes were utilized; 2 = only newly created nodes were utilized; 3 = only moved nodes were utilized
element_amount              Number of mind-maps or nodes that were utilized
root_path               0 = parent nodes were not added to the original selection; 1= parent nodes were added
child_nodes             0 = child nodes were not added to the original selection; 1 = child nodes were added
sibling_nodes               0 = sibling nodes were not added to the original selection; 1 = sibling nodes were added
result_amount               Number of terms or keywords that should be taken from the previously chosen nodes
weighting_scheme            1 = TF only; 2 = TF*IDF
weight_idf              1 = TF*IDuF; 2 = classic TF*IDF
feature_weight_submission       1 = Weight of the features is stored (and later sent to Lucene); 0 =  Weight is not stored
set_trigger             1 = new recommendations were created because a new mind-map was uploaded; 2 = new recs were created because the user had requested recommendations; 3 = new recs were created because Docear automatically displayed recommendations
application_id              Unique ID of the Docear version (higher number means later version)
auto                    1 = Docear requested recommendations automatically; 0 = user requested recommendations explicitly
old                 Is '1' if the set was "old", i.e. a more recent set already existed but had also been delivered already. This could happen if a user requests recommendations very often.
label_id                ID of the label that users see when they receive recommendations (e.g. "Free Research Papers")
label_type              1 = organic; 2 = commercial; 3 = none
label_text              Text of the label
sponsored_prefix            If '1' then the first recommendation was labeled "[Sponsored] <Title of Recommendation>"
highlight               If '1' then the sponsored prefix was highlighted with a red background
user_id_(anonymized)            ID of the user receiving the set of recommendations
user_type               2 = registered user; 3 = anonymous user
year_of_birth               Year the user was born
gender                  0 = female; 1 = male
user_model_creation_time        Time in milliseconds to create the user model
set_computation_time            Time to match user model and recommendations
set_delivery_time           Time from the recommendation request to delivering the recommendations
rec_amount_potential            Number of recommendation candidates (typically 1000+ for term based)
rec_amount_current          Number of actually delivered recommendations (typically 10; or less if rec_amount_potential<10)
rec_original_rank_max           Docear randomly selects ten recommendations from the top 50 candidates. rec_original_rank_max expresses the maximum rank of the ten selected recommendations
rec_original_rank_min           The minimum rank of the selected recommendations
ratio_keywords              If both terms and citations were used, ratio_keywords expresses the percentage of terms (e.g. 0.8 means that out of 10 features, eight were terms)
ratio_references            The ratio of references
mindmap_count_total         Total number of mind-maps the user created
node_count_total            Total number of nodes the user created
paper_count_total           Total number of papers the user has linked
node_count_before_expanded*     The number of nodes that were originally selected for the user modeling process
node_count_expanded*            Number of nodes after siblings, children, etc. were added
feature_count_expanded*         Number of features (terms and/or citations) of the nodes counted in node_count_expanded
feature_count_expanded_unique*      Number of unique features
feature_count_reduced*          Number of features after stop words were removed
feature_count_reduced_unique*       Number of unique features after stop words were removed
um_size_relative*           Relative user model size (e.g. 0.4 if the user model is based on 40 nodes but the user had created 100 nodes)
rec_clicked_count           Number of recommendations that the user had clicked in this set
rec_clicked_ctr             CTR of this set
user_days_started           Number of days on which the user has started Docear at least once
user_days_since_registered      Number of days since the user registered
node_depth              Weighting nodes based on their depth. 0 = off; 1 = the deeper the more weight; 2 = the deeper the less weight
node_depth_metric           0 = absolute value; 1 = natural logarithm; 2 = logarithm to base 10; 3 = square root
no_children             Weighting nodes based on the number of children. 0 = off; 1 = the more children the more weight; 2 = the more children the less weight
no_children_metric          0 = absolute value; 1 = natural logarithm; 2 = logarithm to base 10; 3 = square root
no_siblings             Weighting nodes based on the number of siblings. 0 = off; 1 = the more siblings the more weight; 2 = the more siblings the less weight
no_siblings_metric          0 = absolute value; 1 = natural logarithm; 2 = logarithm to base 10; 3 = square root
word_count              Weighting nodes based on the number of words in that node. 0 = off; 1 = the more words the more weight; 2 = the more words the less weight
word_count_metric           0 = absolute value; 1 = natural logarithm; 2 = logarithm to base 10; 3 = square root
node_weight_combo_scheme        Scheme for combining the individual node weights. 0 = sum; 1 = multiply values; 2 = use maximum value only; 3 = average
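As an illustration of how these logs could be analyzed, here is a minimal Python sketch that derives rec_clicked_ctr per set from recommendations.csv. It assumes a comma-separated export with a header row and an empty 'clicked' field when the recommendation was never clicked; the sample rows and delimiter are assumptions, not the actual export format.

```python
import csv
import io

def ctr_per_set(recs_csv):
    """Compute the click-through rate per recommendation set:
    clicked recommendations divided by delivered ones. A recommendation
    counts as clicked if its 'clicked' date field is non-empty."""
    delivered, clicked = {}, {}
    for row in csv.DictReader(recs_csv):
        set_id = row["set_id"]
        delivered[set_id] = delivered.get(set_id, 0) + 1
        if row["clicked"]:
            clicked[set_id] = clicked.get(set_id, 0) + 1
    return {s: clicked.get(s, 0) / n for s, n in delivered.items()}

# Hypothetical sample rows (the real file has many more columns):
sample = io.StringIO(
    "set_id,clicked,document_id_(anonymized)\n"
    "1,2013-05-02,d17\n"
    "1,,d42\n"
    "2,,d99\n"
)
print(ctr_per_set(sample))  # {'1': 0.5, '2': 0.0}
```

The same set_id column would also serve as the join key to pull in the set-level variables (set_trigger, label_type, etc.) from recommendations_sets.csv when analyzing what influences CTR.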