Closed: Joeran closed this issue 8 years ago
For Docear, we stored these statistics (of course, not all of them are relevant for Mr. DLib). @MillahMau I suggest that we go through the variables in detail tomorrow.
recommendations_sets.csv
###############################
set_id ID of the recommendation set
set_created Date when the set was created
delivered Date when the set was delivered
data_source_limitation We distinguished between mind-maps that were in the users' libraries (data_source_limitation=1) and all mind-maps, i.e. those in the libraries and those stored somewhere else (data_source_limitation=0)
data_element 1 = entire mind-maps were utilized; 2=nodes were utilized
data_element_type 0 = text and citations; 1 = text only; 2 = citations only
citation_weight If both citations and text were utilized, 'citation_weight' expresses how much stronger citations were weighted than text
element_selection_method 0 = all nodes were utilized; 1=only edited nodes were utilized; 2 = only newly created nodes were utilized; 3 = only moved nodes were utilized
element_amount Number of mind-maps or nodes that were utilized
root_path 0 = parent nodes were not added to the original selection; 1 = parent nodes were added
child_nodes 0 = child nodes were not added to the original selection; 1 = child nodes were added
sibling_nodes 0 = sibling nodes were not added to the original selection; 1 = sibling nodes were added
result_amount Number of terms or keywords that should be taken from the previously chosen nodes
weighting_scheme 1 = TF only; 2 = TF*IDF
weight_idf 1 = TF*IDuF; 2 = classic TF*IDF
feature_weight_submission 1 = Weight of the features is stored (and later sent to Lucene); 0 = Weight is not stored
set_trigger 1 = new recommendations were created because a new mind-map was uploaded; 2 = new recs were created because the user had requested recommendations; 3 = new recs were created because Docear automatically displayed recommendations
application_id Unique ID of the Docear version (higher number means later version)
auto 1 = Docear requested recommendations automatically; 0 = user requested recommendations explicitly
old Is '1' if the set was "old", i.e. a more recent set already existed but had also already been delivered. This could happen if a user requests recommendations very often.
label_id ID of the label that users see when they receive recommendations (e.g. "Free Research Papers")
label_type 1 = organic; 2 = commercial; 3 = none
label_text Text of the label
sponsored_prefix If '1' then the first recommendation was labeled "[Sponsored] <Title of Recommendation>"
highlight If '1' then the sponsored prefix was highlighted with a red background
user_id_(anonymized) ID of the user receiving the set of recommendations
user_type 2 = registered user; 3 = anonymous user
registrationdate Date of registration
year_of_birth Year the user was born
gender 0 = female; 1 = male
user_model_creation_time Time in milliseconds to create the user model
set_computation_time Time to match user model and recommendations
set_delivery_time Time from the recommendation request to delivering the recommendations
rec_amount_potential Number of recommendation candidates (typically 1,000+ for term-based recommendations)
rec_amount_current Number of actually delivered recommendations (typically 10; or less if rec_amount_potential<10)
rec_original_rank_max Docear randomly selects ten recommendations from the top 50 candidates. rec_original_rank_max expresses the maximum rank of the ten selected recommendations
rec_original_rank_min The minimum rank of the selected recommendations
ratio_keywords If both terms and citations were used, ratio_keywords expresses the percentage of terms (e.g. 0.8 means that out of 10 features, eight were terms)
ratio_references The ratio of references
mindmap_count_total Total number of mind-maps the user created
node_count_total Total number of nodes the user created
paper_count_total Total number of papers the user has linked
node_count_before_expanded* The number of nodes that were originally selected for the user modeling process
node_count_expanded* Number of nodes after siblings, children, etc. were added
feature_count_expanded* Number of features (terms and/or citations) of those nodes counted in node_count_expanded
feature_count_expanded_unique* Number of unique features
feature_count_reduced* Number of features after stop words were removed
feature_count_reduced_unique* Number of unique features after stop words were removed
um_size_relative* Relative user model size (e.g. 0.4 if the user model is based on 40 nodes but the user had created 100 nodes)
rec_clicked_count Number of recommendations that the user had clicked in this set
rec_clicked_ctr CTR of this set
user_days_started Number of days on which the user has started Docear at least once
user_days_since_registered Number of days since the user registered
node_depth Weighting nodes based on their depth. 0 = off; 1 = the deeper the more weight; 2 = the deeper the less weight
node_depth_metric 0 = absolute value; 1 = natural logarithm; 2 = logarithm to base 10; 3 = square root
no_children Weighting nodes based on the number of children. 0 = off; 1 = the more children the more weight; 2 = the more children the less weight
no_children_metric 0 = absolute value; 1 = natural logarithm; 2 = logarithm to base 10; 3 = square root
no_siblings Weighting nodes based on the number of siblings. 0 = off; 1 = the more siblings the more weight; 2 = the more siblings the less weight
no_siblings_metric 0 = absolute value; 1 = natural logarithm; 2 = logarithm to base 10; 3 = square root
word_count Weighting nodes based on the number of words in the node. 0 = off; 1 = the more words the more weight; 2 = the more words the less weight
word_count_metric 0 = absolute value; 1 = natural logarithm; 2 = logarithm to base 10; 3 = square root
node_weight_combo_scheme Scheme for combining the individual node weights. 0 = sum; 1 = multiply values; 2 = use maximum value only; 3 = average
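The four weighting factors and node_weight_combo_scheme above can be sketched roughly as follows. This is an illustrative reconstruction from the variable descriptions, not Docear's actual code; the function names and the max(value, 1) guard against log(0) are assumptions.

```python
import math

# Metric codes as documented: 0 = absolute value, 1 = natural logarithm,
# 2 = logarithm to base 10, 3 = square root.
METRICS = {0: lambda x: x,
           1: math.log,
           2: math.log10,
           3: math.sqrt}

def component(value, mode, metric):
    """mode: 0 = off; 1 = higher value -> more weight; 2 = -> less weight."""
    if mode == 0:
        return None
    w = METRICS[metric](max(value, 1))  # guard against log(0) (assumption)
    if mode == 1:
        return w
    return 1.0 / w if w else 0.0        # invert so higher value -> less weight

def node_weight(depth, children, siblings, words, settings, combo_scheme):
    """settings: one (mode, metric) pair per factor, in the order above."""
    parts = [component(v, mode, metric)
             for v, (mode, metric) in zip((depth, children, siblings, words),
                                          settings)]
    parts = [p for p in parts if p is not None]
    if not parts:
        return 1.0                       # all factors off (assumption)
    if combo_scheme == 0:                # 0 = sum
        return sum(parts)
    if combo_scheme == 1:                # 1 = multiply values
        return math.prod(parts)
    if combo_scheme == 2:                # 2 = use maximum value only
        return max(parts)
    return sum(parts) / len(parts)       # 3 = average
```

For example, with only node_depth enabled (mode 1, metric 0) and combo scheme 0, a node at depth 4 gets weight 4.0.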
recommendations.csv
###################
set_id ID of the recommendation set
set_created Date when the set was created
delivered Date when the set was delivered
clicked Date when the recommendation was clicked
document_id_(anonymized) ID of the recommended document
original_rank Original Lucene rank
presentation_rank Rank at which the recommendation was displayed
relevance Relevance score of Lucene
data_source_limitation We distinguished between mind-maps that were in the users' libraries (data_source_limitation=1) and all mind-maps, i.e. those in the libraries and those stored somewhere else (data_source_limitation=0)
data_element 1 = entire mind-maps were utilized; 2=nodes were utilized
data_element_type 0 = text and citations; 1 = text only; 2 = citations only
citation_weight If both citations and text were utilized, 'citation_weight' expresses how much stronger citations were weighted than text
element_selection_method 0 = all nodes were utilized; 1=only edited nodes were utilized; 2 = only newly created nodes were utilized; 3 = only moved nodes were utilized
element_amount Number of mind-maps or nodes that were utilized
root_path 0 = parent nodes were not added to the original selection; 1 = parent nodes were added
child_nodes 0 = child nodes were not added to the original selection; 1 = child nodes were added
sibling_nodes 0 = sibling nodes were not added to the original selection; 1 = sibling nodes were added
result_amount Number of terms or keywords that should be taken from the previously chosen nodes
weighting_scheme 1 = TF only; 2 = TF*IDF
weight_idf 1 = TF*IDuF; 2 = classic TF*IDF
feature_weight_submission 1 = Weight of the features is stored (and later sent to Lucene); 0 = Weight is not stored
set_trigger 1 = new recommendations were created because a new mind-map was uploaded; 2 = new recs were created because the user had requested recommendations; 3 = new recs were created because Docear automatically displayed recommendations
application_id Unique ID of the Docear version (higher number means later version)
auto 1 = Docear requested recommendations automatically; 0 = user requested recommendations explicitly
old Is '1' if the set was "old", i.e. a more recent set already existed but had also already been delivered. This could happen if a user requests recommendations very often.
label_id ID of the label that users see when they receive recommendations (e.g. "Free Research Papers")
label_type 1 = organic; 2 = commercial; 3 = none
label_text Text of the label
sponsored_prefix If '1' then the first recommendation was labeled "[Sponsored] <Title of Recommendation>"
highlight If '1' then the sponsored prefix was highlighted with a red background
user_id_(anonymized) ID of the user receiving the set of recommendations
user_type 2 = registered user; 3 = anonymous user
year_of_birth Year the user was born
gender 0 = female; 1 = male
user_model_creation_time Time in milliseconds to create the user model
set_computation_time Time to match user model and recommendations
set_delivery_time Time from the recommendation request to delivering the recommendations
rec_amount_potential Number of recommendation candidates (typically 1,000+ for term-based recommendations)
rec_amount_current Number of actually delivered recommendations (typically 10; or less if rec_amount_potential<10)
rec_original_rank_max Docear randomly selects ten recommendations from the top 50 candidates. rec_original_rank_max expresses the maximum rank of the ten selected recommendations
rec_original_rank_min The minimum rank of the selected recommendations
ratio_keywords If both terms and citations were used, ratio_keywords expresses the percentage of terms (e.g. 0.8 means that out of 10 features, eight were terms)
ratio_references The ratio of references
mindmap_count_total Total number of mind-maps the user created
node_count_total Total number of nodes the user created
paper_count_total Total number of papers the user has linked
node_count_before_expanded* The number of nodes that were originally selected for the user modeling process
node_count_expanded* Number of nodes after siblings, children, etc. were added
feature_count_expanded* Number of features (terms and/or citations) of those nodes counted in node_count_expanded
feature_count_expanded_unique* Number of unique features
feature_count_reduced* Number of features after stop words were removed
feature_count_reduced_unique* Number of unique features after stop words were removed
um_size_relative* Relative user model size (e.g. 0.4 if the user model is based on 40 nodes but the user had created 100 nodes)
rec_clicked_count Number of recommendations that the user had clicked in this set
rec_clicked_ctr CTR of this set
user_days_started Number of days on which the user has started Docear at least once
user_days_since_registered Number of days since the user registered
node_depth Weighting nodes based on their depth. 0 = off; 1 = the deeper the more weight; 2 = the deeper the less weight
node_depth_metric 0 = absolute value; 1 = natural logarithm; 2 = logarithm to base 10; 3 = square root
no_children Weighting nodes based on the number of children. 0 = off; 1 = the more children the more weight; 2 = the more children the less weight
no_children_metric 0 = absolute value; 1 = natural logarithm; 2 = logarithm to base 10; 3 = square root
no_siblings Weighting nodes based on the number of siblings. 0 = off; 1 = the more siblings the more weight; 2 = the more siblings the less weight
no_siblings_metric 0 = absolute value; 1 = natural logarithm; 2 = logarithm to base 10; 3 = square root
word_count Weighting nodes based on the number of words in the node. 0 = off; 1 = the more words the more weight; 2 = the more words the less weight
word_count_metric 0 = absolute value; 1 = natural logarithm; 2 = logarithm to base 10; 3 = square root
node_weight_combo_scheme Scheme for combining the individual node weights. 0 = sum; 1 = multiply values; 2 = use maximum value only; 3 = average
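Since recommendations.csv repeats the per-set columns on every recommendation row, per-set statistics such as rec_clicked_count and rec_clicked_ctr can be recomputed by grouping rows on set_id. A minimal sketch with a made-up inline sample; the comma-separated dialect and an empty 'clicked' field for un-clicked rows are assumptions.

```python
import csv
import io

# Made-up sample standing in for recommendations.csv: set 1 delivered 10
# recommendations (one of them clicked), set 2 delivered 5 (none clicked).
sample = io.StringIO(
    "set_id,clicked,rec_amount_current\n"
    "1,2013-05-01 10:00:00,10\n"
    "1,,10\n"
    "1,,10\n"
    "2,,5\n"
)

clicks, delivered = {}, {}
for row in csv.DictReader(sample):
    sid = row["set_id"]
    # rec_amount_current is a set-level column, identical on every row of a set
    delivered[sid] = int(row["rec_amount_current"])
    # a non-empty 'clicked' date marks a clicked recommendation (assumption)
    clicks[sid] = clicks.get(sid, 0) + (1 if row["clicked"] else 0)

# CTR per set = clicked recommendations / delivered recommendations
ctr = {sid: clicks[sid] / delivered[sid] for sid in delivered}
```

With the sample above, set 1 gets a CTR of 0.1 (1 click out of 10 delivered recommendations) and set 2 a CTR of 0.0.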
When creating and delivering recommendations, store as many statistics as possible.
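The remark above could be implemented as a small logging helper that assembles one statistics row at delivery time and appends it to recommendations_sets.csv. The helper name and the field subset are hypothetical; only the column names come from the list above.

```python
import csv
import io
from datetime import datetime, timezone

# Illustrative subset of the recommendations_sets.csv columns documented above.
FIELDS = ["set_id", "set_created", "delivered",
          "weighting_scheme", "rec_amount_current",
          "user_model_creation_time"]

def log_set_statistics(writer, **stats):
    """Write one statistics row; columns without a value stay empty."""
    row = {f: str(stats.get(f, "")) for f in FIELDS}
    # stamp the delivery time when the row is written (assumption)
    row["delivered"] = datetime.now(timezone.utc).isoformat()
    writer.writerow(row)

# In-memory buffer standing in for an appended CSV file.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=FIELDS)
writer.writeheader()
log_set_statistics(writer, set_id=1, weighting_scheme=2,
                   rec_amount_current=10, user_model_creation_time=152)
```

The point of the keyword-argument interface is that new statistics can be logged by only extending FIELDS, without touching every call site.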