Open SeverinJB opened 5 years ago
Peroni's "additions sections" is unclear.
Original:
- If the parameter year is not None, the keys lesser than year are excluded from the dictionary
- If the parameter year is None, the lower publication year of the articles authored by aut is considered as starting year
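The two rules above can be sketched on plain dictionaries. This is only an illustration of one reading of the specification: the helper names `starting_year` and `trim_years` and the sample numbers are made up and are not part of the project code.

```python
def starting_year(author_years, year=None):
    """Pick the first year to report, following the two rules above."""
    if year is not None:
        return year                # rule 1: the caller fixes the starting year
    return min(author_years)       # rule 2: earliest publication year of the author


def trim_years(counts, author_years, year=None):
    """Drop every key that is lower than the starting year."""
    start = starting_year(author_years, year)
    return {y: n for y, n in counts.items() if y >= start}


counts = {2014: 3, 2015: 0, 2016: 7, 2017: 2}   # made-up citation counts per year
author_years = [2015, 2017]                      # made-up publication years of aut

print(trim_years(counts, author_years, 2016))    # year given
print(trim_years(counts, author_years))          # year is None
```

With `year=2016` the keys 2014 and 2015 are dropped; with `year=None` the dictionary starts at 2015, the author's earliest publication year.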
Let's call this V1.0! The code works but is not yet very sexy.
+++ do_cit_count_year +++
def do_cit_count_year(data, sse, aut, year):
    data = data.drop(columns='known_reference')
    cit_count_year, doi_year = {}, {}
    if year is None:
        for item in sse.data:
            if aut in item['authors']:
                doi_year[item['doi']] = item['year']
            cit_count_year[item['year']] = 0
    else:
        for item in sse.data:
            if aut in item['authors'] and year <= int(item['year']):
                doi_year[item['doi']] = item['year']
            if year <= int(item['year']):
                cit_count_year[item['year']] = 0
    for index, row in data.iterrows():
        if index in doi_year:
            cit_count_year[doi_year[index]] = cit_count_year[doi_year[index]] + row['citation_number']
    return cit_count_year
Like always, all files are needed for execution: sne.py, execution_example.py, and the_lads.py
+++ execution_example.py +++
from sne import ScholarlyNetworkEngine
my_sne = ScholarlyNetworkEngine("metadata_sample.csv", "citations_sample.csv")
my_sne.cit_count_year('Michel, Dumontier', 2016)
+++ the_lads.py +++
import pandas as pd
import networkx as nx


def process_citation_data(file_path):
    df = pd.read_csv(file_path, index_col='DOI', header=0, names=['DOI', 'citation_number', 'known_reference'])
    return df


def do_cit_count_year(data, sse, aut, year):
    data = data.drop(columns='known_reference')
    cit_count_year, doi_year = {}, {}
    if year is None:
        for item in sse.data:
            if aut in item['authors']:
                doi_year[item['doi']] = item['year']
            cit_count_year[item['year']] = 0
    else:
        for item in sse.data:
            if aut in item['authors'] and year <= int(item['year']):
                doi_year[item['doi']] = item['year']
            if year <= int(item['year']):
                cit_count_year[item['year']] = 0
    for index, row in data.iterrows():
        if index in doi_year:
            cit_count_year[doi_year[index]] = cit_count_year[doi_year[index]] + row['citation_number']
    return cit_count_year
Hey Sevi! I spent five hours working on this function until I realized I was answering a different question than was asked (number of papers and not citations). Anyways, after realizing this, I went to your code and dissected it to see how we could make it sexier. I believe it is very nice already. I modified it just a bit to make it a little shorter (15 instead of 18 lines), but it is basically the same:
def do_cit_count_year(data, sse, aut, year):
    cit_count_year, doi_year = {}, {}
    for item in sse.data:
        if year != None and year <= int(item['year']):
            cit_count_year[item['year']] = 0
            if aut in item['authors']:
                doi_year[item['doi']] = item['year']
        elif year is None:
            cit_count_year[item['year']] = 0
            if aut in item['authors']:
                doi_year[item['doi']] = item['year']
    for index, row in data.iterrows():
        if index in doi_year:
            cit_count_year[doi_year[index]] = cit_count_year[doi_year[index]] + row['citation_number']
    return cit_count_year
I really spent quite a long while trying many different edits to shorten it, but every time there was a problem with the 0 based values appearing.
One question I had was about your inclusion of data = data.drop(columns='known_reference'). Why did you feel dropping it was necessary? I don't see how this makes the code more efficient, but please do let me know if I am missing something.
Let me know if you want us to keep working on this or if I should add it to our final the_lads.py issue :)
Just not to lose this code, this is the one I wrote with the mistaken task at hand:
# this code works to see the number of papers authored in certain years
# def do_cit_count_year(data, sse, aut, year):
#     l = []
#     min = 2015 if year == None else year
#     for i in sse.search(aut, 'authors', False, partial_res=None):
#         if int(i['year']) >= min:
#             if year != None and year > min: min = year
#             l.append((i['year'], 1))
#     a = range(min, 2020)
#     cit_count_year = {str(year):0 for year in a}
#     for i in l:
#         if i[0] not in cit_count_year:
#             cit_count_year[i[0]] = i[1]
#         else:
#             cit_count_year[i[0]] += 1
#     return cit_count_year
Also, I tested your edited code with many of the top authors (according to sse.top_ten_authors()), and it seems to work perfectly!
Hi @delfimpandiani, I am sorry to hear about your story but it was surely a learning experience nonetheless.
This is so sexy! I am in love. 💘 In principle, the code can be approved. So, let's say this version is approved, and in case anybody adjusts something before the 17th, we can comment on this issue with the new code. By the way, I used list comprehensions instead of standard loops in many places, not only because a one-liner is sexy but because they are, in specific cases, a tiny bit faster. Namely, a list comprehension is faster when it replaces a for loop that builds a list.
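For anyone who wants to check the speed claim on their own machine, here is a minimal timing sketch using the standard-library `timeit` module. The two helper functions are made up for the comparison and are unrelated to the project code, and the measured difference will vary with the Python version and hardware.

```python
import timeit


def squares_loop(n):
    """Build a list with an explicit for loop and append()."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_comprehension(n):
    """Build the same list with a list comprehension."""
    return [i * i for i in range(n)]


n = 1_000
loop_time = timeit.timeit(lambda: squares_loop(n), number=100)
comp_time = timeit.timeit(lambda: squares_comprehension(n), number=100)

print(f"for loop:           {loop_time:.4f} s")
print(f"list comprehension: {comp_time:.4f} s")
```

Both functions return the same list, so any difference in the printed times comes only from how the list is built.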
One question I had was about your inclusion of data = data.drop(columns='known_reference'). Why did you feel dropping it was necessary? I don't see how this makes the code more efficient, but please do let me know if I am missing something.
You're right. There is no immediate need for dropping it. I did so based on the assumption the code would be faster if our processed data object is smaller. However, I have not read anything about this topic and I have no evidence whether or not this is true.
Yay!! I am glad you find it sexy 👍 I added it to our the_lads.py. Good to know about the one-liners! I did not know that, but it makes a lot of sense. If I read anything related to efficiency and dropping columns, I'll note it here!
def do_cit_count_year(data, sse, aut, year):
    doi_year, clean_cit_count_year = {}, {}
    a = range(0, 2050)
    cit_count_year = {str(yr):0 for yr in a}
    for item in sse.data:
        if year != None and year <= int(item['year']):
            if aut in item['authors']:
                doi_year[item['doi']] = item['year']
        elif year is None:
            if aut in item['authors']:
                doi_year[item['doi']] = item['year']
    for index, row in data.iterrows():
        if index in doi_year:
            cit_count_year[doi_year[index]] = cit_count_year[doi_year[index]] + row['citation_number']
    for key, value in cit_count_year.items():
        if year != None:
            if int(key) >= year and int(key) < 2020:
                clean_cit_count_year[key] = value
        elif year is None:
            if value != 0:
                clean_cit_count_year[key] = value # first year with citations
                # ????????
                # how to include all the following years after the first one with a publication, regardless of if they have citations or not
    return clean_cit_count_year
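One possible answer to the open question in the comments above, as a rough sketch: take the earliest year in doi_year as the starting point and copy every year from there up to a chosen last year, whether or not it has citations. The helper name `years_from_first_publication`, the `last_year` parameter, and the sample DOIs are all hypothetical; the string keys mirror the version above.

```python
def years_from_first_publication(cit_count_year, doi_year, last_year):
    """Keep every year from the author's first publication up to last_year.

    cit_count_year maps year strings to citation totals, doi_year maps
    DOIs to year strings. Years without citations are kept with a 0.
    """
    first_year = min(int(y) for y in doi_year.values())
    return {
        str(y): cit_count_year.get(str(y), 0)
        for y in range(first_year, last_year + 1)
    }


# Made-up sample data in the same shape as the version above.
doi_year = {'10.1/a': '2015', '10.1/b': '2017'}
cit_count_year = {str(yr): 0 for yr in range(0, 2050)}
cit_count_year['2015'] = 4
cit_count_year['2017'] = 1

print(years_from_first_publication(cit_count_year, doi_year, 2019))
```

Here 2016, 2018, and 2019 stay in the result with a count of 0, because the range starts at the first publication year and not at the first year with citations.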
Looking goood! I wrote a similar one which needs to be cleaned as well.
# assumption: the module starts with "import datetime as date" (the import is not shown in this thread)
def do_cit_count_year(data, sse, aut, year):
    doi_year = {}
    min_year = int(min([each_book['year'] for each_book in sse.data]))
    if year is None:
        cit_count_year = {str(item): 0 for item in range(min_year, date.datetime.today().year + 1)}
    elif year < date.datetime.today().year:
        cit_count_year = {str(item): 0 for item in range(year, date.datetime.today().year + 1)}
    elif year > date.datetime.today().year:
        cit_count_year = {str(year): 0}
    for item in sse.data:
        if year is not None and year <= int(item['year']):
            cit_count_year[item['year']] = 0
            if aut in item['authors']:
                doi_year[item['doi']] = item['year']
        elif year is None:
            cit_count_year[item['year']] = 0
            if aut in item['authors']:
                doi_year[item['doi']] = item['year']
    for index, row in data.iterrows():
        if index in doi_year:
            cit_count_year[doi_year[index]] = cit_count_year[doi_year[index]] + row['citation_number']
    print(cit_count_year)
    return cit_count_year
2 main improvements:
So I changed the code so that the returned dictionary's keys are integers as opposed to strings:
def do_cit_count_year(data, sse, aut, year):
    doi_year = {}
    min_year = int(min([paper['year'] for paper in sse.data]))
    if year is None:
        cit_count_year = {item: 0 for item in range(min_year, date.datetime.today().year + 1)}
    elif year < date.datetime.today().year:
        cit_count_year = {item: 0 for item in range(year, date.datetime.today().year + 1)}
    elif year > date.datetime.today().year:
        cit_count_year = {year: 0}
    for item in sse.data:
        if aut in item['authors']:
            if year is not None and year <= int(item['year']) or year is None:
                doi_year[item['doi']] = int(item['year'])
    for index, row in data.iterrows():
        if index in doi_year:
            cit_count_year[doi_year[index]] = cit_count_year[doi_year[index]] + row['citation_number']
    return cit_count_year
I have updated the_lads.py
@delfimpandiani, Thank you so much for noticing this! I completely overlooked this. Sorry!
I'll check the code in detail later. 😀
@delfimpandiani, I made adjustments only regarding the initial creation of the empty dictionary.
def do_cit_count_year(data, sse, aut, year):
    doi_year = {}
    min_year = min([int(paper['year']) for paper in sse.data])
    year_aut = max([int(paper['year']) for paper in sse.data if aut in paper['authors']])
    if year is None:
        max_year = max([year_aut, date.datetime.today().year])
        cit_count_year = {item: 0 for item in range(min_year, max_year + 1)}
    else:
        max_year = max([year_aut, date.datetime.today().year, year])
        cit_count_year = {item: 0 for item in range(year, max_year + 1)}
    for item in sse.data:
        if aut in item['authors']:
            if year is not None and year <= int(item['year']) or year is None:
                doi_year[item['doi']] = int(item['year'])
    for index, row in data.iterrows():
        if index in doi_year:
            cit_count_year[doi_year[index]] = cit_count_year[doi_year[index]] + row['citation_number']
    print(cit_count_year)
    return cit_count_year
The only change is adding "if aut in paper['authors']" to the definition of min_year:
def do_cit_count_year(data, sse, aut, year):
    doi_year = {}
    min_year = min([int(paper['year']) for paper in sse.data if aut in paper['authors']])
    year_aut = max([int(paper['year']) for paper in sse.data if aut in paper['authors']])
    if year is None:
        max_year = max([year_aut, date.datetime.today().year])
        cit_count_year = {item: 0 for item in range(min_year, max_year + 1)}
    else:
        max_year = max([year_aut, date.datetime.today().year, year])
        cit_count_year = {item: 0 for item in range(year, max_year + 1)}
    for item in sse.data:
        if aut in item['authors']:
            if year is not None and year <= int(item['year']) or year is None:
                doi_year[item['doi']] = int(item['year'])
    for index, row in data.iterrows():
        if index in doi_year:
            cit_count_year[doi_year[index]] = cit_count_year[doi_year[index]] + row['citation_number']
    return cit_count_year
def do_cit_count_year(data, sse, aut, year):
    doi_year = {}
    years_aut = [int(paper['year']) for paper in sse.data if aut in paper['authors']]
    min_years_aut = min(years_aut)
    max_years_aut = max(years_aut)
    if year is None:
        max_year = max([max_years_aut, date.datetime.today().year])
        cit_count_year = {item: 0 for item in range(min_years_aut, max_year + 1)}
    else:
        max_year = max([max_years_aut, date.datetime.today().year, year])
        cit_count_year = {item: 0 for item in range(year, max_year + 1)}
    for item in sse.data:
        if aut in item['authors']:
            if year is not None and year <= int(item['year']) or year is None:
                doi_year[item['doi']] = int(item['year'])
    for index, row in data.iterrows():
        if index in doi_year:
            cit_count_year[doi_year[index]] = cit_count_year[doi_year[index]] + row['citation_number']
    return cit_count_year
It's funny how a lot can be refactored quite neatly. What a pity I didn't have the energy before. 😢 But I don't know whether it really makes a difference in computation time.
def do_cit_count_year(data, sse, aut, year):
    doi_dict = {paper['doi']: int(paper['year']) for paper in sse.data if aut in paper['authors']}
    min_year = min(doi_dict.values()) if year is None else year
    max_year = max(date.datetime.today().year, max(doi_dict.values()), min_year)
    cit_dict = {item: 0 for item in range(min_year, max_year + 1)}
    for index, row in data.iterrows():
        if index in doi_dict and doi_dict[index] in cit_dict:
            cit_dict[doi_dict[index]] = cit_dict[doi_dict[index]] + row['citation_number']
    return cit_dict
just looked at this in detail and it is SO beautiful.
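To trace the refactored logic by hand without the CSV files, here is a pandas-free stand-in. It is only a sketch under several assumptions: the name `cit_count_year_sketch` is made up, the metadata is a plain list of dictionaries, the citation data is a plain DOI-to-count dictionary, the authors field is assumed to be a single string, and the current year is passed in as `current_year` so the result does not depend on the day it runs. The sample DOIs and author names are invented.

```python
def cit_count_year_sketch(citations, papers, aut, year, current_year):
    """Sum citations per publication year for one author (stand-in version)."""
    # DOI -> publication year, only for papers authored by aut
    doi_dict = {p['doi']: int(p['year']) for p in papers if aut in p['authors']}
    min_year = min(doi_dict.values()) if year is None else year
    max_year = max(current_year, max(doi_dict.values()), min_year)
    # every year in the range starts at 0, so years without citations are kept
    cit_dict = {y: 0 for y in range(min_year, max_year + 1)}
    for doi, count in citations.items():
        if doi in doi_dict and doi_dict[doi] in cit_dict:
            cit_dict[doi_dict[doi]] += count
    return cit_dict


papers = [
    {'doi': '10.1/a', 'year': '2015', 'authors': 'Doe, Jane; Roe, Rick'},
    {'doi': '10.1/b', 'year': '2017', 'authors': 'Doe, Jane'},
    {'doi': '10.1/c', 'year': '2016', 'authors': 'Roe, Rick'},
]
citations = {'10.1/a': 4, '10.1/b': 1, '10.1/c': 9}

print(cit_count_year_sketch(citations, papers, 'Doe, Jane', None, 2018))
print(cit_count_year_sketch(citations, papers, 'Doe, Jane', 2016, 2018))
```

One thing the sketch makes visible: if the author has no papers in the data, doi_dict is empty and min() or max() raises a ValueError, so a caller may want to handle that case first.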
Functionality: