SeverinJB / the_lads

Project "Scholarly Network Engine" - Examination for "Computational Thinking and Programming" - Second-cycle degree "Digital Humanities and Digital Knowledge" at the University of Bologna

do_cit_count_year(data, sse, aut, year) #7

Open SeverinJB opened 5 years ago

SeverinJB commented 5 years ago

Functionality:

SeverinJB commented 5 years ago

Peroni's "additions" section is unclear.

Original:

  • If the parameter year is not None, the keys lesser than year are excluded from the dictionary
  • If the parameter year is None, the lower publication year of the articles authored by aut is considered as starting year
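
One possible reading of the two rules above, sketched with plain dictionaries. The helper names `starting_year` and `apply_year_rule` and the sample numbers are made up for illustration and are not part of the project description.

```python
def starting_year(author_pub_years, year=None):
    """Return the first year that should appear in the result dictionary.

    author_pub_years: publication years (ints) of the author's articles.
    year: the optional lower bound from the task description.
    """
    if year is not None:
        # Explicit bound: keys smaller than `year` are excluded.
        return year
    # No bound given: start at the author's earliest publication year.
    return min(author_pub_years)


def apply_year_rule(counts, author_pub_years, year=None):
    """Keep only the entries of `counts` from the starting year onwards."""
    start = starting_year(author_pub_years, year)
    return {y: n for y, n in counts.items() if y >= start}


counts = {2014: 3, 2015: 0, 2016: 7, 2017: 2}
print(apply_year_rule(counts, [2015, 2017], year=2016))  # {2016: 7, 2017: 2}
print(apply_year_rule(counts, [2015, 2017]))             # {2015: 0, 2016: 7, 2017: 2}
```
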
SeverinJB commented 5 years ago

Let's call this V1.0! The code works but is not yet very sexy.

+++ do_cit_count_year +++

def do_cit_count_year(data, sse, aut, year):
    data = data.drop(columns='known_reference')
    cit_count_year, doi_year = {}, {}

    if year is None:
        for item in sse.data:
            if aut in item['authors']:
                doi_year[item['doi']] = item['year']
            cit_count_year[item['year']] = 0
    else:
        for item in sse.data:
            if aut in item['authors'] and year <= int(item['year']):
                doi_year[item['doi']] = item['year']
            if year <= int(item['year']):
                cit_count_year[item['year']] = 0

    for index, row in data.iterrows():
        if index in doi_year:
            cit_count_year[doi_year[index]] = cit_count_year[doi_year[index]] + row['citation_number']

    return cit_count_year

As always, all three files are needed for execution: sne.py, execution_example.py, and the_lads.py.

+++ execution_example.py +++

from sne import ScholarlyNetworkEngine
my_sne = ScholarlyNetworkEngine("metadata_sample.csv", "citations_sample.csv")
my_sne.cit_count_year('Michel, Dumontier', 2016)

+++ the_lads.py +++

import pandas as pd
import networkx as nx

def process_citation_data(file_path):
    df = pd.read_csv(file_path, index_col='DOI', header=0, names=['DOI', 'citation_number', 'known_reference'])
    return df

def do_cit_count_year(data, sse, aut, year):
    data = data.drop(columns='known_reference')
    cit_count_year, doi_year = {}, {}

    if year is None:
        for item in sse.data:
            if aut in item['authors']:
                doi_year[item['doi']] = item['year']
            cit_count_year[item['year']] = 0
    else:
        for item in sse.data:
            if aut in item['authors'] and year <= int(item['year']):
                doi_year[item['doi']] = item['year']
            if year <= int(item['year']):
                cit_count_year[item['year']] = 0

    for index, row in data.iterrows():
        if index in doi_year:
            cit_count_year[doi_year[index]] = cit_count_year[doi_year[index]] + row['citation_number']

    return cit_count_year
delfimpandiani commented 5 years ago

Hey Sevi! I spent five hours working on this function until I realized I was answering a different question than was asked (number of papers and not citations). Anyways, after realizing this, I went to your code and dissected it to see how we could make it sexier. I believe it is very nice already. I modified it just a bit to make it a little shorter (15 instead of 18 lines), but it is basically the same:

def do_cit_count_year(data, sse, aut, year):
    cit_count_year, doi_year = {}, {}
    for item in sse.data:
        if year is not None and year <= int(item['year']):
            cit_count_year[item['year']] = 0
            if aut in item['authors']:
                doi_year[item['doi']] = item['year']
        elif year is None:
            cit_count_year[item['year']] = 0
            if aut in item['authors']:
                doi_year[item['doi']] = item['year']
    for index, row in data.iterrows():
        if index in doi_year:
            cit_count_year[doi_year[index]] = cit_count_year[doi_year[index]] + row['citation_number']
    return cit_count_year

I spent quite a long while trying many different edits to shorten it, but every time there was a problem with the zero-valued entries appearing.

One question I had was about your inclusion of data = data.drop(columns='known_reference') . Why did you feel dropping it was necessary? I don't see how this makes the code more efficient, but please do let me know if I am missing something.

Let me know if you want us to keep working on this or if I should add it to our final the_lads.py issue :)

delfimpandiani commented 5 years ago

Just not to lose this code, this is the one I wrote with the mistaken task at hand:

# this code works to see the number of papers authored in certain years
# def do_cit_count_year(data, sse, aut, year):
# #     l = []
# #     min = 2015 if year == None else year
# #     for i in sse.search(aut, 'authors', False, partial_res=None):
# #         if int(i['year']) >= min:
# #             if year != None and year > min: min = year
# #             l.append((i['year'], 1))
# #     a = range(min, 2020)
# #     cit_count_year = {str(year):0 for year in a}
# #     for i in l:
# #         if i[0] not in cit_count_year:
# #             cit_count_year[i[0]] = i[1]
# #         else:
# #             cit_count_year[i[0]] += 1
# #     return cit_count_year
delfimpandiani commented 5 years ago

Also, I tested your edited code with many of the top authors (according to sse.top_ten_authors()):

[Two screenshots of the test output, taken 2019-01-09]

And it seems to work perfectly!

SeverinJB commented 5 years ago

Hi @delfimpandiani, I am sorry to hear about your story but it was surely a learning experience nonetheless.

def do_cit_count_year(data, sse, aut, year):
    cit_count_year, doi_year = {}, {}
    for item in sse.data:
        if year is not None and year <= int(item['year']):
            cit_count_year[item['year']] = 0
            if aut in item['authors']:
                doi_year[item['doi']] = item['year']
        elif year is None:
            cit_count_year[item['year']] = 0
            if aut in item['authors']:
                doi_year[item['doi']] = item['year']
    for index, row in data.iterrows():
        if index in doi_year:
            cit_count_year[doi_year[index]] = cit_count_year[doi_year[index]] + row['citation_number']
    return cit_count_year

This is so sexy! I am in love. 💘 In principle, the code can be approved. So, let's say this version is approved, and if anybody adjusts something before the 17th, we can comment on this issue with the new code. By the way, I used many list comprehensions instead of standard loops not only because a one-liner is sexy but because they are, in specific cases, a tiny bit faster. Namely, a list comprehension is faster when it replaces a for loop that builds a list.
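
The speed claim can be checked locally with `timeit`. This is only a rough sketch: the function names and the sample list are made up, and the measured numbers depend on the machine and the Python version, so no particular result is promised here.

```python
import timeit

# Sample input: year strings, similar to the 'year' values in the metadata.
years = [str(y) for y in range(1900, 2020)]


def with_loop(values):
    """Build the list with an explicit for loop and append()."""
    result = []
    for v in values:
        result.append(int(v))
    return result


def with_comprehension(values):
    """Build the same list with a list comprehension."""
    return [int(v) for v in values]


# Both approaches must produce the same list.
assert with_loop(years) == with_comprehension(years)

loop_time = timeit.timeit(lambda: with_loop(years), number=2000)
comp_time = timeit.timeit(lambda: with_comprehension(years), number=2000)
print(f"for loop:      {loop_time:.4f} s")
print(f"comprehension: {comp_time:.4f} s")
```
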

> One question I had was about your inclusion of data = data.drop(columns='known_reference'). Why did you feel dropping it was necessary? I don't see how this makes the code more efficient, but please do let me know if I am missing something.

You're right. There is no immediate need for dropping it. I did so based on the assumption the code would be faster if our processed data object is smaller. However, I have not read anything about this topic and I have no evidence whether or not this is true.
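
For anyone who wants to test the assumption: a small sketch with made-up rows in the shape that `process_citation_data` produces. It shows that the totals read through `iterrows()` do not change when the column is dropped and prints the memory footprint of both frames. Note that `drop` returns a new DataFrame by default, so any saving has to outweigh the cost of that copy. Whether the loop actually gets faster would still need to be timed on the real data.

```python
import pandas as pd

# Made-up rows; the DOIs and reference strings are placeholders.
full = pd.DataFrame(
    {
        "citation_number": [5, 2, 9],
        "known_reference": ["10.1/a; 10.1/b", "10.1/c", ""],
    },
    index=pd.Index(["10.1/x", "10.1/y", "10.1/z"], name="DOI"),
)
slim = full.drop(columns="known_reference")

# The totals read through iterrows() are identical either way.
total_full = sum(row["citation_number"] for _, row in full.iterrows())
total_slim = sum(row["citation_number"] for _, row in slim.iterrows())
print(total_full, total_slim)

# Memory footprint of each frame in bytes (deep=True also counts the strings).
print(full.memory_usage(deep=True).sum(), slim.memory_usage(deep=True).sum())
```
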

delfimpandiani commented 5 years ago

Yay!! I am glad you find it sexy 👍 I added it to our the_lads.py. Good to know about the one-liners! I did not know that, but it makes a lot of sense. If I read anything related to efficiency and dropping columns, I'll note it here!

delfimpandiani commented 5 years ago
def do_cit_count_year(data, sse, aut, year):
    doi_year, clean_cit_count_year = {}, {}
    a = range(0, 2050)
    cit_count_year = {str(yr):0 for yr in a}

    for item in sse.data:
        if year is not None and year <= int(item['year']):
            if aut in item['authors']:
                doi_year[item['doi']] = item['year']
        elif year is None:
            if aut in item['authors']:
                doi_year[item['doi']] = item['year']

    for index, row in data.iterrows():
        if index in doi_year:
            cit_count_year[doi_year[index]] = cit_count_year[doi_year[index]] + row['citation_number']

    for key, value in cit_count_year.items():
        if year is not None:
            if int(key) >= year and int(key) < 2020:
                clean_cit_count_year[key] = value
        elif year is None:
            if value != 0:
                clean_cit_count_year[key] = value # first year with citations
                # ????????
                # how to include all the following years after the first one with a publication, regardless of if they have citations or not

    return clean_cit_count_year
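
One way to approach the open question in the code comment above, as a minimal sketch. It assumes integer keys and that the first publication year and the last year to report are passed in explicitly. `fill_year_range` and its parameters are hypothetical names, not part of the_lads.py.

```python
def fill_year_range(counts, first_year, last_year):
    """Return a count for every year from first_year to last_year inclusive.

    counts: dict mapping int years to citation totals (may have gaps).
    first_year: first year with a publication by the author.
    last_year: final year to include in the result.
    """
    # dict.get supplies 0 for years without any citations.
    return {yr: counts.get(yr, 0) for yr in range(first_year, last_year + 1)}


print(fill_year_range({2016: 4, 2018: 1}, 2015, 2019))
# {2015: 0, 2016: 4, 2017: 0, 2018: 1, 2019: 0}
```
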
SeverinJB commented 5 years ago

Looking goood! I wrote a similar one which needs to be cleaned as well.

def do_cit_count_year(data, sse, aut, year):
    doi_year = {}

    min_year = int(min([each_book['year'] for each_book in sse.data]))

    if year is None:
        cit_count_year = {str(item): 0 for item in range(min_year, date.datetime.today().year + 1)}
    elif year <= date.datetime.today().year:
        # <= so that year == current year is covered as well
        cit_count_year = {str(item): 0 for item in range(year, date.datetime.today().year + 1)}
    else:
        cit_count_year = {str(year): 0}

    for item in sse.data:
        if year is not None and year <= int(item['year']):
            cit_count_year[item['year']] = 0
            if aut in item['authors']:
                doi_year[item['doi']] = item['year']
        elif year is None:
            cit_count_year[item['year']] = 0
            if aut in item['authors']:
                doi_year[item['doi']] = item['year']

    for index, row in data.iterrows():
        if index in doi_year:
            cit_count_year[doi_year[index]] = cit_count_year[doi_year[index]] + row['citation_number']

    print(cit_count_year)
    return cit_count_year
delfimpandiani commented 5 years ago

2 main improvements:

  1. According to the project description, the returned dictionary has to be made of integers, not strings. [Screenshot of the project description, taken 2019-01-22]

    So I changed the code so that the returned dictionary's keys are integers (int) as opposed to strings (str).

  2. The second main improvement is getting rid of unnecessary loops and general clean-up. The only part that could use some more clean-up is the paragraph that creates the dictionary depending on the min value. I think it is good as it is, but maybe you'll think of something else.

def do_cit_count_year(data, sse, aut, year):
    doi_year = {}
    min_year = int(min([paper['year'] for paper in sse.data]))

    if year is None:
        cit_count_year = {item: 0 for item in range(min_year, date.datetime.today().year + 1)}
    elif year <= date.datetime.today().year:
        # <= so that year == current year is covered as well
        cit_count_year = {item: 0 for item in range(year, date.datetime.today().year + 1)}
    else:
        cit_count_year = {year: 0}

    for item in sse.data:
        if aut in item['authors']:
            if year is not None and year <= int(item['year']) or year is None:
                doi_year[item['doi']] = int(item['year'])

    for index, row in data.iterrows():
        if index in doi_year:
            cit_count_year[doi_year[index]] = cit_count_year[doi_year[index]] + row['citation_number']

    return cit_count_year

I have updated the_lads.py
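
On the first point, a tiny illustration of why the key type matters. The values are made up.

```python
string_keyed = {"2016": 7, "2017": 2}

# A lookup with an integer fails on string keys, because "2016" != 2016.
print(2016 in string_keyed)   # False

# Converting the keys once gives the integer-keyed dictionary.
int_keyed = {int(year): count for year, count in string_keyed.items()}
print(2016 in int_keyed)      # True
print(int_keyed)              # {2016: 7, 2017: 2}
```
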

SeverinJB commented 5 years ago

@delfimpandiani, Thank you so much for noticing this! I completely overlooked this. Sorry!

I'll check the code in detail later. 😀

SeverinJB commented 5 years ago

@delfimpandiani, I made adjustments only regarding the initial creation of the empty dictionary.

def do_cit_count_year(data, sse, aut, year):
    doi_year = {}
    min_year = min([int(paper['year']) for paper in sse.data])
    year_aut = max([int(paper['year']) for paper in sse.data if aut in paper['authors']])

    if year is None:
        max_year = max([year_aut, date.datetime.today().year])
        cit_count_year = {item: 0 for item in range(min_year, max_year + 1)}
    else:
        max_year = max([year_aut, date.datetime.today().year, year])
        cit_count_year = {item: 0 for item in range(year, max_year + 1)}

    for item in sse.data:
        if aut in item['authors']:
            if year is not None and year <= int(item['year']) or year is None:
                doi_year[item['doi']] = int(item['year'])

    for index, row in data.iterrows():
        if index in doi_year:
            cit_count_year[doi_year[index]] = cit_count_year[doi_year[index]] + row['citation_number']

    print(cit_count_year)
    return cit_count_year
delfimpandiani commented 5 years ago

The only change is adding "if aut in paper['authors']" to the definition of min_year.

def do_cit_count_year(data, sse, aut, year):
    doi_year = {}
    min_year = min([int(paper['year']) for paper in sse.data if aut in paper['authors']])
    year_aut = max([int(paper['year']) for paper in sse.data if aut in paper['authors']])

    if year is None:
        max_year = max([year_aut, date.datetime.today().year])
        cit_count_year = {item: 0 for item in range(min_year, max_year + 1)}
    else:
        max_year = max([year_aut, date.datetime.today().year, year])
        cit_count_year = {item: 0 for item in range(year, max_year + 1)}

    for item in sse.data:
        if aut in item['authors']:
            if year is not None and year <= int(item['year']) or year is None:
                doi_year[item['doi']] = int(item['year'])

    for index, row in data.iterrows():
        if index in doi_year:
            cit_count_year[doi_year[index]] = cit_count_year[doi_year[index]] + row['citation_number']

    return cit_count_year
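
A small illustration of what that filter changes. The records are made up and only mimic the shape used for `sse.data` in this thread; storing the authors as a single string is an assumption.

```python
# Made-up records; DOIs, years and names are placeholders.
papers = [
    {"doi": "10.1/a", "year": "2009", "authors": "Doe, Jane"},
    {"doi": "10.1/b", "year": "2014", "authors": "Roe, Richard; Doe, Jane"},
    {"doi": "10.1/c", "year": "2016", "authors": "Roe, Richard"},
]
aut = "Roe, Richard"

# Earliest year over all papers versus earliest year of this author only.
global_min = min(int(p["year"]) for p in papers)
author_min = min(int(p["year"]) for p in papers if aut in p["authors"])

print(global_min)  # 2009
print(author_min)  # 2014
```

Without the filter, the result for this author would start with five empty years that have nothing to do with their publications.
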
SeverinJB commented 5 years ago
def do_cit_count_year(data, sse, aut, year):
    doi_year = {}
    years_aut = [int(paper['year']) for paper in sse.data if aut in paper['authors']]
    min_years_aut = min(years_aut)
    max_years_aut = max(years_aut)

    if year is None:
        max_year = max([max_years_aut, date.datetime.today().year])
        cit_count_year = {item: 0 for item in range(min_years_aut, max_year + 1)}
    else:
        max_year = max([max_years_aut, date.datetime.today().year, year])
        cit_count_year = {item: 0 for item in range(year, max_year + 1)}

    for item in sse.data:
        if aut in item['authors']:
            if year is not None and year <= int(item['year']) or year is None:
                doi_year[item['doi']] = int(item['year'])

    for index, row in data.iterrows():
        if index in doi_year:
            cit_count_year[doi_year[index]] = cit_count_year[doi_year[index]] + row['citation_number']

    return cit_count_year
SeverinJB commented 5 years ago

It's funny how a lot can be refactored quite neatly. What a pity I didn't have the energy before. 😢 But I don't know whether it really makes a difference in computation time.

def do_cit_count_year(data, sse, aut, year):
    doi_dict = {paper['doi']: int(paper['year']) for paper in sse.data if aut in paper['authors']}
    min_year = min(doi_dict.values()) if year is None else year
    max_year = max(date.datetime.today().year, max(doi_dict.values()), min_year)
    cit_dict = {item: 0 for item in range(min_year, max_year + 1)}

    for index, row in data.iterrows():
        if index in doi_dict and doi_dict[index] in cit_dict:
            cit_dict[doi_dict[index]] = cit_dict[doi_dict[index]] + row['citation_number']

    return cit_dict
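
One edge case worth noting: `min()` and `max()` raise a `ValueError` on an empty sequence, so the version above fails for an author without any papers in the metadata. Below is a sketch of a guarded variant that uses plain dictionaries instead of the DataFrame and the engine object so that it runs on its own. The name `cit_count_year_sketch`, the `current_year` parameter and the choice to return an empty dictionary are assumptions for illustration. The versions in this thread call `date.datetime.today()`, which presumably relies on an `import datetime as date` that is not shown here.

```python
import datetime


def cit_count_year_sketch(citations, papers, aut, year=None, current_year=None):
    """Plain-dict variant of the refactored function with an empty-author guard.

    citations: dict mapping DOI -> citation count (stands in for the DataFrame).
    papers: list of dicts with 'doi', 'year', 'authors' (stands in for sse.data).
    current_year: hypothetical parameter so the result is reproducible in tests.
    """
    if current_year is None:
        current_year = datetime.date.today().year

    doi_dict = {p["doi"]: int(p["year"]) for p in papers if aut in p["authors"]}
    if not doi_dict:
        # Nothing to count for an author without papers.
        return {}

    min_year = min(doi_dict.values()) if year is None else year
    max_year = max(current_year, max(doi_dict.values()), min_year)
    cit_dict = {y: 0 for y in range(min_year, max_year + 1)}

    for doi, pub_year in doi_dict.items():
        if pub_year in cit_dict:
            cit_dict[pub_year] += citations.get(doi, 0)
    return cit_dict


papers = [
    {"doi": "10.1/a", "year": "2015", "authors": "Roe, Richard"},
    {"doi": "10.1/b", "year": "2017", "authors": "Roe, Richard; Doe, Jane"},
]
citations = {"10.1/a": 4, "10.1/b": 1}

print(cit_count_year_sketch(citations, papers, "Roe, Richard", current_year=2018))
# {2015: 4, 2016: 0, 2017: 1, 2018: 0}
print(cit_count_year_sketch(citations, papers, "Nobody", current_year=2018))
# {}
```
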
delfimpandiani commented 5 years ago

> It's funny how a lot can be refactored quite neatly. What a pity I didn't have the energy before. 😢 But I don't know whether it really makes a difference in computation time.

def do_cit_count_year(data, sse, aut, year):
    doi_dict = {paper['doi']: int(paper['year']) for paper in sse.data if aut in paper['authors']}
    min_year = min(doi_dict.values()) if year is None else year
    max_year = max(date.datetime.today().year, max(doi_dict.values()), min_year)
    cit_dict = {item: 0 for item in range(min_year, max_year + 1)}

    for index, row in data.iterrows():
        if index in doi_dict and doi_dict[index] in cit_dict:
            cit_dict[doi_dict[index]] = cit_dict[doi_dict[index]] + row['citation_number']

    return cit_dict

just looked at this in detail and it is SO beautiful.