Parsing GAP citations - Githubissues

fliqper / GAP-Citations-Analyzator

GAP Citations project


Parsing GAP citations #8

Open olexandr-konovalov opened 3 years ago

olexandr-konovalov commented 3 years ago

How can we parse a GAP citation to break it into parts, so that we can see which components of the citation metadata are present?

fliqper commented 3 years ago

So far I have created functions that isolate Version and Website into separate columns. I also created these extra columns:

Length of the citation.
Delay, which is the citation year minus the release year of the cited GAP version.
Accuracy Score from 0 to 3. One point is awarded for each of:
- overall citation length being greater than 95 characters
- website reference (regardless of whether it is a package or pure GAP citation)
- version provided (regardless of whether it is a package or pure GAP citation)
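
A minimal sketch of how these columns could be derived for a single citation string, using only the standard library. The two regular expressions, the function name `score_citation`, and the sample citation are illustrative assumptions, not the code from the notebook; only the 95-character threshold and the three scoring criteria come from the description above.

```python
import re

# Hypothetical patterns; the notebook's actual expressions may differ.
VERSION_RE = re.compile(r'[Vv]ersion\s+(\d+(?:\.\d+)*)')
WEBSITE_RE = re.compile(r'(?:https?://|www\.)[^\s,;)}]+')


def score_citation(citation, citation_year, release_year):
    """Break one citation into the parts described above and score it."""
    version = VERSION_RE.search(citation)
    website = WEBSITE_RE.search(citation)
    length = len(citation)
    # One point each for: length > 95, website present, version present.
    score = int(length > 95) + int(website is not None) + int(version is not None)
    return {
        'Version': version.group(1) if version else None,
        'Website': website.group(0) if website else None,
        'Length': length,
        'Delay': citation_year - release_year,
        'Accuracy Score': score,
    }


# Made-up sample citation and years, for illustration only.
sample = ('The GAP Group, GAP -- Groups, Algorithms, and Programming, '
          'Version 4.11.1; 2021, https://www.gap-system.org')
result = score_citation(sample, citation_year=2022, release_year=2021)
```

Returning `None` for a missing part keeps "not present" distinct from an empty string, which makes it easier to see which components of the metadata are absent.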

fliqper commented 3 years ago

Update: I found that the MSC code is contained in the mrclass key in the BIB file, so I added that column to the dataset. Now I need to find or create a dictionary that maps each code to the corresponding field of science, so that I can carry out the analysis, create some visualisations, and draw some conclusions.

fliqper commented 3 years ago

I think I can use one of these for the dictionary: https://cran.r-project.org/web/classifications/MSC.html or https://zbmath.org/static/msc2020.pdf

olexandr-konovalov commented 3 years ago

overall citation length being greater than 95 characters

Curious, why 95? Do you do any cleaning to remove accidental markup before counting the length?

olexandr-konovalov commented 3 years ago

For the canonical reference, use https://mathscinet.ams.org/msc/pdfs/classifications2020.pdf or https://zbmath.org/static/msc2020.pdf. If you need the top-level categories in plain text, to save you from PDF anomalies, reuse https://github.com/gap-system/GapWWW/blob/master/Doc/Bib/MSC2010.g (careful: it covers MSC2010, not MSC2020). I suggest not diving into subcategories.
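
A minimal sketch of a top-level lookup, assuming the mrclass value starts with the primary code (for example `20D15 (20-04)`) and that its first two digits identify the top-level area. The dictionary here is only a small illustrative subset, and the names `MSC_TOP_LEVEL` and `top_level_area` are hypothetical; a complete table would be built from one of the references above.

```python
import re

# Illustrative subset of top-level MSC areas, not the full classification.
MSC_TOP_LEVEL = {
    '05': 'Combinatorics',
    '11': 'Number theory',
    '16': 'Associative rings and algebras',
    '20': 'Group theory and generalizations',
    '68': 'Computer science',
}


def top_level_area(mrclass):
    """Return the top-level area for the primary code in an mrclass value."""
    # Assumption: the primary code comes first and begins with two digits.
    match = re.match(r'\s*(\d{2})', mrclass or '')
    if match is None:
        return None  # missing or malformed mrclass
    return MSC_TOP_LEVEL.get(match.group(1), 'Unknown')


area = top_level_area('20D15 (20-04)')
```

Keying on the two-digit prefix alone ignores secondary codes in parentheses and all subcategories, which matches the suggestion to stay at the top level.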

fliqper commented 3 years ago


Yes, I did some extra cleaning. First I looked at the raw data and made notes of symbols that we do not want there, such as \, $, {, }. Then I compiled regular expressions. Here is a snapshot from the notebook.

We use Regex to further purify the Citation column, removing some remaining special characters.
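# regex=True is needed explicitly: since pandas 2.0, str.replace matches literally by default.
merged_df['Citation'] = merged_df['Citation'].str.replace(r'[\\\$\{\}\^]', '', regex=True)
# 'ssf' is a plain literal, so no regular expression is needed here.
merged_df['Citation'] = merged_df['Citation'].str.replace('ssf', '', regex=False)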

After that I scrolled through the data again and the citations looked fine, with no alien characters to be seen. Length was calculated right after this step.
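
For readers without the notebook, here is a rough single-string equivalent of the two cleaning steps, showing how much leftover markup can inflate the measured length. The helper name `clean_citation` and the sample string are assumptions made for illustration; the sample imitates the kind of markup described above.

```python
import re

# Same character set as the notebook snippet: \ $ { } ^
MARKUP_RE = re.compile(r'[\\${}^]')


def clean_citation(text):
    """Strip the markup characters, then the leftover 'ssf' token."""
    text = MARKUP_RE.sub('', text)
    return text.replace('ssf', '')


# Hypothetical raw value, for illustration only.
raw = r'{\ssf GAP} -- {G}roups, {A}lgorithms, and {P}rogramming'
cleaned = clean_citation(raw)
saved = len(raw) - len(cleaned)  # characters that would have counted towards Length
```

One caveat with this approach: removing `ssf` as a plain substring also alters ordinary words that contain those letters, such as "successful", so removing the macro before stripping the backslash may be a safer order.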