Open olexandr-konovalov opened 3 years ago
So far I have created functions and isolated Version and Website in separate columns. I also created extra columns: Length of citation; Delay, which is the result of the citation year minus the cited GAP release year; and Accuracy Score from 0 to 3. One point is awarded for
Update:
I found that the MSC code is contained in the mrclass key in the BIB file, so I added that column to the dataset. Now I need to find or create a dictionary that maps each code to the corresponding field of science, so that I can carry out the analysis, create some visualisations, and draw some conclusions.
I think I can use one of these for the dictionary: https://cran.r-project.org/web/classifications/MSC.html or https://zbmath.org/static/msc2020.pdf
overall citation length being greater than 95 characters
Curious, why 95? Do you do any cleaning to remove accidental markup before counting the length?
For the canonical reference, use https://mathscinet.ams.org/msc/pdfs/classifications2020.pdf or https://zbmath.org/static/msc2020.pdf. If you need top-level categories in plain text, to save you from PDF anomalies, reuse https://github.com/gap-system/GapWWW/blob/master/Doc/Bib/MSC2010.g (careful: that file covers MSC2010, not MSC2020). I suggest not diving into subcategories.
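To illustrate the top-level approach suggested here, the following is a minimal sketch that maps the primary code in an mrclass value to its two-digit MSC area. The dictionary `MSC_TOP_LEVEL` holds only a handful of entries for illustration, the helper name `primary_area` and the column name `Field` are hypothetical, and the sample values assume the usual MathSciNet layout of a primary code followed by optional secondary codes in parentheses.

```python
import re
import pandas as pd

# Illustrative subset of MSC top-level areas (assumption: extend this
# from the canonical classification before running a real analysis).
MSC_TOP_LEVEL = {
    "05": "Combinatorics",
    "11": "Number theory",
    "16": "Associative rings and algebras",
    "20": "Group theory and generalizations",
    "68": "Computer science",
}

def primary_area(mrclass):
    """Return the top-level area for the first (primary) MSC code."""
    if not isinstance(mrclass, str):
        return None  # missing mrclass key in the BIB entry
    match = re.match(r"\s*(\d{2})", mrclass)
    if match is None:
        return None  # value does not start with a two-digit code
    return MSC_TOP_LEVEL.get(match.group(1), "Unknown")

# Hypothetical sample values, not taken from the real dataset.
df = pd.DataFrame({"mrclass": ["20D15 (20F05)", "05C25", "68W30 (20B40)"]})
df["Field"] = df["mrclass"].apply(primary_area)
```

Only the first two digits are used, so secondary codes and subcategories are ignored, which keeps the number of groups small enough to plot.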
Yes, I did some extra cleaning. First I looked at the raw data and made notes of symbols that we do not want there, such as \, $, {, and }. Then I compiled regex expressions. Here is a snapshot from the notebook.
We use Regex to further purify the Citation column, removing some remaining special characters.
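# regex=True is needed on pandas 2.0+, where str.replace treats the pattern as a literal string by default
merged_df['Citation'] = merged_df['Citation'].str.replace(r'[\\\$\{\}\^]', '', regex=True)
merged_df['Citation'] = merged_df['Citation'].str.replace(r'(ssf)', '', regex=True)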
After that I scrolled through the data again and citations looked fine, no alien characters to be seen. Length was calculated right after this step.
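The clean-then-measure order described here can be sketched as follows. This is an assumption-laden illustration, not the notebook's actual code: the sample citations are invented, and the column names `Length` and `LongEnough` are hypothetical. The 95-character threshold is the one mentioned earlier in the thread.

```python
import pandas as pd

# Hypothetical raw citations containing leftover LaTeX markup.
merged_df = pd.DataFrame({"Citation": [
    r"The GAP~Group, \emph{GAP -- Groups, Algorithms, and Programming}, Version 4.11.1, 2021",
    r"GAP $4.4$",
]})

# Strip backslashes, dollar signs, braces and carets in one pass.
# regex=True is explicit because pandas 2.0+ defaults to literal matching.
merged_df["Citation"] = merged_df["Citation"].str.replace(
    r"[\\${}^]", "", regex=True
)

# Measure the length only after cleaning, so markup does not inflate it.
merged_df["Length"] = merged_df["Citation"].str.len()
merged_df["LongEnough"] = merged_df["Length"] > 95
```

Note that removing only the backslash leaves command names such as emph in the text, so a follow-up pattern may be needed before the length is meaningful.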
How can we parse a GAP citation to break it into parts, so that we can see which components of the citation metadata are present?
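One possible starting point, offered as a rough sketch rather than a complete parser, is to look for each component with its own regular expression and record which ones are found. The function name `parse_citation`, the dictionary keys, and the sample string are all assumptions for illustration. Real citations vary widely, so the patterns would need tuning against the dataset.

```python
import re

# Each pattern targets one component of a citation.
VERSION_RE = re.compile(r"[Vv]ersion\s+(\d+(?:\.\d+)*)")
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")
URL_RE = re.compile(r"https?://[^\s)}\]]+")

def parse_citation(text):
    """Return the components found in a citation string (None if absent)."""
    version = VERSION_RE.search(text)
    year = YEAR_RE.search(text)
    url = URL_RE.search(text)
    return {
        "version": version.group(1) if version else None,
        "year": int(year.group(0)) if year else None,
        # Trim punctuation that often trails a URL at the end of a sentence.
        "website": url.group(0).rstrip(".,;") if url else None,
        "mentions_gap_group": "GAP Group" in text,
    }

# Hypothetical citation string used only to exercise the patterns.
sample = ("The GAP Group, GAP -- Groups, Algorithms, and Programming, "
          "Version 4.11.1; 2021. (https://www.gap-system.org)")
parts = parse_citation(sample)
```

Counting the non-empty values in the returned dictionary gives a simple measure of how complete a citation is, and each key can be written to its own column.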