Right now, the following steps are taken to generate the dataset:
- Users manually download collections of entries from Scopus, Web of Science, SciELO, etc., saving them as .bib files.
- The .bib files are opened and read with a Python library.
- Some fields in each .bib are renamed, since the different bibliography portals use different names for fields that mean the same thing.
- The entries are joined into one CSV file (allData.csv).
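The renaming-and-joining step could be sketched roughly as below. The field mapping and function names are hypothetical (the real mapping depends on what each portal exports), so treat this as an illustration, not the actual pipeline code:

```python
import csv

# Hypothetical mapping from portal-specific field names to canonical ones.
FIELD_MAP = {
    "journal-title": "journal",    # assumed portal-specific name
    "document_type": "ENTRYTYPE",  # assumed portal-specific name
}

def normalize_entry(entry):
    """Rename portal-specific keys to the canonical names."""
    return {FIELD_MAP.get(key, key): value for key, value in entry.items()}

def join_to_csv(entry_lists, out_path="allData.csv"):
    """Concatenate entries from several portals into one CSV file."""
    entries = [normalize_entry(e) for lst in entry_lists for e in lst]
    # Take the union of all field names so no portal's extra fields are lost.
    fieldnames = sorted({k for e in entries for k in e})
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(entries)
    return entries
```

Entries missing a field simply get an empty cell in the CSV, which keeps the join lossless across portals.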
This file is read in R and cleaned up, because:
- Different bibliography portals format values differently; some use double quotes for things that others mark with single quotes.
- Typos can make two entries with the same title be recognized as different entries.
- Some entries have the title in two languages (one of them always English), and the second-language title can make two identical entries be recognized as different ones (this still needs double-checking).
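The cleanup itself is done in R; as an illustration only, here is a Python sketch of the kind of title normalization these problems call for. The exact rules (quote unification, accent stripping, punctuation removal) are assumptions, not the pipeline's actual logic:

```python
import re
import unicodedata

def normalize_title(title):
    """Build a comparison key so that quote-style differences, case,
    and small punctuation/spacing typos do not split identical entries."""
    # Unify curly and straight quotes to a single character.
    title = title.replace("\u201c", '"').replace("\u201d", '"')
    title = title.replace("\u2018", "'").replace("\u2019", "'")
    title = title.replace("'", '"')
    # Strip accents so minor transliteration differences compare equal.
    title = unicodedata.normalize("NFKD", title)
    title = "".join(c for c in title if not unicodedata.combining(c))
    # Drop punctuation, collapse whitespace, lowercase.
    title = re.sub(r"[^\w\s]", "", title)
    return re.sub(r"\s+", " ", title).strip().lower()
```

A key like this is only for matching; the original titles would be kept for the final dataset.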
Now we can get all the unique entries in allData.csv, but there is still some work to do:
[ ] find rows with the same title
[ ] check whether they have the same "type" (e.g. whether both are "article", "book", etc.)
[ ] if they have the same "type", merge their information, since different databases store different details
But for the moment we could simply drop one of the duplicate entries.
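The deduplication steps above could be sketched as follows. The merge policy is an assumption (keep the first entry's value, fill gaps from the duplicate), and the `ENTRYTYPE` field name is illustrative:

```python
def dedupe(entries):
    """Group entries by (title, type); merge duplicates of the same type,
    but keep same-title entries of different types separate."""
    merged = {}
    for entry in entries:
        key = (entry.get("title", "").strip().lower(), entry.get("ENTRYTYPE"))
        if key in merged:
            # Same title and type: fill in fields missing so far, since
            # different databases carry different subsets of information.
            for field, value in entry.items():
                merged[key].setdefault(field, value)
        else:
            merged[key] = dict(entry)
    return list(merged.values())
```

In a real run the grouping key would use a normalized title rather than a plain lowercase one, so typos and quote differences do not defeat the match.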