culibraries / ir-scholar

CU Scholar - Institutional Repository Hyrax

Data Load - Abstracts from Bepress #18

Closed mbstacy closed 4 years ago

mbstacy commented 4 years ago

Convert special characters and the HTML tags <sup> and <sub> (<b>, <u>, <i>, and <em> are ignored).
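A minimal sketch of the tag handling described above. The issue does not say what <sup> and <sub> are converted to, or what "ignored" means in practice, so this assumes that superscripts and subscripts become Unicode super/subscript characters where one exists, and that <b>, <u>, <i>, and <em> are dropped while their text is kept. The function name `convert_tags` and the mapping tables are hypothetical, not taken from the actual loader.

```python
import re

# Assumption: only characters with a Unicode super/subscript form are mapped;
# anything else inside the tag is passed through unchanged.
SUP = str.maketrans("0123456789+-=()n", "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ")
SUB = str.maketrans("0123456789+-=()", "₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎")

SUP_RE = re.compile(r"<sup>(.*?)</sup>", re.IGNORECASE | re.DOTALL)
SUB_RE = re.compile(r"<sub>(.*?)</sub>", re.IGNORECASE | re.DOTALL)
# Assumption: "ignored" means the tag is removed and its text is kept.
IGNORED_RE = re.compile(r"</?(?:b|u|i|em)>", re.IGNORECASE)


def convert_tags(abstract: str) -> str:
    """Drop the ignored tags, then rewrite <sup>/<sub> content as Unicode."""
    text = IGNORED_RE.sub("", abstract)
    text = SUP_RE.sub(lambda m: m.group(1).translate(SUP), text)
    text = SUB_RE.sub(lambda m: m.group(1).translate(SUB), text)
    return text
```

For example, `convert_tags("H<sub>2</sub>O")` yields `H₂O`. Letters other than `n` have no mapping in this sketch, so `10<sup>th</sup>` comes out as plain `10th`.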

See below for information received from Katie regarding Abstracts.

Here are links to the lists we used most commonly to replace characters in the abstracts:

Greek/Coptic: https://www.w3schools.com/charsets/ref_utf_greek.asp
Mathematical Operators: https://www.w3schools.com/charsets/ref_utf_math.asp
Arrows: https://www.w3schools.com/charsets/ref_utf_arrows.asp
Latin Supplement: https://www.w3schools.com/charsets/ref_utf_latin1_supplement.asp (mostly punctuation marks)
Diacritical Marks: https://www.w3schools.com/charsets/ref_utf_diacritical.asp (transliteration of foreign languages in papers where the English keyboard defaults were insufficient)

Not many, but might be a couple here or there:

Latin Extended A: https://www.w3schools.com/charsets/ref_utf_latin_extended_a.asp
Latin Extended B: https://www.w3schools.com/charsets/ref_utf_latin_extended_b.asp

I don’t think we ever used &nbsp;, but it might not be a bad idea to throw it in there just in case.

If it doesn’t have an entry under the “Entity” column, we didn’t do any additional workarounds to get the letter or symbol. All replacements we made for symbols were formatted as entities, e.g. &delta; for δ.
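A minimal sketch of the character conversion, assuming the replacements in the abstracts are stored as HTML character references (for example `&delta;` for δ). It uses `html.unescape` from the Python standard library, which decodes named, decimal, and hexadecimal references in one pass. The function name `decode_entities` is hypothetical, and this is not necessarily how the actual load was done.

```python
import html


def decode_entities(abstract: str) -> str:
    """Replace HTML character references with the Unicode characters they name."""
    return html.unescape(abstract)
```

With this sketch, `decode_entities("&delta;")` returns `δ`, and a non-breaking space reference comes back as the character U+00A0, so it would be covered without a separate rule.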

As far as HTML goes, stuff we used that I can think of right off would be:

We’ve learned the hard way a few times that if you import a CSV or Excel file into Excel with any of these encoded characters, they blow up and transform into unintelligible sequences of symbols. If you need to review the data and then save the spreadsheet, we recommend using Google Sheets. There are several collections in LUNA that we can’t touch in Excel because of the problems with diacritics getting messed up. Don’t know if this was on your radar, but figured it might be worth mentioning.

Thanks!
Katie Fletcher
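If a CSV does have to be opened in Excel, one common workaround (not something described in this thread) is to write the file as UTF-8 with a byte order mark, which Excel generally uses to detect the encoding. The sketch below assumes the abstracts are already Unicode strings. The helper names `write_csv_for_review` and `read_csv` are hypothetical.

```python
import csv


def write_csv_for_review(path, rows):
    """Write rows as UTF-8 with a BOM ("utf-8-sig") so Excel can detect the encoding."""
    with open(path, "w", encoding="utf-8-sig", newline="") as handle:
        csv.writer(handle).writerows(rows)


def read_csv(path):
    """Read the file back; "utf-8-sig" strips the BOM if one is present."""
    with open(path, encoding="utf-8-sig", newline="") as handle:
        return list(csv.reader(handle))
```

This only addresses how the file is decoded on open. It does not guarantee that saving from Excel preserves every character, so reviewing in Google Sheets as recommended above remains the safer route.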

mbstacy commented 4 years ago

Slight problems with superscripts, but data loading and metadata loading have started. Complete.