cern-sis / issues-scoap3

cleanup data for all publishers #24

Closed drjova closed 2 years ago

drjova commented 2 years ago

In this step we should clean up the data after parsing the publishers' records. The cleanup for each field is described here.

At the end of this step we should check that the JSON still validates against the SCOAP3 schema.

sources: loader.py, pipeline.py

Final one:

ErnestaP commented 2 years ago

(task moved to the first comment)

ErnestaP commented 2 years ago

There are no steps that have to be joined for cleaning. In the end, cleaning had just two steps left: remove "for the" from the collaboration string, and remove white spaces from abstracts, title, subtitle, and free keywords. These steps can be done (and are done now) in generic parsing, taking Springer as the initial source. This might change later.

For Springer, the output from its parser is a string without tags, so no additional cleaning is needed. For other publishers, however, we might get a string with tags. For example, IOP has LaTeX expressions for abstracts in MathML tags, and OUP and Elsevier have them in CDATA tags. We might have to add more cleaning functions to generic parsing later to handle these.
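
A minimal sketch of the two remaining cleaning steps, for illustration only. It is not the code from the repository: the function names and the record keys (`abstract`, `title`, `subtitle`, `free_keywords`, `collaboration`) are hypothetical. It also assumes that "remove white spaces" means trimming and collapsing runs of whitespace, and that "for the" is only removed at the start of the collaboration string.

```python
import re

# Hypothetical field names; the real SCOAP3 record keys may differ.
WHITESPACE_FIELDS = ("abstract", "title", "subtitle")


def clean_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return " ".join(text.split())


def clean_collaboration(text):
    """Drop a leading 'for the' (case-insensitive) from a collaboration string."""
    without_prefix = re.sub(r"^\s*for the\s+", "", text, flags=re.IGNORECASE)
    return clean_whitespace(without_prefix)


def clean_record(record):
    """Return a cleaned copy of a parsed record; the input is left untouched."""
    cleaned = dict(record)
    for field in WHITESPACE_FIELDS:
        if isinstance(cleaned.get(field), str):
            cleaned[field] = clean_whitespace(cleaned[field])
    if isinstance(cleaned.get("free_keywords"), list):
        cleaned["free_keywords"] = [
            clean_whitespace(keyword) for keyword in cleaned["free_keywords"]
        ]
    if isinstance(cleaned.get("collaboration"), str):
        cleaned["collaboration"] = clean_collaboration(cleaned["collaboration"])
    return cleaned
```

Tag stripping (MathML, CDATA) is deliberately left out here, since it would be publisher-specific and is not needed for Springer.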