JabRef / abbrv.jabref.org

A repository of abbreviations for references, e.g., for conferences, journals, institutes, etc.
https://abbrv.jabref.org
Creative Commons Zero v1.0 Universal

Add quality checker #149

Open koppor opened 1 year ago

koppor commented 1 year ago

I needed to fix some lists because they contained wrong entries. See https://github.com/JabRef/abbrv.jabref.org/pull/148

We should have a checker. It should check for the following issues:

ERROR: Wrong escape

"Zeszyty Naukowe Wy\","Problemy Mat."
"Journal of Evolutionary Biochemistry and Physiology\","J. Evol. Biochem. Physiol."

ERROR: Wrong beginning letters

"Zeszyty Naukowe Wy\","Problemy Mat."

(This is https://github.com/JabRef/abbrv.jabref.org/issues/107)
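
The two ERROR checks above could be sketched roughly as follows. This is only a minimal sketch under assumptions the issue does not spell out: "wrong escape" is taken to mean a backslash directly before a double quote in the raw line, and "wrong beginning letters" is taken to mean that the first letter of the abbreviation differs from the first letter of the full name. The function names are hypothetical.

```python
import re

# Assumption: a backslash right before a double quote is a "wrong escape".
BAD_ESCAPE = re.compile(r'\\"')


def has_wrong_escape(raw_line):
    """Return True if the raw CSV line contains a backslash followed by a quote."""
    return bool(BAD_ESCAPE.search(raw_line))


def starts_with_same_letter(full, abbrev):
    """Return True if full name and abbreviation begin with the same letter (case-insensitive)."""
    full, abbrev = full.strip(), abbrev.strip()
    if not full or not abbrev:
        return False
    return full[0].casefold() == abbrev[0].casefold()


# The example row from the issue, written as a Python string literal.
raw = '"Zeszyty Naukowe Wy\\","Problemy Mat."'
if has_wrong_escape(raw):
    print(f"ERROR: Wrong escape: {raw}")
if not starts_with_same_letter("Zeszyty Naukowe Wy", "Problemy Mat."):
    print("ERROR: Wrong beginning letters: Zeszyty Naukowe Wy / Problemy Mat.")
```

A first-letter comparison this strict is cheap but will also flag legitimate entries whose abbreviation is derived from a different title, so it is probably better treated as a starting point than as a final rule.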

ERROR: List contains non-UTF8 characters

This is https://github.com/JabRef/abbrv.jabref.org/issues/125.
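
A non-UTF-8 check could work on the raw bytes of a list and try to decode each line. The sketch below assumes the lists are newline-separated; the function name and the sample data are made up for illustration.

```python
def find_non_utf8_lines(data: bytes):
    """Return the 1-based numbers of lines that are not valid UTF-8."""
    bad = []
    # Splitting on b"\n" is safe: byte 0x0A never occurs inside a multi-byte UTF-8 sequence.
    for number, raw in enumerate(data.split(b"\n"), start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(number)
    return bad


# Hypothetical sample: line 1 is valid UTF-8, line 2 contains a Latin-1 encoded "é".
sample = "Universităţii din Timișoara\n".encode("utf-8") + b"Caf\xe9 Journal\n"
for number in find_non_utf8_lines(sample):
    print(f"ERROR: List contains non-UTF8 characters (line {number})")
```

For a real file, the bytes would come from opening it in binary mode, e.g. `open(path, "rb").read()`.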

WARN: Double entries

"Advances in Applied Mathematics","Adv. Appl. Math."
"Advances in Applied Mathematics","Adv. in Appl. Math."

(This refs https://github.com/JabRef/abbrv.jabref.org/issues/77)

WARN: Same full form appearing twice

"Advances in Applied Mathematics","Adv. Appl. Math."
"Advances in Applied Mathematics","Adv. in Appl. Math."

(This refs https://github.com/JabRef/abbrv.jabref.org/issues/77)

WARN: Same abbreviation appearing twice

"Advances in Data Analysis and Classification. ADAC","Adv. Data Anal. Classif."
"Advances in Data Analysis and Classification. ADAC. Theory, Methods, and Applications in Data Science","Adv. Data Anal. Classif."

(This refs https://github.com/JabRef/abbrv.jabref.org/issues/77)
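
The duplicate warnings above might be detected by grouping rows by full name and by abbreviation. The following is a sketch only; it assumes each row has the form `"full name","abbreviation"` and uses hypothetical function names.

```python
import csv
import io
from collections import defaultdict


def find_duplicates(rows):
    """Return (same_full, same_abbrev) dicts for (full, abbrev) pairs.

    same_full maps a full name to its differing abbreviations;
    same_abbrev maps an abbreviation to its differing full names.
    """
    by_full = defaultdict(set)
    by_abbrev = defaultdict(set)
    for full, abbrev in rows:
        by_full[full].add(abbrev)
        by_abbrev[abbrev].add(full)
    same_full = {f: sorted(a) for f, a in by_full.items() if len(a) > 1}
    same_abbrev = {a: sorted(f) for a, f in by_abbrev.items() if len(f) > 1}
    return same_full, same_abbrev


# The example rows from the issue.
text = '''"Advances in Applied Mathematics","Adv. Appl. Math."
"Advances in Applied Mathematics","Adv. in Appl. Math."
"Advances in Data Analysis and Classification. ADAC","Adv. Data Anal. Classif."
"Advances in Data Analysis and Classification. ADAC. Theory, Methods, and Applications in Data Science","Adv. Data Anal. Classif."
'''
rows = [tuple(r[:2]) for r in csv.reader(io.StringIO(text)) if len(r) >= 2]
same_full, same_abbrev = find_duplicates(rows)
for full, abbrevs in same_full.items():
    print(f"WARN: Same full form appearing twice: {full} -> {abbrevs}")
for abbrev, fulls in same_abbrev.items():
    print(f"WARN: Same abbreviation appearing twice: {abbrev} -> {fulls}")
```

Rows that are identical in both columns collapse into one set element here, so exact repeats would need a separate count (for example with `collections.Counter`) if they should be reported too.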

WARN: abbreviation is the same as the full text

"Quantum","Quantum"

WARN: Management is abbreviated with outdated "Manage." instead of "Manag."

This is https://github.com/JabRef/abbrv.jabref.org/issues/78
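
A check for the outdated form could be a simple pattern match on the abbreviation column. This is a sketch; the function name and the sample abbreviations are illustrative and not taken from the lists.

```python
import re

# Matches the word "Manage." (with the trailing dot) but not "Manag." or "Management".
OUTDATED_MANAGE = re.compile(r"\bManage\.")


def uses_outdated_manage(abbrev):
    """Return True if the abbreviation contains the outdated form "Manage."."""
    return bool(OUTDATED_MANAGE.search(abbrev))


# Illustrative abbreviations only.
for abbrev in ["J. Environ. Manage.", "J. Environ. Manag."]:
    if uses_outdated_manage(abbrev):
        print(f'WARN: outdated "Manage." in: {abbrev}')
```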

northword commented 1 year ago

WARN: abbreviation is the same as the full text

When a journal name is only one word, its abbreviation is the same as the full name.
E.g., full name: Fuel; its abbreviation is also Fuel.

philcaz commented 1 month ago

Hi, I would like to tackle this issue with my group : )

koppor commented 1 month ago

@northword I think the expected result is a Python tool residing in https://github.com/JabRef/abbrv.jabref.org/tree/main/scripts. It should print out issues and exit with a failure code if issues are found. You can choose another programming language if you want.
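
The overall shape of such a tool might look like the sketch below: collect issues, print them, and return a non-zero code if any were found. The function names and the two placeholder checks (too few fields, empty abbreviation) are assumptions for illustration; the real checks from this issue would be plugged in instead.

```python
import csv
import io


def check_rows(rows):
    """Yield (level, message) tuples for a sequence of parsed CSV rows."""
    for number, row in enumerate(rows, start=1):
        if len(row) < 2:
            yield ("ERROR", f"line {number}: expected full name and abbreviation")
            continue
        if not row[1].strip():
            yield ("ERROR", f"line {number}: empty abbreviation for {row[0]!r}")


def run(stream):
    """Print all issues found in the CSV stream and return a process exit code."""
    issues = list(check_rows(csv.reader(stream)))
    for level, message in issues:
        print(f"{level}: {message}")
    return 1 if issues else 0


# In-memory demo; the second row is deliberately broken.
exit_code = run(io.StringIO('"Fuel","Fuel"\n"Quantum"\n'))
print("exit code:", exit_code)
```

In a script, the return value would typically be passed to `sys.exit`, e.g. `sys.exit(run(open(path, encoding="utf-8", newline="")))`, so that CI marks the run as failed.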

Example output of lychee, which has another purpose, but also outputs check results:

[Image: screenshot of lychee output]

(Source: https://github.com/JabRef/jabref/actions/runs/11361716475)

philcaz commented 1 month ago

Hey, when implementing the check logic for 'WARN: abbreviation is the same as the full text,' should we only give a warning if the journal's name has more than one word and the abbreviation is the same as its full name? If the journal name is just one word, as @northword mentioned, should we simply pass it?

koppor commented 1 month ago

> Hey, when implementing the check logic for 'WARN: abbreviation is the same as the full text,' should we only give a warning if the journal's name has more than one word and the abbreviation is the same as its full name? If the journal name is just one word, as @northword mentioned, should we simply pass it?

Yes.
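
The rule agreed on here (warn only when the full name has more than one word) could be written as a small predicate. A minimal sketch, with a hypothetical function name and a made-up multi-word example:

```python
def abbreviation_equals_full(full, abbrev):
    """Return True if a multi-word full name is "abbreviated" to itself.

    One-word names such as "Fuel" or "Quantum" are passed, as agreed above.
    """
    full, abbrev = full.strip(), abbrev.strip()
    return len(full.split()) > 1 and full == abbrev


# "Quantum" is from the issue; the second pair is a made-up example.
for full, abbrev in [("Quantum", "Quantum"), ("Fuel Cells", "Fuel Cells")]:
    if abbreviation_equals_full(full, abbrev):
        print(f"WARN: abbreviation is the same as the full text: {full}")
```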

philcaz commented 1 month ago

My current function that checks the starting letters of abbreviations flags the entries below as invalid, because the starting letters of the abbreviations do not match the full names.

Full: 'Polish Academy of Sciences', Abbrev: 'Acta Phys. Polon. A'
Full: 'Jagellonian University', Abbrev: 'Acta Phys. Polon. B'
Full: 'Universităţii din Timișoara', Abbrev: 'An. Univ. Timișoara Ser. Mat.-Inform.'
Full: 'Universităţii "Ovidius" Constanţa', Abbrev: 'An. Ştiinţ. Univ. Ovidius Constanţa Ser. Mat.'

However, these abbreviations seem to be legitimate for the corresponding full names, even though the connection is not obvious. Could you give me some idea of how I should refine the criteria for invalidity?

koppor commented 1 month ago

Maybe a hard coded list of exceptions? 😅

philcaz commented 1 month ago

Not sure how many there are to be hardcoded : ( I might try using some similarity threshold to check them. That way abbreviations that are legitimate but are too different from the original full names would fail the check. Does that work?

koppor commented 1 month ago

> Not sure how many there are to be hardcoded : ( I might try using some similarity threshold to check them. That way abbreviations that are legitimate but are too different from the original full names would fail the check. Does that work?

I haven't tried.

Maybe test cases need to be generated.

Maybe warnings can be output, and the user can then generate an exception file, similar to .lycheeignore for the link checker lychee.

One might also output a number stating the distance.

For manual lists, this is helpful.

For downloaded lists, reports could be made.

I think, there are bugs in the lists.
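
The similarity threshold, the distance number, and the exception file discussed above could be combined roughly as follows. This is an untested idea sketched with `difflib.SequenceMatcher` from the Python standard library; the threshold of 0.3, the ignore-file format (one full name per line, `#` for comments), and all function names are assumptions.

```python
from difflib import SequenceMatcher


def similarity(full, abbrev):
    """Return a case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, full.casefold(), abbrev.casefold()).ratio()


def parse_ignore_list(text):
    """Parse a hypothetical ignore file: one full name per line, '#' starts a comment line."""
    return {
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }


def report(rows, ignored, threshold=0.3):
    """Return (full, abbrev, score) for rows below the threshold and not ignored."""
    warnings = []
    for full, abbrev in rows:
        if full in ignored:
            continue
        score = similarity(full, abbrev)
        if score < threshold:
            warnings.append((full, abbrev, round(score, 2)))
    return warnings


rows = [
    ("Advances in Applied Mathematics", "Adv. Appl. Math."),
    ("Polish Academy of Sciences", "Acta Phys. Polon. A"),
]
ignored = parse_ignore_list("# known exceptions\nPolish Academy of Sciences\n")
for full, abbrev, score in report(rows, set()):
    print(f"WARN: abbreviation differs strongly from full name ({score}): {full} / {abbrev}")
print("after applying ignore list:", report(rows, ignored))
```

Printing the score next to each warning makes it easier to tune the threshold by hand, and the ignore list keeps known-legitimate entries from failing the check on every run.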