ContentMine / pyCProject

Provides basic functions to read a ContentMine CProject and CTrees into Python data structures.
MIT License

Use metadata in get_ functions if scholarly.html not available #14

Open solstag opened 7 years ago

solstag commented 7 years ago

Just an idea as I'm starting to play with pycproject...

I have a mix of open-access and closed articles, all of which have metadata, but only some of which have scholarly.html. This could arise in other situations as well, for example if some issue prevented scholarly.html from being generated for some files, or if I haven't yet downloaded the articles.

In this situation, if I have code that uses get_title, get_abstract, etc., I would expect it to get data from scholarly.html if available, and otherwise get what it can from the metadata.

This way I don't have to write two different code paths for open and closed articles, and code that only uses information available in the metadata works before downloading the articles.

Does this make sense? Or is the metadata structure so repository-specific that it makes no sense to try to get information reliably from it?

Cheers

chreman commented 7 years ago

If I understand you correctly, you would like something like

try:
    title = get_title_from_metadata()
except Exception:  # if failing for some reason
    title = get_title_from_scholarlyhtml()

or the other way round? What would you like to have as the primary, and what as the fallback resource?

solstag commented 7 years ago

Well, it will pretty much depend on the relative quality of the metadata and the normalized fulltext you're dealing with, so it would make more sense if the behavior were configurable, like:

def get_title( ... , sources=['scholarly', 'metadata']):
    get_from_source = {'scholarly': get_title_from_scholarlyhtml, 'metadata': get_title_from_metadata}
    for source in sources:
        try:
            return get_from_source[source]()
        except Exception:  # this source is unavailable, try the next one
            pass
    return None  # no source could provide a title

This would retain scholarly as the default primary source, add metadata as the secondary one, and return None if no source is available. It's just an example. []s
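
For what it's worth, the same idea can be sketched as a small standalone helper that is not tied to get_title. Everything here is hypothetical and not part of pycproject: the name first_available, the getters mapping, and the two stand-in getter functions are only there to illustrate the ordered fallback.

```python
def first_available(getters, sources=("scholarly", "metadata")):
    """Try each named source in order and return the first usable value.

    `getters` maps a source name to a zero-argument callable (hypothetical;
    in pycproject these would wrap the scholarly.html and metadata readers).
    Returns None if every source fails or yields None.
    """
    for source in sources:
        # Look the getter up outside the try block so that a misspelled
        # source name raises KeyError instead of being silently skipped.
        getter = getters[source]
        try:
            value = getter()
        except Exception:
            # This source is unavailable (missing file, missing key, ...).
            continue
        if value is not None:
            return value
    return None


# Stand-in getters for illustration only.
def title_from_scholarly():
    raise FileNotFoundError("scholarly.html")  # simulate a closed article


def title_from_metadata():
    return "Title taken from the metadata"


getters = {"scholarly": title_from_scholarly, "metadata": title_from_metadata}
title = first_available(getters)
```

With these stand-ins, scholarly fails and the metadata title is returned; passing sources=("metadata", "scholarly") would flip the priority. Catching Exception broadly keeps the sketch short, but real code would probably want to catch only the errors that mean "source not available".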