Additional metadata for journal scrapers

ContentMine / journal-scrapers

Journal scraper definitions for the ContentMine framework


Additional metadata for journal scrapers #29

Closed: KnowledgeGarden closed this issue 9 years ago

KnowledgeGarden commented 9 years ago

I mentioned this on the Etherpad. In the scrapers I write for my OpenSherlock project, I include fields for MeSH names, substance names, and keywords as additional metadata. I am considering rewriting my scrapers to follow these open standards. What are the chances of extending the journal scrapers to cover those fields?

blahah commented 9 years ago

Hi, which EtherPad are you referring to? Happy to consider extending the scrapers if a significant proportion of journals contain the information you're interested in. In my latest iteration, I have included keywords.

KnowledgeGarden commented 9 years ago

https://etherpad.mozilla.org/sciencelab-2014summersprint-mining-literature

My work with PubMed abstracts suggests that keywords are present, as are MeSH names (the documents are essentially hand-tagged by domain experts) and chemical substance names. All of these play key roles when teasing meaning out of documents. My work entails crafting topic maps from text documents, so every available hint is valuable.

blahah commented 9 years ago

Ah, OK. We use etherpads at a lot of events and generally don't use them after the event except for historical archiving.

If PubMed has keywords, MeSH names, and chemical substances, I'm happy to scrape them. If you'd like to contribute them to the scrapers, that would be welcome, and I will also include them in future scrapers. However, please wait until after this weekend before making any pull requests, as I will be pushing updates to the whole set of scrapers.

KnowledgeGarden commented 9 years ago

As soon as I get past a hard disk crash here, I'll create a gist showing what the XML looks like for PubMed keywords, MeSH names, and substance names. As for contributing a scraper, that idea is on my mind. I have been building OpenSherlock without even knowing about this project, so I now intend to modify my code to generate compatible scrapes. My question has been answered!
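
For illustration only, here is a minimal sketch of how the three fields discussed in this thread could be pulled out of a PubMed record. It is not the promised gist and not code from journal-scrapers or OpenSherlock. The element names (`MeshHeading/DescriptorName`, `Chemical/NameOfSubstance`, `KeywordList/Keyword`) are assumed from the PubMed citation XML format, the sample record is invented, and `extract_metadata` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Invented sample record (not real PubMed data). The element names are
# assumed to match the PubMed citation XML layout.
SAMPLE = """<PubmedArticle>
  <MedlineCitation>
    <ChemicalList>
      <Chemical>
        <NameOfSubstance>Example Substance</NameOfSubstance>
      </Chemical>
    </ChemicalList>
    <MeshHeadingList>
      <MeshHeading>
        <DescriptorName>Example Heading A</DescriptorName>
      </MeshHeading>
      <MeshHeading>
        <DescriptorName>Example Heading B</DescriptorName>
      </MeshHeading>
    </MeshHeadingList>
    <KeywordList>
      <Keyword>example keyword</Keyword>
    </KeywordList>
  </MedlineCitation>
</PubmedArticle>"""


def extract_metadata(xml_text):
    """Hypothetical helper: collect MeSH names, substances, and keywords."""
    root = ET.fromstring(xml_text)

    def texts(path):
        # Return the stripped text of every element matching the path,
        # skipping elements that have no text.
        return [el.text.strip() for el in root.findall(path) if el.text]

    return {
        "mesh_names": texts(".//MeshHeading/DescriptorName"),
        "substances": texts(".//Chemical/NameOfSubstance"),
        "keywords": texts(".//KeywordList/Keyword"),
    }


if __name__ == "__main__":
    print(extract_metadata(SAMPLE))
```

The same three paths could in principle be expressed as XPath selectors in a scraper definition. How they would map onto the journal-scrapers format is not settled in this thread.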