Have you seen this for a parser: http://freecite.library.brown.edu/ ? It picks out the informational content from a citation and color-codes the parts. (The color-coded example citation didn't survive the paste, but for the citation I tried, the parsed fields were correct.)
The most appropriate fork seems to be this one from Academia.edu. That fork drops the web-app elements and provides a Ruby API instead. I don't know any Ruby, but I'm sure we could hack together something in Python to call the package. Free-cite extracts a number of useful fields from each reference, which we could turn into unique identifiers to use as features.
I also looked into a few other packages, including pdfx (which was pretty ineffective) and refextract (which didn't actually seem applicable).
The fork you reference refers to this one. Not sure if there is a difference that matters.
The other fork seems to be more up to date, but it's also focused on the web app rather than an API. That's just based on the README, though.
They're both in Ruby, though, so if one fork has improvements, they'd be in the internal code.
Oh wait, when you say "web app," do you mean that it runs as a server? If the long-run plan is to offer parts of this project as web distributables, then building to local servers, even for internal services, might be a good target along the way.
Or are you imagining calling the Ruby from within a Python program, with the Ruby code wrapped (a la C) as a Python package? It might be easier to run the Ruby as an independent process and access it as a service via its RESTful API. Then all you need is Python's urllib (whichever version is current). You do have to start the parser process yourself, but that's not a big deal.
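Something like this sketch, say (the /citations/create endpoint and the "citation" parameter name are my recollection of the FreeCite README, so treat them as assumptions until we check against the fork we pick):

```python
# Rough sketch of hitting a locally running free_cite server from Python 3.
# The /citations/create endpoint and the "citation" parameter name are taken
# from the FreeCite README as I remember it; verify before relying on them.
import json
import urllib.parse
import urllib.request

def parse_citation(raw_citation, host="http://localhost:3000"):
    """POST one raw citation string to the parser; return the parsed fields."""
    data = urllib.parse.urlencode({"citation": raw_citation}).encode("utf-8")
    request = urllib.request.Request(
        host + "/citations/create",
        data=data,
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

fields = parse_citation(
    "Poldrack RA. (2006). Can cognitive processes be inferred from "
    "neuroimaging data? Trends in Cognitive Sciences, 10(2), 59-63."
)
print(fields)
```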
It does seem to run as a server. I wonder if there's a limit on API calls. The README doesn't mention one, so I'm going to assume not.
In any case, my plan was to call the Ruby code from within Python using system calls or something. Not that I think that's a good idea; it's just the only approach I know. Using it as a service sounds better, though, assuming they don't charge.
No, sorry, I was not clear. I think you can download the code to your own system, then run it locally (at localhost:3000) and attach to it that way--no other computer involved, a "local server." The cURL example is running that way.
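Pointed at a local instance, the README's cURL example would look roughly like this (again, endpoint and parameter name from memory, so double-check):

```bash
# Hypothetical reconstruction of the README's cURL call against a local server.
curl -H "Accept: application/json" \
     -d "citation=Poldrack RA. (2006). Can cognitive processes be inferred from neuroimaging data? Trends in Cognitive Sciences, 10(2), 59-63." \
     http://localhost:3000/citations/create
```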
Okay. In that case I guess there's no reason not to run it as a server.
@mriedel56 and I will attempt to incorporate this into the paper version in #27.
CrossRef, possibly combined with PubMed metadata, could be used to identify references (both the papers an article cites and the papers that cite it).
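As a sketch of the cited-papers direction, CrossRef's public REST API returns a work's deposited reference list at https://api.crossref.org/works/{doi}. The "reference" field is only present when the publisher deposits references, the citing-papers direction is a separate CrossRef Cited-by service as far as I know, and the DOI below is just an example:

```python
# Sketch: pull an article's deposited reference list from the CrossRef REST
# API. The "reference" field only exists when the publisher deposits
# references; the DOI below is only a placeholder example.
import json
import urllib.parse
import urllib.request

def crossref_references(doi):
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    with urllib.request.urlopen(url) as response:
        work = json.loads(response.read().decode("utf-8"))["message"]
    return work.get("reference", [])

for ref in crossref_references("10.1016/j.tics.2005.12.004"):
    # Each entry may carry structured fields (DOI, year) or one raw string.
    print(ref.get("DOI") or ref.get("unstructured"))
```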
References seem like a good source of information. Jason wrote regular expressions to extract references last semester, but I think we need a more developed tool. I will look into available extraction tools.