NIME-conference / nime-website

NIME Conference Website in Jekyll
Paper indexing feature #4

Open tatecarson opened 4 years ago

tatecarson commented 4 years ago

Hello,

I was just checking out the NIME Publication ecosystem workshop and I thought I would post an idea for feedback.

I often want to search the entire proceedings to see whether something has been mentioned before, but I can usually only do this with papers that are on Google Scholar or ones I have on my computer. I know that you can search the titles on the proceedings page, but it can be very slow and it is not an ideal way to do research.

It would also be great if there was some way of searching the text of these documents from one place. I think this would make the archive much more meaningful.

I am not sure how difficult it would be to do this. I think you could generate an index of all of the papers offline and then have that be searchable and somehow linked to each PDF? I am interested in helping implement this but I do not know how to do each part. I am also not sure if it is even something that people need or want.

Thanks, looking forward to hearing your thoughts.

cpmpercussion commented 4 years ago

Hi Tate,

This is an awesome idea. The searchability of the text of the NIME proceedings is actually pretty low!

One idea would be to do some text analysis by scraping the text out of every PDF. I have a few ideas about how to do this, but it might have to happen offline (e.g., with language analysis tools in Python).

It actually sounds like it would be a good project, I might remember a few ideas about how we could do it sometime during NIME this week!

alexarje commented 4 years ago

I did text mining on the entire archive back in 2013 (see this paper). It is easy to get the text out using pdftotext, so I guess we could do that annually as part of the archiving step and make it available somehow?

Does anyone want to help?
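A minimal sketch of how the annual extraction step could be scripted, assuming the `pdftotext` command-line tool is installed and on the PATH. The function names and the directory layout (one folder of PDFs in, one folder of text files out) are assumptions for illustration, not the process used for the 2013 text mining.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(pdf_path: Path, txt_path: Path) -> List[str]:
    """Return the pdftotext invocation for a single paper."""
    # -enc UTF-8 asks pdftotext to write UTF-8 encoded output.
    return ["pdftotext", "-enc", "UTF-8", str(pdf_path), str(txt_path)]


def extract_all(pdf_dir: Path, txt_dir: Path) -> List[Path]:
    """Convert every PDF in pdf_dir to a .txt file in txt_dir.

    Papers that already have a text file are skipped, so the script
    can be re-run each year on the growing archive.
    """
    txt_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        txt_path = txt_dir / (pdf_path.stem + ".txt")
        if txt_path.exists():
            continue
        # check=True raises CalledProcessError if pdftotext fails on a file.
        subprocess.run(build_command(pdf_path, txt_path), check=True)
        written.append(txt_path)
    return written
```

Calling `extract_all(Path("pdfs"), Path("text"))` (hypothetical folder names) would return the list of text files written in that run.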

tatecarson commented 4 years ago

I would love to help, but I am not clear on how to link the text scraped from the PDFs to the website. I can do the pdftotext portion though. I have also used ocrmypdf previously and it has worked well. I'm not sure whether older NIME papers might require OCR.

I think it would be a good idea to look for a model of a proceedings that allows searching. I will look for that and see how it looks on the frontend. Maybe someone else has an idea.

alexarje commented 4 years ago

There is no need for OCR, although the PDF quality of some of the early conferences is a bit sketchy.

Great if you can look for some examples!

tatecarson commented 4 years ago

> I did text mining on the entire archive back in 2013 (see this paper). It is easy to get the text out using pdftotext, so I guess we could do that annually as part of the archiving step and make it available somehow?
>
> Does anyone want to help?

How did you automate this process? I'm having trouble figuring out that part.

After that, we can use something like elasticlunr to create a searchable front end. I am a little concerned that it will be too large to work well.
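To get a feel for the size concern before committing to a front end, here is a rough sketch of an offline inverted index built from the extracted text. It is not elasticlunr's index format; it only illustrates the general idea of mapping each term to the papers that contain it, and it measures how large the serialized JSON would be. The paper ids in the usage note are made up.

```python
import json
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return TOKEN.findall(text.lower())


def build_index(documents):
    """Map each term to the sorted list of paper ids that contain it.

    documents: dict of paper id -> extracted full text.
    """
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return the ids of papers that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


def index_size_bytes(index):
    """Size of the index once serialized as UTF-8 JSON."""
    return len(json.dumps(index).encode("utf-8"))
```

For example, with `docs = {"paper_a": "Mesh Garden is a musical game", "paper_b": "Text mining of the NIME archive"}`, `search(build_index(docs), "musical game")` returns only `paper_a`, and `index_size_bytes` on the full archive would show whether the file is small enough to ship to the browser.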

tatecarson commented 4 years ago

Mini-conf actually seems like a pretty good solution to this problem. Are you all looking at adopting some of its features? I saw it mentioned in another issue. It looks all around really great.

It doesn't do full-text searching but after some research, it seems like this would be a little difficult to do, especially with a static site. A search with titles and keywords is much better than no search at all though.

alexarje commented 4 years ago

Yes, and I like the network visualization they have there: http://www.mini-conf.org/paper_vis.html. How would this possibly work on nime.org?

alexarje commented 4 years ago

I just wonder whether any of this would help solve the Google Scholar issue as well. @tatecarson, if you are interested in testing this out, that would be great!

tatecarson commented 4 years ago

I can look into how to combine this with the current NIME site. Could you describe more about the Google Scholar problem? Is it just that the articles are not indexing?

alexarje commented 4 years ago

> Could you describe more about the Google Scholar problem? Is it just that the articles are not indexing?

It is difficult to know exactly what is wrong, but it appears that mainly articles that have been self-archived in institutional repositories end up in Google Scholar. So there may be something wrong somewhere... We used to use the papercite plugin for WordPress, which has a way of creating metadata that works well with Google Scholar. But after we changed to the new web page, we have had problems (I think). There is a little more info in the cookbook about this.
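One possible angle, offered as an assumption rather than a diagnosis: Google Scholar's inclusion guidelines describe `citation_*` meta tags (such as `citation_title`, `citation_author`, `citation_publication_date` and `citation_pdf_url`) placed in the HTML head of a page for each paper. Whether papercite produced these tags, or whether their absence is the cause of the problem, is not confirmed here. The sketch below shows how such tags could be generated from one bibliography entry; the function name and the entry's dictionary keys are hypothetical.

```python
from html import escape


def scholar_meta_tags(entry):
    """Render citation_* meta tags for one paper.

    entry is assumed to be a dict with the keys
    title, authors (list), year, booktitle and pdf_url.
    """
    tags = [("citation_title", entry["title"])]
    # One citation_author tag per author, in author order.
    tags += [("citation_author", author) for author in entry["authors"]]
    tags.append(("citation_publication_date", str(entry["year"])))
    tags.append(("citation_conference_title", entry["booktitle"]))
    tags.append(("citation_pdf_url", entry["pdf_url"]))
    return "\n".join(
        '<meta name="{}" content="{}">'.format(name, escape(value, quote=True))
        for name, value in tags
    )
```

The output would go into the head of a per-paper landing page, so using it on the Jekyll site would presumably mean generating one page per entry rather than a single long reference list.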

tatecarson commented 4 years ago

Hm, I will look into that. I am currently looking at the NIME website trying to figure out how exactly the proceedings page is generated. I see that this shortcode is doing some work with the bibjekyll plugin but I don't quite understand where that {{references}} shortcode is hooked up to the plugin.

alexarje commented 4 years ago

I think @cpmpercussion needs to assist you here, since he set it up.

cpmpercussion commented 4 years ago

@tatecarson, that bibliography layout is the template for each entry in any of Jekyll-scholar's reference lists. The "reference" object is the formatted entry, e.g.:

Tate Carson. 2019. Mesh Garden: A creative-based musical game for participatory musical performance. Proceedings of the International Conference on New Interfaces for Musical Expression, UFRGS, pp. 339–342. http://doi.org/10.5281/zenodo.3672986

In the archives.md page, the tag {% bibliography --file nime_papers %} actually gets Jekyll-scholar to generate the big reference list.

I guess it hits this in Jekyll-scholar.