amchagas / open-hardware-supply

Having a closer look at how OSH papers are evolving over time
MIT License

New approach to data collection (using Google Scholar then WoS then Unpaywall) #10

Closed solstag closed 5 months ago

solstag commented 3 years ago

Ni! Here's a new approach that @amchagas and I put together earlier today. It follows from our conclusion (see #8) that one can't easily tell automatically from the abstract whether a paper is about open hardware, so our previous approaches (WoS + Scopus + SciELO) were very limiting, since those databases don't permit full-text search.

The new approach consists of:

  1. Search Google Scholar for "open hardware" or "open-source hardware" (or something similar), saving all result pages and extracting the article data. That should be about 20k results; at 10 results per page, that makes 2k requests.

    Here's an example of a proper (i.e. using Scrapy) Google Scholar scraper: Build Your Own Google Scholar API With Python Scrapy

  2. Using the titles and whatever metadata is available (first author, year), do a WoS/Scopus search to get each article's full metadata and DOI, and to confirm whether it falls within our scope (published journal and conference papers?).

  3. As long as we process the full-text contents manually, we can also fetch the full text manually, as that adds very little overhead. If we automate, then get the full text through Google Scholar, Unpaywall, or, if necessary, Sci-Hub.
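The request count in step 1 can be sketched as below. This is only an illustration, not the project's scraper: it assumes Google Scholar result pages are addressed with the `q` and `start` URL parameters, and `page_urls` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumption: result pages are selected with `q` (query) and `start`
# (offset of the first result), with 10 results per page.
BASE_URL = "https://scholar.google.com/scholar"
RESULTS_PER_PAGE = 10


def page_urls(query, total_results):
    """Return one result-page URL per block of RESULTS_PER_PAGE results."""
    urls = []
    for start in range(0, total_results, RESULTS_PER_PAGE):
        params = urlencode({"q": query, "start": start})
        urls.append(f"{BASE_URL}?{params}")
    return urls


urls = page_urls('"open hardware"', 20_000)
print(len(urls))  # one request per result page
print(urls[1])    # second page, offset 10
```

Each URL would then be handed to the scraper (e.g. a Scrapy spider) at a slow, polite rate.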

amchagas commented 3 years ago

OK, after a long time without managing to work on this, I looked at the link showing how to build a spider using Scrapy and spent a bit of time trying to download things... The first query I managed was "open hardware", and the CSV file can be found here:

https://github.com/amchagas/open-hardware-supply/blob/master/data/scrapy/openhardware_query.csv

I still have to check the content of the data in more detail. Maybe we could have a chat about where to take this?

solstag commented 3 years ago

Cool! I agree we should use a more restricted query when searching the full text. Probably only things we are really sure refer to what we want, like combinations of ("open", "open-source") with ("hardware", "science hardware", "scientific hardware"). This way we can argue for a well-defined scope tied to the usage of the term in the full text. Let me know when is a good time for a chat!
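A minimal sketch of how those combinations could be enumerated as exact-phrase queries. The helper names are hypothetical, and joining the phrases with `OR` is an assumption about how the final query would be written, not something settled in this thread.

```python
from itertools import product

PREFIXES = ["open", "open-source"]
NOUNS = ["hardware", "science hardware", "scientific hardware"]


def exact_phrases(prefixes, nouns):
    """Build one double-quoted phrase per (prefix, noun) combination."""
    return [f'"{prefix} {noun}"' for prefix, noun in product(prefixes, nouns)]


def or_query(phrases):
    """Join phrases into a single query (assumes the engine accepts OR)."""
    return " OR ".join(phrases)


phrases = exact_phrases(PREFIXES, NOUNS)
for phrase in phrases:
    print(phrase)
print(or_query(phrases))
```

Running the phrases as separate queries instead of one `OR` query would also keep track of which phrase each hit came from.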

amchagas commented 3 years ago

A little update: we learned the hard way that Google Scholar only displays the first 1000 results for any search term. So the search "open hardware" gives 14800 hits, but only the first 1000 are displayed.

This complicates things, as we are then very far from a data collection that is representative of all the papers that have "open hardware" somewhere in the body.

We are trying other things to see whether we can break the search terms down so that each query returns fewer than 1000 references, and then scrape things "slowly".

One more thing we learned is that GS allows searches like "open * hardware", which finds "open source hardware", "open science hardware", "open access hardware" and "open loop hardware", but does not find "open hardware".

Removing terms (that is, a search that keeps all matching phrases but one, e.g. "open source hardware" and "open science hardware" but not "open loop hardware") can be done with a syntax like this (the order of terms matters): "open * hardware" -loop
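As a small sketch, the query strings described above could be assembled like this. `wildcard_query` is a hypothetical helper; it only builds the string, placing the exclusions after the quoted phrase as in the example above.

```python
def wildcard_query(phrase, exclude=()):
    """Quote a phrase (which may contain `*`) and append `-term` exclusions.

    The quoted phrase comes first and the exclusions follow it.
    """
    parts = [f'"{phrase}"']
    parts.extend(f"-{term}" for term in exclude)
    return " ".join(parts)


# The wildcard form does not match plain "open hardware",
# so that phrase needs its own query.
queries = [
    wildcard_query("open * hardware", exclude=["loop"]),
    wildcard_query("open hardware"),
]
for query in queries:
    print(query)
```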

amchagas commented 3 years ago

Here is some new data. All files are "JSON Lines" files that can be opened with pandas, as can be seen in the code link below.
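For readers unfamiliar with the format: a JSON Lines file holds one JSON object per line. The sketch below is not the project's code; it reads such data with the standard library only, and the field names in the sample are made up for illustration. With pandas, `pandas.read_json(path, lines=True)` loads the same format into a DataFrame in one call.

```python
import io
import json


def read_json_lines(handle):
    """Parse one JSON object per non-empty line of a text stream."""
    records = []
    for line in handle:
        line = line.strip()
        if line:  # tolerate blank lines between records
            records.append(json.loads(line))
    return records


# Made-up sample standing in for a scraped file; real files would be
# opened with open(path, encoding="utf-8").
sample = io.StringIO(
    '{"title": "Example paper A", "year": 2019}\n'
    '{"title": "Example paper B", "year": 2020}\n'
)
records = read_json_lines(sample)
print(len(records), records[0]["title"])
```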

And here is some new code.

I think it is time for a chat again...

solstag commented 3 years ago

Ni! Let's chat chat chat. This week and early next week I am busy with responding to a reviewer. Maybe Thursday or Friday next week?

amchagas commented 3 years ago

Works for me! I'll DM you for time details

solstag commented 2 years ago

Note on how to match gscholar entries with WoS entries, given our new powers of querying the WoS API: "it lets us pull the titles, but we need to improve the search input with other elements besides the title (e.g., first author of the gscholar result and maybe the year) and the logic for choosing the best result in the output (e.g., check that the gscholar first author is present in the WoS author list, check whether the year is close, ± 1)"

solstag commented 2 years ago

https://github.com/amchagas/open-hardware-supply/commit/85ed6211f6dd98b8cdb5ccea7b196b301e82bbc3 implements querying by author name and year, and then checking title, author name and year among the matching records, with tolerance for small errors and logging of accepted non-exact matches for later manual verification. There are still some corners to polish, but I suspect we should now be able to get the WoS records for pretty much every scraped entry that has one.
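To illustrate the kind of check described here, a rough sketch follows. It is not the code in the commit: the record fields (`title`, `first_author`, `authors`, `year`), the 0.9 title threshold, and the use of `difflib` are all assumptions made for the example, and the sample records are invented.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse anything but letters and digits to spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def title_similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_record(gs, wos, min_title_ratio=0.9, year_tolerance=1):
    """Return 'exact', 'approximate', or None for a (gscholar, WoS) pair."""
    ratio = title_similarity(gs["title"], wos["title"])
    if ratio < min_title_ratio:
        return None
    tokens = normalize(gs["first_author"]).split()
    if not tokens:
        return None
    surname = tokens[-1]  # assumes the surname is the last token
    if not any(surname in normalize(a).split() for a in wos["authors"]):
        return None
    year_gap = abs(gs["year"] - wos["year"])
    if year_gap > year_tolerance:
        return None
    return "exact" if ratio == 1.0 and year_gap == 0 else "approximate"


# Invented records, for illustration only.
gs = {"title": "A made-up open-hardware title", "first_author": "A Chagas", "year": 2018}
wos = {"title": "A Made-Up Open Hardware Title.", "authors": ["Chagas, A. M."], "year": 2018}
print(match_record(gs, wos))
print(match_record(gs, dict(wos, year=2019)))
```

Pairs that come back as `'approximate'` are the ones worth logging for later manual verification.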

solstag commented 2 years ago

Currently running on one of my lab's servers to produce the full WoS records database from the Google Scholar entries (=