amchagas / open-hardware-supply

having a closer look on how OSH papers are evolving over time
MIT License

Integrate Plos Collection Data Dump in sources #8

Open solstag opened 3 years ago

solstag commented 3 years ago

Two perspectives:

  1. ensure we've got all articles in this collection covered
  2. together with the other sources, produce a larger collection for https://github.com/amchagas/open-source-toolkit

In doing this I may be tempted to refactor data loading a little bit.

amchagas commented 3 years ago

I worked on the original file that lives in https://github.com/amchagas/open-source-toolkit. Namely, I classified all entries as hardware, software or both... Now it should be easy to filter out only the hardware ones.

Sorry, I am only now seeing that you are working on the data. Would you mind giving a little more detail on what you have done so far?

solstag commented 3 years ago

ACK, since it's such a small file, I'll update the code to download it from that repository instead of keeping a copy. We'll talk details on Monday (:
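
Loading the file on demand instead of keeping a copy could look roughly like the sketch below. The function name `load_remote_csv` is hypothetical, and the URL is left as a parameter because the raw location of the file in open-source-toolkit is not stated in this thread.

```python
import csv
import io
import urllib.request


def load_remote_csv(url, encoding="utf-8"):
    """Fetch a CSV from `url` and return its rows as a list of dicts.

    Works with any scheme urllib understands (https://, file://, ...).
    """
    with urllib.request.urlopen(url) as response:
        text = response.read().decode(encoding)
    # DictReader uses the first line of the file as column names.
    return list(csv.DictReader(io.StringIO(text)))
```

Fetching on every run keeps the data in sync with the other repository, at the cost of needing network access whenever the loader runs.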

solstag commented 3 years ago

Ni! André (@amchagas), when you find the time, can you work a bit more on the plos-items.csv? Specifically:

PS: I think I've explained the situation in DM, but if you wanna see what I'm seeing check out https://github.com/amchagas/open-hardware-supply/commit/b2cab2793c0ee2cf7b25c802cbe48263344bd787 .

amchagas commented 3 years ago

Hi @solstag !

I just worked a bit more on the csv file; it is uploaded to the repo.

To keep track of which entries were missing DOIs, I added a new column "missing DOIs" and put the missing information there. I managed to get 32 DOIs, but in the process noticed that there were some misclassified entries. For instance, some things that should have been classified as "web articles" were classified as "research articles" (the page showing the GOSH manifesto was one of them). I counted 14 of these misclassifications (note that this number also includes the reverse case: research articles classified as web articles). From what I remember, this classification came from the PLoS people, so I am not sure if there are more of these cases in there.

For the entries that now have a DOI, I found that some are found by our query and some are not. Some of the misses seem to be because of hyphens: the papers spell "open-hardware" rather than "open hardware", or "open-source hardware", etc. Maybe we would gain from adding these variants to the query.

The same is true for the PLoS ONE DOIs you listed above. (Also, these DOIs, in their current format, do not lead to the article webpage when I paste them into my browser's address bar.)
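
One way to make DOIs comparable and clickable regardless of how they were typed in is to extract the bare DOI and rebuild a resolver URL from it. This is only a sketch: `normalize_doi` and `doi_url` are hypothetical helper names, and the regular expression is a common heuristic rather than a full DOI grammar.

```python
import re

# Heuristic: "10.", a 4-9 digit registrant code, a slash, then a suffix
# made of non-whitespace characters.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def normalize_doi(raw):
    """Return the bare DOI in lower case, or None if no DOI is found."""
    match = DOI_PATTERN.search(raw.strip())
    if match is None:
        return None
    # Drop trailing punctuation picked up from prose; DOIs are case-insensitive.
    return match.group(0).rstrip(".,;").lower()


def doi_url(raw):
    """Return a https://doi.org/ URL for the DOI, or None if none is found."""
    doi = normalize_doi(raw)
    return None if doi is None else "https://doi.org/" + doi
```

Running every value in the DOI columns through `normalize_doi` before comparing sources would also avoid false mismatches caused by prefixes such as `doi:` or differences in letter case.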

solstag commented 3 years ago

Ni! Cool, excellent, thanks!

Hm, I would be surprised if Scopus or WOS would not find "open-hardware" from "open hardware", I was assuming they treat hyphens as spaces. Did you confirm that?

amchagas commented 3 years ago

ok, so you are correct, both databases manage differences between "open hardware" and "open-hardware" and so does Scielo

amchagas commented 3 years ago

I forgot to follow up on my comment: If hyphenation is not the issue for some of them getting found and some not, then what could be the issue? I could not think of anything obvious...

(Could be that the ones that are not found are not indexed in those databases somehow?)

I guess the question is, do we want to do a deep dive in this issue, or do we acknowledge it exists and move on?

solstag commented 3 years ago

The issue is that these open hardware papers may not use any terms to designate open hardware in their title+abstracts+keywords, or we might not have the right terms.

The idea would be to make a table for the 10 papers with:

  1. which of our terms are present
  2. whether it would be found by our query or not
  3. which other terms it uses to indicate that it deals with OH

amchagas commented 3 years ago

Some results are in :P From the ten papers I found some keywords that we might want to test:

  1. "open source design" (3 papers): ~100 hits at WOS
  2. "open source method" (1 paper): 50 hits at WOS
  3. "open source tool" (1 paper): 1201 hits at WOS (and quite some software papers)
  4. "open source electronics" (1 paper): 62 hits at WOS
  5. "inexpensive hardware" (1 paper): using this could lead to a lot of papers that describe affordable solutions but not necessarily open source

Other than that, one of the papers does not actually share the data needed for replication and is more focused on a biological question than on the description of a tool or method.

Two other papers mention likely keywords only in the introduction: "open hardware design" and "open source hardware".

One paper has no mention of open source whatsoever, even though code and design files are nicely placed in GH.

None of these has our terms in the title and/or abstract.

I have saved a table with this info under the /data folder


amchagas commented 3 years ago

A quick look at WOS shows that "open source design" returns about 100 entries. Some are not hardware, but there are also some hardware articles we did not find before.

A possible keyword combination is "open source 3D printed", which returns 20 articles in WOS, all of them about hardware.

solstag commented 3 years ago

Ni! Ok, I've checked the table. It answers question 3 (which other terms it uses to indicate that it deals with OH), but it doesn't answer questions 1 and 2 (which of our terms are present / whether it would be found by our query or not). So we still don't know whether our current search would catch those. Or am I missing something? Cheeeers
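
Questions 1 and 2 could in principle be answered mechanically once the title, abstract and keywords of each paper are available. The sketch below assumes each record is a dict with hypothetical `title`, `abstract`, `keywords` and `doi` keys, and approximates the databases' behaviour by treating hyphens and punctuation as spaces; it is not the project's actual ETL code.

```python
import re


def normalize(text):
    """Lower-case and replace every run of non-alphanumerics with one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def terms_present(record, terms, fields=("title", "abstract", "keywords")):
    """Return the terms that occur as whole phrases in the searchable fields."""
    haystack = " " + normalize(" ".join(record.get(f, "") for f in fields)) + " "
    # Padding with spaces makes the match respect word boundaries.
    return [t for t in terms if " " + normalize(t) + " " in haystack]


def audit(records, terms):
    """One row per record: which terms are present, and would the query hit."""
    rows = []
    for record in records:
        present = terms_present(record, terms)
        rows.append({"doi": record.get("doi"), "present": present, "found": bool(present)})
    return rows
```

Because only the listed fields are inspected, a term that appears solely in the body of a paper does not count as present.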

amchagas commented 3 years ago

You are not missing anything. I failed to write this down explicitly: in the ods file there is an "observations" column where I commented on some of the papers, with things like "our keywords are only present in the introduction" (so not in the abstract, title or keywords). For the rows with no observations, our keywords were not found in the paper. This is because the plos collection was made by people either submitting things to it (so the authors knew their paper matched the collection requirements, even though they did not use the keywords we are now looking for), or by us finding papers "by chance"...

solstag commented 3 years ago

So, basically, you're saying that none of those 10 papers would show up in our current search? Well that's pretty bad.

EDIT: sorry, paying more attention to the keywords (and not only the observations) I guess two of them would show up because they contain "open hardware". It's still bad though. The upside is that by adding "open source design" we'd cover half of them.

amchagas commented 3 years ago

I think it would be useful to check again how many papers in the plos collection are not caught by our query. I mean, do these 10 selected papers represent a large number of papers that our query misses, or was it more of a fluke? In any case, we can make a note of that in the writing and try to assess the size of the issue. I think we are going to have to go through more entries manually anyway, to check a statistically significant subsample of the papers, so we might learn more when we do that...

solstag commented 3 years ago

This was a random sample of "papers from the plos collection having plos DOIs" (:X). So it should be somewhat representative of the plos papers. I'm going to directly check our current data for all the DOIs in X, and see what we find in those papers absent from it. But it's clear we are leaving a lot of - and still possibly most! - papers out.
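
Checking the DOIs in X against our current data amounts to a set comparison. A minimal sketch, assuming both sides are available as plain iterables of DOI strings (the function name `coverage` is hypothetical):

```python
def coverage(reference_dois, corpus_dois):
    """Compare two DOI collections, ignoring case and surrounding whitespace.

    reference_dois: the DOIs we expect to find (e.g. the set X).
    corpus_dois: the DOIs present in the search results.
    """
    reference = {d.strip().lower() for d in reference_dois if d and d.strip()}
    corpus = {d.strip().lower() for d in corpus_dois if d and d.strip()}
    found = reference & corpus
    share = len(found) / len(reference) if reference else 0.0
    return {
        "found": sorted(found),
        "missing": sorted(reference - corpus),
        "share": share,
    }
```

The `missing` list is the interesting part here, since those are the papers whose titles, abstracts and keywords need to be inspected by hand.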

solstag commented 3 years ago

Ni! Done:

The conclusion is that these numbers make it hard to argue that BIBLIO is representative.

We can try to add "design" to the query and see how this improves, given that in our sample of 10 so many used that. I'm committing that change to the query generator in project_definitions.py. Is it too much work to regenerate the RIS files?

But I'm not very confident that it will improve the situation enough. I'd hope to find at least half of the PLOSC stuff in BIBLIO. I'm trying to think of what else we can do. Maybe we could add "open source method" and "open source electronics" like you suggested. Or include "open source tools", but conditioned on the record also mentioning "hardware" or "electronics".
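
To make the "conditioned on" idea concrete, here is a sketch of how such a query string could be assembled. It is not the generator in project_definitions.py; the function names are hypothetical and the output uses generic boolean syntax, without the field wrappers each database requires.

```python
def quote(term):
    """Wrap a phrase in double quotes so it is searched as an exact phrase."""
    return '"' + term + '"'


def any_of(terms):
    """Join phrases with OR inside parentheses."""
    return "(" + " OR ".join(quote(t) for t in terms) + ")"


def build_query(phrases, conditional=None):
    """Build a boolean query string.

    phrases: phrases that qualify a record on their own.
    conditional: mapping from a broad phrase to context words, at least
        one of which must also appear for the record to qualify.
    """
    clauses = [quote(p) for p in phrases]
    for phrase, context in (conditional or {}).items():
        clauses.append("(" + quote(phrase) + " AND " + any_of(context) + ")")
    return " OR ".join(clauses)
```

Calling `build_query(["open hardware", "open source method"], {"open source tools": ["hardware", "electronics"]})` yields three OR-ed clauses, the last of which only matches records that also mention "hardware" or "electronics".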

Cheers!

amchagas commented 3 years ago

Just did a new search using the new definitions for Scopus, WOS and Scielo. For Scopus, I exported only "articles", avoiding "proceedings", "book chapter" etc., since the number of entries was quite high and exporting is limited to 2000 entries (I could export every type, but would then have to filter, select etc.).

solstag commented 3 years ago

Ni! There's something wrong with the new files. Files "scielo.ciw" and "wos1-500.ciw" are the same. Can you check?
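
A check like this can be automated by hashing the exported files before they enter the pipeline. A small sketch using only the standard library; the function name and the idea of running it on the export folder are assumptions, not part of the existing code.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_identical_files(directory, pattern="*"):
    """Group files in `directory` by SHA-256 digest.

    Returns a list of groups (lists of file names) that share identical
    content; files with unique content are not reported.
    """
    by_digest = defaultdict(list)
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path.name)
    return [names for names in by_digest.values() if len(names) > 1]
```

An empty result means no two exports are byte-for-byte identical; it does not rule out two different exports that overlap heavily in their records.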

amchagas commented 3 years ago

Oops! Should be fixed now...

solstag commented 3 years ago

Ni!

  1. Ok, big mistake, turns out "open design" brings in a lot of unrelated medical literature, so we have to exclude it and keep only the other combinations with "design" : P

  2. I've checked the 10 sampled articles and found something strange: with the new search, we get articles 2, 3 and 6 (your table index). Those are the articles you tagged with "open source design". But we do not get the articles 7 and 10, respectively with "open hardware" and "open source hardware". That's weird!

So I checked with the earlier data and it seems we weren't getting 7 and 10 before either. Then I went looking into the contents of the articles, and it turns out you marked them using the full-text. The problem is that Scopus and Wos, and probably Scielo as well, only let us search on the title+abstract+keywords.

  3. The only way to search full text at large scale is through CORE, but it's restricted to open access sources, mostly from institutional repositories: https://core.ac.uk/

That could be a different approach, less usual, but perhaps easier. We can say that we limit ourselves to open access publications available through CORE because searching the full text is more reliable, even if CORE coverage is messy to define. We'll miss both paywalled and OA content not in CORE. I checked and Sensors is consistently there because it's in PMC; PLOS seems to depend on institutional repositories but does get indexed full text; HardwareX papers can be found, but it's worse as it denies full text indexing.

  4. With some recent improvements in the code and the current search, we went from 19 to 28 abstracts in both BIBLIO and PLOSC. But that's still pretty low. I'm thinking we might want some kind of AI to learn from the abstract whether the paper is introducing open source hardware. I'm thinking that we should keep improving our search until we get to about a third of the papers in PLOSC, at which point we may publish something. And later we can use the data from that to train a machine learning model.

So, do you think you can play some more with the searches? Maybe we want to include some more search statements like ("open source" AND "electronics"), which will catch papers containing both terms even if they're not contiguous, or even "DIY".

  5. In any case, since I've spent some hours improving the ETL, I'm going to ask you to rerun the searches with the current new definition, which excludes "open design" (see bbf3cd762108cf061385606130de3299bfdd2016), to see what we get after processing. Ok? Also, if you try some other stuff like I suggested in the previous paragraph, and you figure it's good, you can add that too. And yes, do export only "articles".
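
The difference between a phrase search and a statement like ("open source" AND "electronics") can be illustrated locally. This is a toy model of the matching, with hypothetical helper names, and not how Scopus or WOS are implemented.

```python
import re


def tokens(text):
    """Split text into lower-case alphanumeric words; hyphens act as separators."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(text, phrase):
    """True if the words of `phrase` occur contiguously and in order in `text`."""
    words, target = tokens(text), tokens(phrase)
    n = len(target)
    return n > 0 and any(words[i:i + n] == target for i in range(len(words) - n + 1))


def contains_all(text, phrases):
    """True if every phrase occurs somewhere in `text` (boolean AND)."""
    return all(contains_phrase(text, p) for p in phrases)
```

For an abstract such as "We present open source designs for low-cost electronics", `contains_phrase(abstract, "open source electronics")` is false, while `contains_all(abstract, ["open source", "electronics"])` is true, which is the extra recall the AND form is meant to buy.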

Hugs,