solstag opened 3 years ago
I worked on the original file that lives in https://github.com/amchagas/open-source-toolkit. Namely, I classified all entries as hardware, software, or both... Now it should be easy to filter out only the hardware ones.
Sorry, I am only now seeing that you are working on the data. Would you mind giving a bit more detail on what you have done so far?
ACK, since it's such a small file, I'll update the code to download it from that repository instead of keeping a copy. We'll talk details on Monday (:
Ni! André (@amchagas), when you find the time, can you work a bit more on the plos-items.csv? Specifically:
Most addresses in the DOI field are already DOIs or very close to one (like PLOS URLs that contain the DOI); only 46 are not. Can you check whether the pages at those 46 URLs show a DOI, and if so replace the URL with the DOI?
As you do that, can you take note of whether our search query would match those articles (title+abstract+keywords)?
Also, here's a sample of 10 PLOS ONE DOIs from the collection; can you also check whether our search query would pick them up?
index doi
350 10.1371/journal.pone.0187219
405 10.1371/journal.pone.0059840
407 10.1371/journal.pone.0030837
393 10.1371/journal.pone.0118545
310 10.1371/journal.pone.0206678
388 10.1371/journal.pone.0143547
295 10.1371/journal.pone.0220751
398 10.1371/journal.pone.0107216
281 10.1371/journal.pone.0226761
338 10.1371/journal.pone.0193744
PS: I think I've explained the situation in DM, but if you wanna see what I'm seeing check out https://github.com/amchagas/open-hardware-supply/commit/b2cab2793c0ee2cf7b25c802cbe48263344bd787
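The first request above (replacing URLs that already contain a DOI with the DOI itself) can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function name `extract_doi`, the loose DOI pattern, and the example URLs are assumptions made for the sketch.

```python
import re
from typing import Optional
from urllib.parse import unquote

# Loose DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
# Assumption: the suffix ends at whitespace or at a URL query/fragment delimiter.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s?#&\"<>]+")


def extract_doi(address: str) -> Optional[str]:
    """Return the DOI embedded in a DOI-field value, or None if there is none."""
    match = DOI_PATTERN.search(unquote(address.strip()))
    if match is None:
        return None
    # Drop trailing punctuation that often sticks to pasted DOIs.
    return match.group(0).rstrip(".,;)")


# A PLOS-style URL carrying the DOI in its query string (illustrative).
print(extract_doi("https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0187219"))
# A page with no DOI in its address yields None and still needs a manual check.
print(extract_doi("https://example.org/some-blog-post"))
```

Entries for which this returns None would be the ones left for checking by hand.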
Hi @solstag !
I just worked a bit more on the CSV file; it is uploaded to the repo.
To keep track of which entries were missing DOIs, I added a new column, "missing DOIs", and put the missing information there. I managed to get 32 DOIs, but in the process I noticed that some entries were misclassified. For instance, some things that should have been classified as "web articles" were classified as "research articles" (the page showing the GOSH manifesto was one of them). I counted 14 of these misclassifications (note that this number also includes the reverse case: research articles classified as web articles). From what I remember, this classification came from the PLOS people, so I am not sure whether there are more such cases in there.
For the entries that now have a DOI, I found that our query finds some of them and misses others, and some of the misses seem to be due to hyphenation: the papers spell "open-hardware" rather than "open hardware", or "open-source hardware", etc. Maybe we would gain from adding these variants to the query.
The same is true for the PLOS ONE DOIs you listed above. (Also, these DOIs, in their current format, do not lead to the article webpage when I paste them into my browser's address bar.)
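On the last point: a bare DOI is an identifier, not a web address, so it only resolves once it is prefixed with the doi.org resolver. A small sketch, where the helper name `doi_to_url` is made up for illustration:

```python
from urllib.parse import quote


def doi_to_url(doi: str) -> str:
    """Turn a bare DOI into a link that the doi.org resolver can follow."""
    # Keep "/" literal and percent-encode any unusual characters in the suffix.
    return "https://doi.org/" + quote(doi.strip(), safe="/")


print(doi_to_url("10.1371/journal.pone.0187219"))
```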
Ni! Cool, excellent, thanks!
Hm, I would be surprised if Scopus or WOS would not find "open-hardware" from "open hardware", I was assuming they treat hyphens as spaces. Did you confirm that?
OK, so you are correct: both databases handle the difference between "open hardware" and "open-hardware", and so does SciELO.
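For checking records locally, the same hyphen-insensitive behaviour can be imitated along these lines. This is only a sketch that treats hyphens as spaces; it is an assumption about what is convenient for our own checks, not a description of how the databases match internally, and the function names are invented.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, turn hyphens and similar dashes into spaces, collapse whitespace."""
    text = re.sub(r"[-\u2010\u2011\u2013]", " ", text.lower())
    return " ".join(text.split())


def contains_phrase(text: str, phrase: str) -> bool:
    """True if the phrase occurs on word boundaries, ignoring hyphenation and case."""
    pattern = r"\b" + re.escape(normalize(phrase)) + r"\b"
    return re.search(pattern, normalize(text)) is not None


# "open-hardware" and "open hardware" are treated as the same phrase...
print(contains_phrase("An open-hardware microscope", "open hardware"))
# ...but "open-source hardware" is a different phrase and needs its own term.
print(contains_phrase("Open-source hardware for labs", "open hardware"))
```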
I forgot to follow up on my comment: If hyphenation is not the issue for some of them getting found and some not, then what could be the issue? I could not think of anything obvious...
(Could be that the ones that are not found are not indexed in those databases somehow?)
I guess the question is: do we want to do a deep dive into this issue, or do we acknowledge it exists and move on?
The issue is that these open hardware papers may not use any terms that designate open hardware in their title+abstract+keywords, or we might not have the right terms.
The idea would be to make a table for the 10 with:
some results are in :P
from the ten papers I found some keywords that we might want to test:
3 papers use "open source design": ~100 hits at WOS
1 paper uses "open source method": 50 hits at WOS
1 paper uses "open source tool": 1201 hits at WOS (including quite a few software papers)
1 paper uses "open source electronics": 62 hits at WOS
1 paper uses "inexpensive hardware" (using this could bring in a lot of papers that describe affordable solutions that are not necessarily open source)
Other than that, one of the papers does not actually share the data needed for replication and is more focused on a biological question than on describing a tool or method.
Two other papers mention likely keywords only in the introduction: "open hardware design" and "open source hardware".
One paper makes no mention of open source whatsoever, even though the code and design files are nicely placed on GH.
None of these have our terms in the title and/or abstract.
I have saved a table with this info under the /data folder.
A quick look at WOS shows that "open source design" returns about 100 entries. Some are not hardware, but there are also some hardware articles we did not find before.
A possible keyword combination is "open source 3D printed", which returns 20 articles in WOS, all of them about hardware.
Ni! Ok, I've checked the table. It answers question 3 (which other terms the paper uses to indicate that it is about OH), but it doesn't answer questions 1 and 2 (which of our terms are present / whether it would be found by our query or not). So we still don't know whether our current search would catch those. Or am I missing something? Cheeeers
You are not missing anything; I forgot to write it down explicitly. In the ods file there is an "observations" column where I commented on some of the papers, with things like "our keywords are only present in the introduction" (so not in the abstract, title or keywords). For the rows with no observations, our keywords were not found in the paper. This is because the PLOS collection was built either by people submitting things to it (so the authors knew their paper matched the collection requirements, even though they did not use the keywords we are now looking for), or by us finding papers "by chance"...
So, basically, you're saying that none of those 10 papers would show up in our current search? Well that's pretty bad.
EDIT: sorry, paying more attention to the keywords (and not only the observations) I guess two of them would show up because they contain "open hardware". It's still bad though. The upside is that by adding "open source design" we'd cover half of them.
I think it would be useful to check again how many papers in the PLOS collection are not caught by our query. I mean, do these 10 selected papers represent a large number of papers that our query misses, or was it more of a fluke? In any case, we can make a note of that in the writing and try to assess the size of the issue. I think we will have to go through more entries manually anyway, to check a statistically significant subsample of the papers, so we might learn more when we do that...
This was a random sample of "papers from the plos collection having plos DOIs" (:X). So it should be somewhat representative of the plos papers. I'm going to directly check our current data for all the DOIs in X, and see what we find in those papers absent from it. But it's clear we are leaving a lot of - and still possibly most! - papers out.
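The check described here (which DOIs in X are present in our current data, and which are absent) amounts to a set comparison. A minimal sketch, assuming both sides are available as plain lists of DOI strings; the function names and the DOIs in the usage example are placeholders, not real records.

```python
from typing import Iterable, List, Tuple

# Common ways a DOI is written in exported records (assumed list, extend as needed).
PREFIXES = ("https://doi.org/", "http://doi.org/",
            "https://dx.doi.org/", "http://dx.doi.org/", "doi:")


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip resolver prefixes so equal DOIs compare equal."""
    doi = doi.strip().lower()
    for prefix in PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):].strip()
    return doi


def coverage(collection_dois: Iterable[str],
             search_dois: Iterable[str]) -> Tuple[float, List[str]]:
    """Return the share of collection DOIs present in the search results,
    plus the sorted list of collection DOIs that are missing from them."""
    collection = {normalize_doi(d) for d in collection_dois}
    found = collection & {normalize_doi(d) for d in search_dois}
    share = len(found) / len(collection) if collection else 0.0
    return share, sorted(collection - found)


# Placeholder DOIs only, to show the call shape.
share, missing = coverage(
    ["10.1234/A", "doi:10.1234/b", "10.1234/c"],
    ["https://doi.org/10.1234/a", "10.1234/x"],
)
print(f"{share:.0%} covered; missing: {missing}")
```

The list of missing DOIs is what one would then read through to see which terms those papers use instead.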
Ni! Done:
The conclusion is that these numbers make it hard to argue that BIBLIO is representative.
We can try to add "design" to the query and see how this improves, given that in our sample of 10 so many used that. I'm committing that change to the query generator in project_definitions.py. Is it too much work to regenerate the RIS files?
But I'm not very confident that it will improve the situation enough. I'd hope to find at least half of the PLOSC stuff in BIBLIO. I'm trying to think of what else we can do. Maybe we could add "open source method" and "open source electronics" like you suggested. Or include "open source tools", but conditioned on the record also mentioning "hardware" or "electronics".
Cheers!
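The conditional idea above ("open source tools" only when the record also mentions "hardware" or "electronics") could be expressed in a query generator roughly like this. It is a hypothetical sketch and does not reproduce what project_definitions.py actually does; the function names and the term lists in the example are illustrative.

```python
from typing import List


def quoted(term: str) -> str:
    """Wrap a term in double quotes so it is searched as an exact phrase."""
    return f'"{term}"'


def any_of(terms: List[str]) -> str:
    """Join phrases with OR inside parentheses."""
    return "(" + " OR ".join(quoted(t) for t in terms) + ")"


def build_query(standalone: List[str],
                conditional: List[str],
                qualifiers: List[str]) -> str:
    """Standalone terms match on their own; conditional terms only count
    when at least one qualifier also appears in the record."""
    parts = [quoted(t) for t in standalone]
    if conditional and qualifiers:
        parts.append(f"({any_of(conditional)} AND {any_of(qualifiers)})")
    return " OR ".join(parts)


print(build_query(
    standalone=["open hardware", "open source hardware", "open source design"],
    conditional=["open source tool", "open source tools"],
    qualifiers=["hardware", "electronics"],
))
```

Each database would still need its own field wrapper around the resulting string so that the search is limited to title, abstract and keywords.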
Just did a new search using the new definitions for Scopus, WOS and SciELO. For Scopus, I exported only "articles", leaving out "proceedings", "book chapter", etc., since the number of entries was quite high and exports are limited to 2000 entries (I could export every type, but would then have to filter, select, etc.).
Ni! There's something wrong with the new files. Files "scielo.ciw" and "wos1-500.ciw" are the same. Can you check?
Oops! It should be fixed now...
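A mix-up like this one (two export files with identical contents) can be caught automatically by hashing the files before loading them. A minimal sketch, assuming the exports sit together in one folder and share the .ciw extension; the function names are invented for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import List


def file_digest(path: Path, chunk_size: int = 65536) -> str:
    """SHA-256 of a file's contents, read in chunks to keep memory use low."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(folder: str, pattern: str = "*.ciw") -> List[List[str]]:
    """Group file names in `folder` whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for path in sorted(Path(folder).glob(pattern)):
        if path.is_file():
            by_digest[file_digest(path)].append(path.name)
    return [names for names in by_digest.values() if len(names) > 1]
```

Calling `find_duplicates` on the folder that holds the exports returns one list per group of identical files, and an empty list when every file is distinct.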
Ni!
Ok, big mistake: it turns out "open design" brings in a lot of unrelated medical literature, so we have to exclude it and keep only the other combinations with "design" : P
I've checked the 10 sampled articles and found something strange: with the new search, we get articles 2, 3 and 6 (your table index). Those are the articles you tagged with "open source design". But we do not get the articles 7 and 10, respectively with "open hardware" and "open source hardware". That's weird!
So I checked with the earlier data and it seems we weren't getting 7 and 10 before either. Then I went looking into the contents of the articles, and it turns out you marked them using the full-text. The problem is that Scopus and Wos, and probably Scielo as well, only let us search on the title+abstract+keywords.
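The gap between tagging from the full text and searching only title+abstract+keywords can be made explicit with a small helper. This is an illustrative sketch: the `Record` fields, the function names, and the example record are all hypothetical and do not describe any of the sampled papers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Record:
    title: str
    abstract: str
    keywords: List[str] = field(default_factory=list)
    full_text: str = ""


def searchable_text(record: Record, include_full_text: bool = False) -> str:
    """Concatenate the fields a search is allowed to look at."""
    parts = [record.title, record.abstract, *record.keywords]
    if include_full_text:
        parts.append(record.full_text)
    # Newlines keep a phrase from matching across two different fields;
    # hyphens become spaces so "open-source" matches "open source".
    return "\n".join(parts).lower().replace("-", " ")


def found_by(record: Record, terms: List[str],
             include_full_text: bool = False) -> List[str]:
    """Return the search terms that occur in the allowed fields."""
    text = searchable_text(record, include_full_text)
    return [term for term in terms if term.lower() in text]


# Hypothetical record whose only matching phrase is in the introduction.
record = Record(
    title="A low-cost syringe pump",
    abstract="We describe a 3D printed pump.",
    keywords=["microfluidics"],
    full_text="Introduction. Open-source hardware has become common in labs.",
)
terms = ["open hardware", "open source hardware"]
print(found_by(record, terms))                          # title+abstract+keywords only
print(found_by(record, terms, include_full_text=True))  # full text included
```

A paper tagged from its full text can therefore look like a match by hand and still be invisible to a title+abstract+keywords search.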
That could be a different approach, less usual, but perhaps easier. We can say that we limit ourselves to open access publications available through CORE because searching the full text is more reliable, even if CORE coverage is messy to define. We'll miss both paywalled and OA content not in CORE. I checked and Sensors is consistently there because it's in PMC, PLOS seems to depend on institutional repositories but does get indexed full text; HardwareX papers can be found but it's worse as it denies full text indexing.
So, do you think you can play some more with the searches? Maybe we want to include some more search statements like ("open source" AND "electronics"), which will catch papers containing both terms even if they're not contiguous, or even "DIY".
Hugs,
Two perspectives:
In doing this I may be tempted to refactor data loading a little bit.