amchagas / open-hardware-supply

having a closer look on how OSH papers are evolving over time
MIT License

pulling data from WoS throws an error #18

Closed amchagas closed 2 months ago

amchagas commented 5 months ago

Sometimes the method-2 code for pulling metadata from WoS throws an error:

Traceback (most recent call last):
  File "/home/andre/repositories/open-hardware-supply/code/method2-scholarly/1_collect_wos_entries.py", line 37, in <module>
    wosTtr.download_records()
  File "/home/andre/repositories/open-hardware-supply/code/method2-scholarly/get_wos_records_from_api.py", line 175, in download_records
    downloaded_data_keys = set(
  File "/home/andre/repositories/open-hardware-supply/code/method2-scholarly/get_wos_records_from_api.py", line 176, in <genexpr>
    self.data_key(json.loads(x)["scraped"]) for x in mrf.readlines()
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 18904 (char 18903)

Namely the script 1_collect_wos_entries.py

I have dug around a bit to find out what is causing it, but without success: I cannot find anything wrong with the generated files, or with the files used to retrieve data from WoS.
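For reference, the "Extra data" message usually means a line in one of the intermediate files contains more than one JSON value. A minimal sketch to locate such lines (not the project's actual code; it assumes the intermediate files store one JSON object per line, as the per-line json.loads(x) in the traceback suggests, and the file name is just an example):

import json

def find_malformed_lines(path):
    # Report every line that does not parse as a single JSON value.
    # "Extra data" typically means two objects were written onto one line,
    # e.g. by an interrupted or duplicated write.
    bad = []
    with open(path) as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                bad.append((lineno, exc.msg, exc.pos))
    return bad

# Example usage with a hypothetical file name:
for lineno, msg, pos in find_malformed_lines("scraped_entries.jsonl"):
    print(f"line {lineno}: {msg} at char {pos}")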

amchagas commented 4 months ago

After some digging, I realised that this error occurred when the initial files used to feed this part of the analysis were malformed. This traced back to the stage of using Scholarly for scraping data from Google Scholar.

In practice this means that if 0_collect_gscholar_data.py failed or got interrupted, it left behind a "corrupted" entry, which then led to errors when running the next stage of the pipeline.

Given that some keywords return a lot of data to be scraped (which leads to long scraping times and increases the chance of bad entries), I broke the scraping down by year: instead of running the code once to collect all years into a single file, there is now one file for each year per term. In other words, the "open labware" folder has one folder for each year in the interval 2005-2023.

This means each scraping run is shorter and therefore has fewer chances to create entries that do not play well with the next stage of the analysis.
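As an illustration of the idea only (the fetch_entries helper and the folder layout below are hypothetical stand-ins, not the actual 0_collect_gscholar_data.py), splitting the output per term and year looks roughly like this:

import json
from pathlib import Path

YEARS = range(2005, 2024)  # 2005..2023, matching the interval mentioned above

def scrape_term_by_year(term, fetch_entries, outdir="gscholar"):
    # `fetch_entries(term, year)` is a hypothetical stand-in for the scholarly
    # query used in the real script; assume it yields one dict per hit.
    term_dir = Path(outdir) / term.replace(" ", "_")
    term_dir.mkdir(parents=True, exist_ok=True)
    for year in YEARS:
        out_file = term_dir / f"{year}.jsonl"
        with out_file.open("w") as fh:
            for entry in fetch_entries(term, year):
                fh.write(json.dumps(entry) + "\n")  # exactly one object per line
    # If a run is interrupted, only the file for the year currently being
    # written can end up malformed; the other years' files stay valid and do
    # not need to be re-scraped.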

solstag commented 4 months ago

Interesting. Thanks! Though I don't see any changes in the latest commits related to years, other than a strange:

-YEARS = tuple(range(2005, 2024))
+YEARS = tuple(range(2017, 2018))

Is this on a different branch? Looking at the code, it seems to download the data by year since 6a91257b (March 2023).

amchagas commented 4 months ago

Ah, sorry, I'm not expressing myself properly. It does have code to segment the query per year, but everything was being written to the same output file, so if it ran from 2005 to 2010 and then somewhere along year 2011 something got corrupted, the subsequent routine would throw an error.

Since running this one script to scrape all years at once meant a very long run time, it also increased the chance of errors. Given that I wanted to finish collecting this part of the data as soon as possible, I manually ran things on a year-by-year basis (not the best approach, I know), but only for the two terms that had a very large number of hits ("open hardware" and "open source hardware").

solstag commented 4 months ago

I still don't get it. That commit was already storing each year, or each month of each year, in a different file. See these lines: https://github.com/amchagas/open-hardware-supply/commit/6a91257b5712b5f1c7faead6f5db0689f0dec30c#diff-1afc8f2273aece479eb75d16c6127d316bdba61baf91c04885ebc13c027fda97R88 https://github.com/amchagas/open-hardware-supply/commit/6a91257b5712b5f1c7faead6f5db0689f0dec30c#diff-1afc8f2273aece479eb75d16c6127d316bdba61baf91c04885ebc13c027fda97R94

amchagas commented 4 months ago

right, sorry. I am making this super confusing.

I think the mistake in my explanation is about which analysis step was causing problems. Whenever I tried to run 1_collect_wos_entries.py I got the error stated at the beginning of this issue, which I solved (manually 🤦, my bad) by splitting the outputs of the previous script into different folders (within each term, by year, or by range of years if there weren't that many hits).

I should have documented this earlier (🤦) so that things would be fresher in my mind, but I also ran the previous script (0_collect_gscholar_data.py) in smaller steps and stored the results, as mentioned, in different folders. This led to 1_collect_wos_entries.py running properly. I'll take another look and try to remember any other steps I might have forgotten... sorry about this.
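A rough sanity check along these lines (the folder layout and the .jsonl extension are assumptions based on the description above, not the repository's actual names) can confirm that the split files parse cleanly before running 1_collect_wos_entries.py:

import json
from pathlib import Path

def check_scraped_folders(root="gscholar"):
    # Walk the assumed one-folder-per-term, one-file-per-year layout and
    # report any line that does not parse as a single JSON object.
    clean = True
    for path in sorted(Path(root).rglob("*.jsonl")):
        with path.open() as fh:
            for lineno, line in enumerate(fh, start=1):
                line = line.strip()
                if not line:
                    continue
                try:
                    json.loads(line)
                except json.JSONDecodeError as exc:
                    clean = False
                    print(f"{path}: line {lineno}: {exc.msg} at char {exc.pos}")
    return clean

# If this prints nothing and returns True, the next stage should not hit the
# JSONDecodeError quoted at the top of this issue.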

In the worst case, if I have nothing else to add here, one outcome of this change in the analysis workflow is that all collection from GS and WoS is now done.

solstag commented 4 months ago

Ni! No prob! This all seems very good, I was just trying to understand what is going on.

So essentially what happened is that you re-downloaded the data (re-queried Google Scholar) and now there is no more malformed data. Is that it?

And perhaps you believe that downloading with more time passing between each step, by manually intervening for every term or every year, helped avoid errors and thus malformed data?

Then you also split the data into many folders, but I don't see how that could have helped with the malformed data.